1 of 12

BrainBench

LLM Evaluation Dashboard

Milestone 4 Progress Report

Orion Powers & Daniella Seum

Faculty Advisor: Dr. Khaled Slhoub

Florida Institute of Technology • February 23, 2026

2 of 12

Project Overview

Goal

Evaluate multiple LLMs using a standardized, reproducible testing pipeline to provide users with clear, unbiased comparisons of model capabilities — focusing on free and locally hosted models.

Reasoning

Evaluation

Math accuracy across 3 difficulty levels

Performance

Metrics

Response time, consistency, resource usage

Web

Dashboard

Interactive Angular site with Chart.js visuals

3 of 12

Models & Datasets

Local LLMs Under Evaluation

qwen3:4b

4B parameters • Alibaba Cloud

gemma3:4b

4B parameters • Google DeepMind

phi3:3.8b

3.8B parameters • Microsoft

Math Evaluation Datasets

8th Grade Math

Foundational

Calculus I

Intermediate

Adv. Probability & Stats

Advanced

3 models × 3 datasets × 3 iterations = 27 total evaluation runs

4 of 12

Milestone 4 — Progress Overview

Run All Datasets Through LLMs (×3)

75%

Interpret Correctness Results

75%

Collect & Interpret Statistics

75%

Begin Website Development

50%

All tasks split 50/50 between team members

5 of 12

Task 1: LLM Evaluation Runs

Accomplished

  • Selected 3 comparable models (~4B params each) for fair benchmarking
  • Ran all 3 datasets through each model across 3 iterations
  • Identified and fixed pipeline parsing/verification failures
  • Validated all outputs are saving and logging correctly

Key Obstacle

Verification pipeline had parsing failures on decimal precision and symbolic expressions in the Calculus dataset — required pipeline update and redeployment.
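The kind of fix described above can be sketched as a tolerance-aware answer check: instead of exact string matching, parse the model's answer and compare numerically so that decimal and fractional forms of the same value both verify. This is an illustrative sketch only; `parseAnswer`, `isCorrect`, and the tolerance value are assumptions, not the project's actual pipeline code.

```typescript
// Parse an answer that may be a decimal ("0.25") or a simple fraction ("1/4").
// Returns null when the string is not recognizable as a number.
function parseAnswer(raw: string): number | null {
  const s = raw.trim();
  const frac = s.match(/^(-?\d+(?:\.\d+)?)\s*\/\s*(-?\d+(?:\.\d+)?)$/);
  if (frac) {
    const denom = parseFloat(frac[2]);
    return denom === 0 ? null : parseFloat(frac[1]) / denom;
  }
  const num = parseFloat(s);
  return Number.isNaN(num) ? null : num;
}

// Verify within a mixed absolute/relative tolerance so limited decimal
// precision in a model's output (e.g. "0.3333" for 1/3) still passes.
function isCorrect(modelAnswer: string, expected: string, tol = 1e-4): boolean {
  const a = parseAnswer(modelAnswer);
  const b = parseAnswer(expected);
  if (a === null || b === null) return false;
  return Math.abs(a - b) <= tol * Math.max(1, Math.abs(b));
}
```

Symbolic expressions (e.g. unevaluated derivatives in the Calculus set) would need a computer-algebra comparison on top of this numeric check.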

Remaining Work

Longer-running models are finishing their final passes. The finalized pipeline is deployed and all processed outputs have been verified; completing the remaining runs is now only a matter of runtime.

75%

6 of 12

Pipeline Demo

7 of 12

Tasks 2 & 3: Analysis & Statistics

Correctness Analysis

  • Per-category accuracy rates computed across all 3 math domains
  • Cross-model comparison reveals architectural and training differences (not just size)
  • Consistency analysis: stability of results across 3 iterations per model
  • Final report pending completion of remaining runs
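The per-category accuracy computation above can be sketched as a simple grouped aggregation over evaluation records. The record shape and field names here are illustrative assumptions, not the project's actual schema.

```typescript
// One verified answer from one evaluation run (illustrative schema).
interface EvalRecord {
  model: string;    // e.g. "qwen3:4b"
  dataset: string;  // e.g. "calculus-1"
  correct: boolean;
}

// Accuracy per (model, dataset) pair, pooled across all iterations.
function accuracyByCategory(records: EvalRecord[]): Map<string, number> {
  const totals = new Map<string, { right: number; n: number }>();
  for (const r of records) {
    const key = `${r.model}|${r.dataset}`;
    const t = totals.get(key) ?? { right: 0, n: 0 };
    t.right += r.correct ? 1 : 0;
    t.n += 1;
    totals.set(key, t);
  }
  const out = new Map<string, number>();
  for (const [key, t] of totals) out.set(key, t.right / t.n);
  return out;
}
```

Consistency across the 3 iterations falls out of the same grouping by keeping per-iteration accuracies instead of pooling them.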

Performance Statistics

  • Average response time per problem, model, and dataset category
  • Verification confidence scores — how reliably models format answers
  • Timing comparisons meaningful due to comparable parameter counts
  • Aggregation and export infrastructure already in place
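The timing statistics above reduce to a mean and spread per model and category; a minimal sketch (function name and use of sample standard deviation are assumptions):

```typescript
// Mean and sample standard deviation of per-problem response times (ms).
// The standard deviation captures how consistent a model's latency is.
function timingStats(timesMs: number[]): { mean: number; std: number } {
  const n = timesMs.length;
  const mean = timesMs.reduce((a, b) => a + b, 0) / n;
  const variance =
    n > 1 ? timesMs.reduce((a, t) => a + (t - mean) ** 2, 0) / (n - 1) : 0;
  return { mean, std: Math.sqrt(variance) };
}
```

Because the three models are all ~4B parameters, differences in these numbers reflect architecture and runtime behavior rather than raw size.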

Both tasks at 75% completion

8 of 12

Task 4: Website Development

Angular

TypeScript

Chart.js

Visualization

Dashboard

Components

Completed

  • Angular project structure set up
  • Core dashboard component scaffolding
  • Chart.js integration for bar charts and line graphs
  • Visual layout built with mock data from partial results
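The Chart.js integration above can be sketched as a pure function that turns model results into a standard Chart.js bar-chart configuration. The object shape follows Chart.js's documented config; the function name and mock values are illustrative, not the project's component code.

```typescript
// Per-model accuracy result (illustrative shape).
interface ModelAccuracy {
  model: string;
  accuracy: number; // 0..1
}

// Build a Chart.js bar-chart config comparing model accuracies (as percentages).
function buildBarConfig(results: ModelAccuracy[]) {
  return {
    type: "bar" as const,
    data: {
      labels: results.map((r) => r.model),
      datasets: [
        {
          label: "Accuracy (%)",
          data: results.map((r) => Math.round(r.accuracy * 100)),
        },
      ],
    },
    options: { scales: { y: { beginAtZero: true, max: 100 } } },
  };
}
```

In the Angular component, this config would be handed to `new Chart(ctx, config)` against a canvas element.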

Next Steps

  • Automated pipeline-to-website data integration
  • Replace mock data with live results
  • Expand analytical features and comparison views
  • Polish UI for public-facing presentation
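The pipeline-to-website integration step above could work by having the pipeline export a JSON file that the dashboard reshapes into chart series: one series per model, grouped by dataset. The row schema, types, and function name here are assumptions for illustration.

```typescript
// One exported result row from the pipeline (assumed schema).
interface ResultRow {
  model: string;
  dataset: string;
  accuracy: number;
}

// One chart series: a model's accuracy across datasets, in display order.
interface Series {
  label: string;  // model name
  data: number[]; // one accuracy per entry of `datasets`
}

function toSeries(rows: ResultRow[], datasets: string[]): Series[] {
  const byModel = new Map<string, Map<string, number>>();
  for (const row of rows) {
    if (!byModel.has(row.model)) byModel.set(row.model, new Map());
    byModel.get(row.model)!.set(row.dataset, row.accuracy);
  }
  return [...byModel.entries()].map(([model, accs]) => ({
    label: model,
    data: datasets.map((d) => accs.get(d) ?? 0),
  }));
}
```

Replacing the mock data then becomes a matter of fetching the exported file and feeding `toSeries` output into the existing chart components.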

50%

9 of 12

Website Demo

10 of 12

Team Contributions

Orion Powers

  • Led website development — Angular project setup, TypeScript scaffolding, and component architecture
  • Integrated Chart.js for model comparison visualizations (bar charts, line graphs)
  • Built visual layout and component logic using mock data from partial results
  • Collaborated on cross-model correctness and performance analysis

Daniella Seum

  • Led pipeline debugging — diagnosed parsing and verification failures in edge cases
  • Updated extraction and normalization logic for decimal precision and symbolic expressions
  • Re-validated fixes and redeployed corrected pipeline across all models
  • Collaborated on cross-model correctness and consistency analysis

11 of 12

Milestone 5 Plan

Due March 30, 2026

01

Complete All LLM Runs

Finish remaining batch runs with the finalized pipeline across all 3 models and datasets. Perform final validation pass.

02

Finalize Website

Transition from mock data to live pipeline results. Complete remaining dashboard views and automated data integration.

03

System Evaluation & Analysis

Comprehensive analysis of accuracy, consistency, timing trends. Evaluate pipeline effectiveness and synthesize findings.

04

Senior Design Poster

Design and produce showcase poster covering goals, methodology, architecture, and key benchmarking results.

All tasks split 50/50 between Orion and Daniella

12 of 12

Thank You

Questions?

Orion Powers • opowers2023@my.fit.edu

Daniella Seum • dseum2023@my.fit.edu

Faculty Advisor: Dr. Khaled Slhoub