1 of 12

BrainBench

LLM Evaluation Dashboard

Milestone 4 Progress Report

Orion Powers & Daniella Seum

Faculty Advisor: Dr. Khaled Slhoub

Florida Institute of Technology • February 23, 2026

2 of 12

Project Overview

Goal

Evaluate multiple LLMs using a standardized, reproducible testing pipeline to provide users with clear, unbiased comparisons of model capabilities — focusing on free and locally hosted models.

Reasoning

Evaluation

Math accuracy across 3 difficulty levels

Performance

Metrics

Response time, consistency, resource usage

Web

Dashboard

Interactive Angular site with Chart.js visuals

3 of 12

Models & Datasets

Local LLMs Under Evaluation

qwen3:4b

4B parameters • Alibaba Cloud

gemma3:4b

4B parameters • Google DeepMind

phi3:3.8b

3.8B parameters • Microsoft

Math Evaluation Datasets

8th Grade Math

Foundational

Calculus I

Intermediate

Adv. Probability & Stats

Advanced

3 models × 3 datasets × 3 iterations = 27 total evaluation runs

4 of 12

Milestone 4 — Progress Overview

Run All Datasets Through LLMs (×3)

75%

Interpret Correctness Results

75%

Collect & Interpret Statistics

75%

Begin Website Development

50%

All tasks split 50/50 between team members

5 of 12

Task 1: LLM Evaluation Runs

Accomplished

  • Selected 3 comparable models (~4B params each) for fair benchmarking
  • Ran all 3 datasets through each model across 3 iterations
  • Identified and fixed pipeline parsing/verification failures
  • Validated all outputs are saving and logging correctly

Key Obstacle

Verification pipeline had parsing failures on decimal precision and symbolic expressions in the Calculus dataset — required pipeline update and redeployment.
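The kind of fix described above can be sketched as a tolerance-aware answer check: instead of exact string matching, parse the model's answer and compare numerically so that decimal and fractional forms of the same value both verify. This is an illustrative sketch only; `parseAnswer`, `isCorrect`, and the tolerance value are assumptions, not the project's actual pipeline code.

```typescript
// Parse an answer that may be a decimal ("0.25") or a simple fraction ("1/4").
// Returns null when the string is not recognizable as a number.
function parseAnswer(raw: string): number | null {
  const s = raw.trim();
  const frac = s.match(/^(-?\d+(?:\.\d+)?)\s*\/\s*(-?\d+(?:\.\d+)?)$/);
  if (frac) {
    const denom = parseFloat(frac[2]);
    return denom === 0 ? null : parseFloat(frac[1]) / denom;
  }
  const num = parseFloat(s);
  return Number.isNaN(num) ? null : num;
}

// Verify within a mixed absolute/relative tolerance so limited decimal
// precision in a model's output (e.g. "0.3333" for 1/3) still passes.
function isCorrect(modelAnswer: string, expected: string, tol = 1e-4): boolean {
  const a = parseAnswer(modelAnswer);
  const b = parseAnswer(expected);
  if (a === null || b === null) return false;
  return Math.abs(a - b) <= tol * Math.max(1, Math.abs(b));
}
```

Symbolic expressions (e.g. unevaluated derivatives in the Calculus set) would need a computer-algebra comparison on top of this numeric check.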

Remaining Work

Longer-running models are finishing their final passes. The finalized pipeline is deployed and all processed outputs have been verified; completing the remaining runs is now only a matter of runtime.

75%

6 of 12

Pipeline Demo

7 of 12

Tasks 2 & 3: Analysis & Statistics

Correctness Analysis

  • Per-category accuracy rates computed across all 3 math domains
  • Cross-model comparison reveals architectural and training differences (not just size)
  • Consistency analysis: stability of results across 3 iterations per model
  • Final report pending completion of remaining runs
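The per-category accuracy computation above can be sketched as a simple grouped aggregation over evaluation records. The record shape and field names here are illustrative assumptions, not the project's actual schema.

```typescript
// One verified answer from one evaluation run (illustrative schema).
interface EvalRecord {
  model: string;    // e.g. "qwen3:4b"
  dataset: string;  // e.g. "calculus-1"
  correct: boolean;
}

// Accuracy per (model, dataset) pair, pooled across all iterations.
function accuracyByCategory(records: EvalRecord[]): Map<string, number> {
  const totals = new Map<string, { right: number; n: number }>();
  for (const r of records) {
    const key = `${r.model}|${r.dataset}`;
    const t = totals.get(key) ?? { right: 0, n: 0 };
    t.right += r.correct ? 1 : 0;
    t.n += 1;
    totals.set(key, t);
  }
  const out = new Map<string, number>();
  for (const [key, t] of totals) out.set(key, t.right / t.n);
  return out;
}
```

Consistency across the 3 iterations falls out of the same grouping by keeping per-iteration accuracies instead of pooling them.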

Performance Statistics

  • Average response time per problem, model, and dataset category
  • Verification confidence scores — how reliably models format answers
  • Timing comparisons meaningful due to comparable parameter counts
  • Aggregation and export infrastructure already in place
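The timing statistics above reduce to a mean and spread per model and category; a minimal sketch (function name and use of sample standard deviation are assumptions):

```typescript
// Mean and sample standard deviation of per-problem response times (ms).
// The standard deviation captures how consistent a model's latency is.
function timingStats(timesMs: number[]): { mean: number; std: number } {
  const n = timesMs.length;
  const mean = timesMs.reduce((a, b) => a + b, 0) / n;
  const variance =
    n > 1 ? timesMs.reduce((a, t) => a + (t - mean) ** 2, 0) / (n - 1) : 0;
  return { mean, std: Math.sqrt(variance) };
}
```

Because the three models are all ~4B parameters, differences in these numbers reflect architecture and runtime behavior rather than raw size.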

Both tasks at 75% completion

8 of 12

Task 4: Website Development

Angular

TypeScript

Chart.js

Visualization

Dashboard

Components

Completed

  • Angular project structure set up
  • Core dashboard component scaffolding
  • Chart.js integration for bar charts and line graphs
  • Visual layout built with mock data from partial results
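The Chart.js integration above can be sketched as a pure function that turns model results into a standard Chart.js bar-chart configuration. The object shape follows Chart.js's documented config; the function name and mock values are illustrative, not the project's component code.

```typescript
// Per-model accuracy result (illustrative shape).
interface ModelAccuracy {
  model: string;
  accuracy: number; // 0..1
}

// Build a Chart.js bar-chart config comparing model accuracies (as percentages).
function buildBarConfig(results: ModelAccuracy[]) {
  return {
    type: "bar" as const,
    data: {
      labels: results.map((r) => r.model),
      datasets: [
        {
          label: "Accuracy (%)",
          data: results.map((r) => Math.round(r.accuracy * 100)),
        },
      ],
    },
    options: { scales: { y: { beginAtZero: true, max: 100 } } },
  };
}
```

In the Angular component, this config would be handed to `new Chart(ctx, config)` against a canvas element.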

Next Steps

  • Automated pipeline-to-website data integration
  • Replace mock data with live results
  • Expand analytical features and comparison views
  • Polish UI for public-facing presentation
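The pipeline-to-website integration step above could work by having the pipeline export a JSON file that the dashboard reshapes into chart series: one series per model, grouped by dataset. The row schema, types, and function name here are assumptions for illustration.

```typescript
// One exported result row from the pipeline (assumed schema).
interface ResultRow {
  model: string;
  dataset: string;
  accuracy: number;
}

// One chart series: a model's accuracy across datasets, in display order.
interface Series {
  label: string;  // model name
  data: number[]; // one accuracy per entry of `datasets`
}

function toSeries(rows: ResultRow[], datasets: string[]): Series[] {
  const byModel = new Map<string, Map<string, number>>();
  for (const row of rows) {
    if (!byModel.has(row.model)) byModel.set(row.model, new Map());
    byModel.get(row.model)!.set(row.dataset, row.accuracy);
  }
  return [...byModel.entries()].map(([model, accs]) => ({
    label: model,
    data: datasets.map((d) => accs.get(d) ?? 0),
  }));
}
```

Replacing the mock data then becomes a matter of fetching the exported file and feeding `toSeries` output into the existing chart components.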

50%

9 of 12

Website Demo

10 of 12

Team Contributions

Orion Powers

  • Led website development — Angular project setup, TypeScript scaffolding, and component architecture
  • Integrated Chart.js for model comparison visualizations (bar charts, line graphs)
  • Built visual layout and component logic using mock data from partial results
  • Collaborated on cross-model correctness and performance analysis

Daniella Seum

  • Led pipeline debugging — diagnosed parsing and verification failures in edge cases
  • Updated extraction and normalization logic for decimal precision and symbolic expressions
  • Re-validated fixes and redeployed corrected pipeline across all models
  • Collaborated on cross-model correctness and consistency analysis

11 of 12

Milestone 5 Plan

Due March 30, 2026

01

Complete All LLM Runs

Finish remaining batch runs with the finalized pipeline across all 3 models and datasets. Perform final validation pass.

02

Finalize Website

Transition from mock data to live pipeline results. Complete remaining dashboard views and automated data integration.

03

System Evaluation & Analysis

Comprehensive analysis of accuracy, consistency, timing trends. Evaluate pipeline effectiveness and synthesize findings.

04

Senior Design Poster

Design and produce showcase poster covering goals, methodology, architecture, and key benchmarking results.

All tasks split 50/50 between Orion and Daniella

12 of 12

Thank You

Questions?

Orion Powers • opowers2023@my.fit.edu

Daniella Seum • dseum2023@my.fit.edu

Faculty Advisor: Dr. Khaled Slhoub