BrainBench
LLM Evaluation Dashboard
Milestone 4 Progress Report
Orion Powers & Daniella Seum
Faculty Advisor: Dr. Khaled Slhoub
Florida Institute of Technology • February 23, 2026
Project Overview
Goal
Evaluate multiple LLMs using a standardized, reproducible testing pipeline to provide users with clear, unbiased comparisons of model capabilities — focusing on free and locally hosted models.
Reasoning Evaluation: Math accuracy across 3 difficulty levels
Performance Metrics: Response time, consistency, resource usage
Web Dashboard: Interactive Angular site with Chart.js visuals
Models & Datasets
Local LLMs Under Evaluation
qwen3:4b
4B parameters • Alibaba Cloud
gemma3:4b
4B parameters • Google DeepMind
phi3:3.8b
3.8B parameters • Microsoft
Math Evaluation Datasets
8th Grade Math
Foundational
Calculus I
Intermediate
Adv. Probability & Stats
Advanced
3 models × 3 datasets × 3 iterations = 27 total evaluation runs
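The run count can be checked by enumerating every model, dataset, and iteration combination. The sketch below is illustrative only: the model tags and dataset names come from the lists above, while the `EvalRun` shape and `buildRunMatrix` helper are hypothetical and not the project's actual pipeline code.

```typescript
// Model tags and dataset names are taken from the slides.
const MODELS = ["qwen3:4b", "gemma3:4b", "phi3:3.8b"] as const;
const DATASETS = ["8th Grade Math", "Calculus I", "Adv. Probability & Stats"] as const;
const ITERATIONS = 3;

// Hypothetical record describing one evaluation run.
interface EvalRun {
  model: string;
  dataset: string;
  iteration: number; // 1-based
}

// Enumerate every (model, dataset, iteration) combination.
function buildRunMatrix(): EvalRun[] {
  const runs: EvalRun[] = [];
  for (const model of MODELS) {
    for (const dataset of DATASETS) {
      for (let iteration = 1; iteration <= ITERATIONS; iteration++) {
        runs.push({ model, dataset, iteration });
      }
    }
  }
  return runs;
}

const runMatrix = buildRunMatrix(); // 3 models × 3 datasets × 3 iterations = 27 entries
```

Keeping the matrix explicit also makes it easy to see which runs are still pending when a batch is interrupted.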
Milestone 4 — Progress Overview
Run All Datasets Through LLMs (×3): 75%
Interpret Correctness Results: 75%
Collect & Interpret Statistics: 75%
Begin Website Development: 50%
All tasks split 50/50 between team members
Task 1: LLM Evaluation Runs
Accomplished
Key Obstacle
The verification pipeline failed to parse answers involving decimal precision and symbolic expressions in the Calculus dataset. Fixing this required a pipeline update and redeployment.
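One common way to make answer checking robust to decimal precision is to compare numeric answers within a tolerance and fall back to a normalized string comparison for everything else. The sketch below assumes that approach. `parseNumeric`, `answersMatch`, and the default tolerance are hypothetical names and values, not the project's actual verification code, and true symbolic equivalence would need a computer algebra system rather than string matching.

```typescript
// Hypothetical helper: parse a plain decimal ("0.75") or a simple fraction ("3/4").
// Returns null for anything else, such as a symbolic expression.
function parseNumeric(text: string): number | null {
  const cleaned = text.trim().replace(/,/g, "");
  const fraction = /^(-?\d+(?:\.\d+)?)\s*\/\s*(-?\d+(?:\.\d+)?)$/.exec(cleaned);
  if (fraction) {
    const denominator = Number(fraction[2]);
    return denominator === 0 ? null : Number(fraction[1]) / denominator;
  }
  if (/^-?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?$/.test(cleaned)) {
    return Number(cleaned);
  }
  return null;
}

// Hypothetical comparison: tolerance-based for numbers, whitespace-insensitive otherwise.
function answersMatch(expected: string, actual: string, relTol = 1e-6): boolean {
  const e = parseNumeric(expected);
  const a = parseNumeric(actual);
  if (e !== null && a !== null) {
    // Scale the tolerance by the larger magnitude, with a floor of 1 for values near zero.
    const scale = Math.max(Math.abs(e), Math.abs(a), 1);
    return Math.abs(e - a) <= relTol * scale;
  }
  // Fallback: exact match after removing whitespace (not true symbolic equivalence).
  const normalize = (s: string) => s.replace(/\s+/g, "");
  return normalize(expected) === normalize(actual);
}
```

With this sketch, `answersMatch("3/4", "0.750")` is true, while a rounded answer such as `"0.3333"` matches `"1/3"` only if the caller loosens `relTol`.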
Remaining Work
Longer-running models are finishing their final passes. The finalized pipeline is deployed and all outputs processed so far have been verified; the remaining runs only need runtime to complete.
75%
Pipeline Demo
Tasks 2 & 3: Analysis & Statistics
Correctness Analysis
Performance Statistics
Both tasks at 75% completion
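As a rough illustration of how consistency across the 3 iterations could be summarized, the sketch below computes the mean and sample standard deviation of a per-iteration metric such as accuracy or response time. The `Summary` type and `summarize` function are hypothetical, and using standard deviation as the consistency measure is an assumption; the slides do not state which statistic the project uses.

```typescript
// Hypothetical summary of one metric across repeated runs.
interface Summary {
  mean: number;
  stdDev: number; // sample standard deviation (n - 1 in the denominator)
}

function summarize(values: number[]): Summary {
  if (values.length === 0) {
    throw new Error("summarize needs at least one value");
  }
  const mean = values.reduce((sum, v) => sum + v, 0) / values.length;
  if (values.length === 1) {
    return { mean, stdDev: 0 }; // spread is undefined for one run; report 0
  }
  const squaredDeviations = values.reduce((sum, v) => sum + (v - mean) * (v - mean), 0);
  return { mean, stdDev: Math.sqrt(squaredDeviations / (values.length - 1)) };
}
```

Called once per model and dataset pair with the three per-iteration values, a lower `stdDev` would indicate more consistent behavior across iterations.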
Task 4: Website Development
Angular TypeScript
Chart.js Visualization
Dashboard Components
Completed
Next Steps
50%
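A grouped bar chart of accuracy by dataset and model is one plausible dashboard view. The sketch below builds a plain configuration object in the shape Chart.js expects for a bar chart (`type`, `data.labels`, `data.datasets`, `options.scales.y`). The `AccuracyRow` and `BarChartConfig` types and the `buildAccuracyChartConfig` function are hypothetical, and the local config type is a simplified stand-in so the block does not depend on the Chart.js package.

```typescript
// Hypothetical row produced by the evaluation pipeline (or by mock data).
interface AccuracyRow {
  model: string;
  dataset: string;
  accuracyPct: number; // 0 to 100
}

// Simplified local stand-in for the parts of a Chart.js bar config used here.
interface BarChartConfig {
  type: "bar";
  data: {
    labels: string[];
    datasets: { label: string; data: (number | null)[] }[];
  };
  options: { scales: { y: { beginAtZero: boolean; max: number } } };
}

function buildAccuracyChartConfig(rows: AccuracyRow[]): BarChartConfig {
  // Collect distinct models and datasets in first-seen order.
  const models: string[] = [];
  const datasetNames: string[] = [];
  for (const row of rows) {
    if (models.indexOf(row.model) === -1) models.push(row.model);
    if (datasetNames.indexOf(row.dataset) === -1) datasetNames.push(row.dataset);
  }
  return {
    type: "bar",
    data: {
      labels: datasetNames, // one group of bars per dataset
      datasets: models.map((model) => ({
        label: model, // one bar series per model
        data: datasetNames.map((dataset) => {
          const match = rows.filter((r) => r.model === model && r.dataset === dataset)[0];
          return match ? match.accuracyPct : null; // null leaves a gap for missing results
        }),
      })),
    },
    options: { scales: { y: { beginAtZero: true, max: 100 } } },
  };
}
```

In an Angular component, an object like this could be passed to `new Chart(canvas, config)`. Because the function takes only rows, the same code would work with mock data or with pipeline results.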
Website Demo
Team Contributions
Orion Powers
Daniella Seum
Milestone 5 Plan
Due March 30, 2026
01
Complete All LLM Runs
Finish remaining batch runs with the finalized pipeline across all 3 models and datasets. Perform final validation pass.
02
Finalize Website
Transition from mock data to live pipeline results. Complete remaining dashboard views and automated data integration.
03
System Evaluation & Analysis
Comprehensive analysis of accuracy, consistency, and timing trends. Evaluate pipeline effectiveness and synthesize findings.
04
Senior Design Poster
Design and produce showcase poster covering goals, methodology, architecture, and key benchmarking results.
All tasks split 50/50 between Orion and Daniella
Thank You
Questions?
Orion Powers • opowers2023@my.fit.edu
Daniella Seum • dseum2023@my.fit.edu