BrainBench
ML/LLM Model Evaluation Dashboard
Orion Powers • Daniella Seum
Faculty Advisor: Dr. Slhoub | Florida Institute of Technology
Senior Design • Spring 2026
The Problem
1. Marketing vs. Reality
LLM capabilities are often overstated by vendors, making it hard to know what models can actually do.
2. Inconsistent Benchmarks
Existing benchmarks use different methodologies, making fair comparisons nearly impossible.
3. Cost Barriers
Most evaluation tools focus on expensive commercial models, ignoring free and local options.
Our Solution
BrainBench
A unified evaluation framework that tests LLMs under identical conditions, providing transparent, unbiased comparisons through a user-friendly web dashboard.
What We Measure
Correctness: Accuracy on math problems
Response Time: Speed of model inference
Consistency: Variance across runs
Resource Usage: Hardware requirements
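A minimal sketch of how these four metrics could be aggregated from logged runs; the record fields and values are illustrative, not BrainBench's actual schema.

```python
import statistics

# Hypothetical per-run records; field names and numbers are illustrative only.
runs = [
    {"correct": 41, "total": 50, "seconds": 212.4, "peak_gb": 9.1},
    {"correct": 43, "total": 50, "seconds": 208.9, "peak_gb": 9.1},
    {"correct": 42, "total": 50, "seconds": 215.0, "peak_gb": 9.2},
]

accuracies = [r["correct"] / r["total"] for r in runs]
summary = {
    "correctness": statistics.mean(accuracies),                      # accuracy on math problems
    "response_time_s": statistics.mean(r["seconds"] for r in runs),  # speed of inference
    "consistency": statistics.pstdev(accuracies),                    # spread across runs
    "resource_gb": max(r["peak_gb"] for r in runs),                  # peak hardware requirement
}
print(summary)
```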
Our Approach
1. Reasoning Evaluation
Categorize LLMs based on mathematical reasoning and problem-solving accuracy using standardized datasets and consistent prompting (see the prompt sketch after this list).
2. Performance Analysis
Evaluate practical metrics like response time, consistency, and computational requirements for both local and cloud models.
3. Accessible Results
Present findings through a clean, hosted website with visualizations that make comparisons accessible to all users.
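A minimal sketch of the consistent-prompting idea from step 1; the template wording and helper name are stand-ins, not the project's actual prompt.

```python
# Illustrative only: one fixed template applied identically to every model,
# so score differences reflect the model, not the prompt.
PROMPT_TEMPLATE = (
    "Solve the following math problem. "
    "Give only the final numeric answer after 'Answer:'.\n\n"
    "Problem: {problem}"
)

def build_prompt(problem: str) -> str:
    # Every model receives the exact same string for the same problem.
    return PROMPT_TEMPLATE.format(problem=problem)

print(build_prompt("What is 17 * 24?"))
```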
System Architecture
Dataset Module: XML problem sets
Prompting & Execution: Standardized runs
Parsing & Scoring: Answer extraction
Metrics & Logging: Stats & tracking
Results Storage: Structured output
Web Dashboard: User interface
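A high-level skeleton of how the six modules might chain together; the function names and XML tags mirror the diagram but are placeholders, not BrainBench's actual code.

```python
"""Illustrative pipeline skeleton; all names are placeholders."""
import json
import re
import time
import xml.etree.ElementTree as ET

def load_problems(path):
    # Dataset Module: read <problem> entries from an XML problem set.
    root = ET.parse(path).getroot()
    return [(p.findtext("question"), p.findtext("answer")) for p in root.iter("problem")]

def run_model(model, question):
    # Prompting & Execution: one standardized, timed run ("model" is any callable LLM wrapper).
    start = time.perf_counter()
    response = model(question)
    return response, time.perf_counter() - start

def extract_answer(response):
    # Parsing & Scoring: pull the final number out of the model's text.
    match = re.search(r"Answer:\s*(-?[\d.]+)", response)
    return match.group(1) if match else None

def evaluate(model, path):
    # Metrics & Logging: per-problem correctness and timing.
    results = []
    for question, answer in load_problems(path):
        response, seconds = run_model(model, question)
        results.append({"correct": extract_answer(response) == answer,
                        "seconds": seconds})
    return results

def save_results(results, out_path):
    # Results Storage: structured output the Web Dashboard can read.
    with open(out_path, "w") as f:
        json.dump(results, f, indent=2)
```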
User Interfaces
General Users
Browse dashboard, compare models, and view summarized results without needing to run tests.
Admin Users
Run evaluations, update datasets, and publish results via command-line tools and repository workflows.
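As one illustration of what the admin command-line entry point might look like; the subcommand and flag names are hypothetical, not the project's real tool.

```python
# Hypothetical admin CLI sketch; command names are assumptions.
import argparse

parser = argparse.ArgumentParser(prog="brainbench")
sub = parser.add_subparsers(dest="command", required=True)

run = sub.add_parser("run", help="run an evaluation")
run.add_argument("--model", required=True)
run.add_argument("--dataset", required=True)
run.add_argument("--runs", type=int, default=3)

sub.add_parser("publish", help="publish results to the dashboard")

args = parser.parse_args()
print(args)
```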
Tools & Technologies
Evaluation
HuggingFace Transformers
Custom Python Scripts
Standardized Benchmarks
Web Development
Angular Framework
HTML / CSS / JavaScript
GoJS Visualizations
Collaboration
GitHub Version Control
Google Docs
Team Communication
Technical Challenges
1. Scalability & Repeatability
Maintaining consistent experimental conditions across growing numbers of models and test problems while ensuring reproducibility (see the configuration sketch after this list).
2. Frontend Data Integration
Managing data flow and state synchronization between the backend evaluation pipeline and the responsive frontend interface.
3. Visualization of Complex Results
Presenting detailed evaluation data in an intuitive way that balances technical accuracy with clarity for all user types.
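A minimal sketch of how challenge 1's run conditions could be pinned; the config keys are assumptions, though transformers.set_seed is HuggingFace's real helper for fixing RNG state.

```python
from transformers import set_seed

# Illustrative run configuration for repeatability; keys are assumptions.
RUN_CONFIG = {
    "temperature": 0.0,      # greedy decoding removes sampling variance
    "seed": 1234,            # fixed seed where the backend supports one
    "max_new_tokens": 256,   # identical generation budget for every model
}

set_seed(RUN_CONFIG["seed"])  # seeds Python, NumPy, and PyTorch RNGs
```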
Progress Summary
Dataset Module: 100%
Prompting & Execution Module: 100%
Parsing & Scoring Module: 95%
Metrics & Logging Module: 85%
Results Storage Module: 75%
Web Dashboard Module: 0%
Status legend: Complete • In Progress • Early Stage • Not Started
Milestone 4 • February 23rd
Run all datasets through all LLMs three times
Execute each dataset against all chosen LLMs across three separate runs to capture variability and ensure reliable results.
Interpret correctness results across runs
Analyze correctness outcomes to identify consistency and variance in model performance across executions.
Collect and interpret statistics
Compute average accuracy, variance, and runtime distributions across runs and models for cross-model comparisons (sketched after this list).
Begin website development
Start building the web dashboard with core layout, navigation, and initial data pipeline integration.
All tasks split 50/50 between Daniella and Orion
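A small worked example of the Milestone 4 statistics across three runs; the accuracy and runtime values are made up.

```python
import statistics

# Three runs of one model on one dataset (illustrative values only).
accuracies = [0.82, 0.86, 0.84]
runtimes_s = [212.4, 208.9, 215.0]

print("average accuracy:", statistics.mean(accuracies))     # cross-run mean
print("accuracy variance:", statistics.pvariance(accuracies))
print("runtime min/median/max:",
      min(runtimes_s), statistics.median(runtimes_s), max(runtimes_s))
```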
Project Timeline
Milestone 4 • Feb 23
Milestone 5 • Mar 30
Milestone 6 • Apr 20
Thank You
Questions?