BrainBench
ML/LLM Model Evaluation Dashboard
Orion Powers • Daniella Seum
Faculty Advisor: Dr. Slhoub | Florida Institute of Technology
Senior Design • Spring 2026
The Problem
1. Marketing vs. Reality
LLM capabilities are often overstated by vendors, making it hard to know what models can actually do.
2. Inconsistent Benchmarks
Existing benchmarks use different methodologies, making fair comparisons nearly impossible.
3. Cost Barriers
Most evaluation tools focus on expensive commercial models, ignoring free and local options.
Our Solution
BrainBench
A unified evaluation framework that tests LLMs under identical conditions, providing transparent, unbiased comparisons through a user-friendly web dashboard.
What We Measure
Correctness: Accuracy on math problems
Response Time: Speed of model inference
Consistency: Variance across runs
Resource Usage: Hardware requirements
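A minimal sketch of how these four metrics could be aggregated from logged runs; the record fields and values are illustrative, not BrainBench's actual schema.

```python
import statistics

# Hypothetical per-run records; field names and numbers are illustrative only.
runs = [
    {"correct": 41, "total": 50, "seconds": 212.4, "peak_gb": 9.1},
    {"correct": 43, "total": 50, "seconds": 208.9, "peak_gb": 9.1},
    {"correct": 42, "total": 50, "seconds": 215.0, "peak_gb": 9.2},
]

accuracies = [r["correct"] / r["total"] for r in runs]
summary = {
    "correctness": statistics.mean(accuracies),                      # accuracy on math problems
    "response_time_s": statistics.mean(r["seconds"] for r in runs),  # speed of inference
    "consistency": statistics.pstdev(accuracies),                    # spread across runs
    "resource_gb": max(r["peak_gb"] for r in runs),                  # peak hardware requirement
}
print(summary)
```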
Our Approach
1. Reasoning Evaluation
Categorize LLMs based on mathematical reasoning and problem-solving accuracy using standardized datasets and consistent prompting (see the prompt sketch after this list).
2. Performance Analysis
Evaluate practical metrics like response time, consistency, and computational requirements for both local and cloud models.
3. Accessible Results
Present findings through a clean, hosted website with visualizations that make comparisons accessible to all users.
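A minimal sketch of the consistent-prompting idea from step 1; the template wording and helper name are stand-ins, not the project's actual prompt.

```python
# Illustrative only: one fixed template applied identically to every model,
# so score differences reflect the model, not the prompt.
PROMPT_TEMPLATE = (
    "Solve the following math problem. "
    "Give only the final numeric answer after 'Answer:'.\n\n"
    "Problem: {problem}"
)

def build_prompt(problem: str) -> str:
    # Every model receives the exact same string for the same problem.
    return PROMPT_TEMPLATE.format(problem=problem)

print(build_prompt("What is 17 * 24?"))
```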
System Architecture
Dataset Module: XML problem sets
Prompting & Execution: Standardized runs
Parsing & Scoring: Answer extraction
Metrics & Logging: Stats & tracking
Results Storage: Structured output
Web Dashboard: User interface
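A high-level skeleton of how the six modules might chain together; the function names and XML tags mirror the diagram but are placeholders, not BrainBench's actual code.

```python
"""Illustrative pipeline skeleton; all names are placeholders."""
import json
import re
import time
import xml.etree.ElementTree as ET

def load_problems(path):
    # Dataset Module: read <problem> entries from an XML problem set.
    root = ET.parse(path).getroot()
    return [(p.findtext("question"), p.findtext("answer")) for p in root.iter("problem")]

def run_model(model, question):
    # Prompting & Execution: one standardized, timed run ("model" is any callable LLM wrapper).
    start = time.perf_counter()
    response = model(question)
    return response, time.perf_counter() - start

def extract_answer(response):
    # Parsing & Scoring: pull the final number out of the model's text.
    match = re.search(r"Answer:\s*(-?[\d.]+)", response)
    return match.group(1) if match else None

def evaluate(model, path):
    # Metrics & Logging: per-problem correctness and timing.
    results = []
    for question, answer in load_problems(path):
        response, seconds = run_model(model, question)
        results.append({"correct": extract_answer(response) == answer,
                        "seconds": seconds})
    return results

def save_results(results, out_path):
    # Results Storage: structured output the Web Dashboard can read.
    with open(out_path, "w") as f:
        json.dump(results, f, indent=2)
```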
User Interfaces
General Users
Browse dashboard, compare models, and view summarized results without needing to run tests.
Admin Users
Run evaluations, update datasets, and publish results via command-line tools and repository workflows.
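As one illustration of what the admin command-line entry point might look like; the subcommand and flag names are hypothetical, not the project's real tool.

```python
# Hypothetical admin CLI sketch; command names are assumptions.
import argparse

parser = argparse.ArgumentParser(prog="brainbench")
sub = parser.add_subparsers(dest="command", required=True)

run = sub.add_parser("run", help="run an evaluation")
run.add_argument("--model", required=True)
run.add_argument("--dataset", required=True)
run.add_argument("--runs", type=int, default=3)

sub.add_parser("publish", help="publish results to the dashboard")

args = parser.parse_args()
print(args)
```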
Tools & Technologies
Evaluation
HuggingFace Transformers
Custom Python Scripts
Standardized Benchmarks
Web Development
Angular Framework
HTML / CSS / JavaScript
GoJS Visualizations
Collaboration
GitHub Version Control
Google Docs
Team Communication
Technical Challenges
1. Scalability & Repeatability
Maintaining consistent experimental conditions across growing numbers of models and test problems while ensuring reproducibility (see the configuration sketch after this list).
2. Frontend Data Integration
Managing data flow and state synchronization between the backend evaluation pipeline and the responsive frontend interface.
3. Visualization of Complex Results
Presenting detailed evaluation data in an intuitive way that balances technical accuracy with clarity for all user types.
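A minimal sketch of how challenge 1's run conditions could be pinned; the config keys are assumptions, though transformers.set_seed is HuggingFace's real helper for fixing RNG state.

```python
from transformers import set_seed

# Illustrative run configuration for repeatability; keys are assumptions.
RUN_CONFIG = {
    "temperature": 0.0,      # greedy decoding removes sampling variance
    "seed": 1234,            # fixed seed where the backend supports one
    "max_new_tokens": 256,   # identical generation budget for every model
}

set_seed(RUN_CONFIG["seed"])  # seeds Python, NumPy, and PyTorch RNGs
```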
Progress Summary
Dataset Module: 100%
Prompting & Execution Module: 100%
Parsing & Scoring Module: 95%
Metrics & Logging Module: 85%
Results Storage Module: 75%
Web Dashboard Module: 0%
Status legend: Complete • In Progress • Early Stage • Not Started
Milestone 4 • February 23rd
Run all datasets through all LLMs three times
Execute each dataset against all chosen LLMs across three separate runs to capture variability and ensure reliable results.
Interpret correctness results across runs
Analyze correctness outcomes to identify consistency and variance in model performance across executions.
Collect and interpret statistics
Compute average accuracy, variance, and runtime distributions across runs and models for cross-model comparisons (sketched after this list).
Begin website development
Start building the web dashboard with core layout, navigation, and initial data pipeline integration.
All tasks split 50/50 between Daniella and Orion
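A small worked example of the Milestone 4 statistics across three runs; the accuracy and runtime values are made up.

```python
import statistics

# Three runs of one model on one dataset (illustrative values only).
accuracies = [0.82, 0.86, 0.84]
runtimes_s = [212.4, 208.9, 215.0]

print("average accuracy:", statistics.mean(accuracies))     # cross-run mean
print("accuracy variance:", statistics.pvariance(accuracies))
print("runtime min/median/max:",
      min(runtimes_s), statistics.median(runtimes_s), max(runtimes_s))
```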
Project Timeline
Milestone 4 • Feb 23
Milestone 5 • Mar 30
Milestone 6 • Apr 20
Thank You
Questions?