
BrainBench - LLM Evaluation

Daniella Seum and Orion Powers

Faculty Advisor: Dr. Khaled Slhoub, Dept. of Electrical Engineering and Computer Science, Florida Institute of Technology

MOTIVATION

As large language models become more widely used, there is a growing need for reliable and consistent methods to evaluate their performance. Existing benchmarks often lack transparency, consistency across runs, or the ability to handle varied answer formats. BrainBench was developed to address these gaps by providing an automated, reproducible evaluation framework that enables fair comparison of models and deeper insight into their strengths and limitations.

METHODS

  • Compiled 3 math datasets of varying difficulty: 8th Grade Math, Calculus I, and Advanced Probability and Statistics
  • Evaluated each model on all 3 datasets, running 3 iterations per dataset
  • Problems were automatically fed through the pipeline, with responses parsed, verified, and logged without manual intervention
  • Extracted final answers via hierarchical keyword and LaTeX parsing, then normalized numeric and symbolic formats (see the parsing sketch after this list)
  • Visualized all results in an Angular + Chart.js dashboard with cross-model comparison views
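
The answer-extraction step can be pictured with a short, hedged sketch. The poster specifies only "hierarchical keyword and LaTeX parsing, then normalization", so the function names and regex patterns below are illustrative assumptions, not BrainBench's actual code:

```python
import re
from fractions import Fraction

# Hierarchical extraction sketch: try an explicit answer keyword first,
# then LaTeX \boxed{}, then fall back to the last number in the text.
# The patterns are assumptions, not BrainBench's actual rules.
KEYWORD = re.compile(r"(?:final answer|answer)\s*[:=]\s*(.+)", re.IGNORECASE)
BOXED = re.compile(r"\\boxed\{([^{}]+)\}")
NUMBER = re.compile(r"-?\d+(?:\.\d+)?(?:/\d+)?")

def extract_answer(response: str) -> str | None:
    for pattern in (KEYWORD, BOXED):
        match = pattern.search(response)
        if match:
            return match.group(1).strip()
    numbers = NUMBER.findall(response)
    return numbers[-1] if numbers else None

def normalize(answer: str) -> str:
    """Map numeric and simple symbolic forms to one canonical string."""
    cleaned = answer.replace("$", "").replace(",", "").strip().rstrip(".")
    try:
        return str(Fraction(cleaned))  # "0.5" and "1/2" both become "1/2"
    except (ValueError, ZeroDivisionError):
        return cleaned.lower()         # non-numeric answers compared as text

def is_correct(response: str, expected: str) -> bool:
    extracted = extract_answer(response)
    return extracted is not None and normalize(extracted) == normalize(expected)
```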

LOCAL LLMS

  • Local models eliminate API costs and allow unlimited test runs on personal hardware
  • All 3 models run at comparable parameter scales, isolating architectural differences rather than size
  • Selected Models (run locally; see the query sketch after this list):
    • qwen3:4b - Alibaba; known for strong STEM and mathematical reasoning
    • gemma3:4b - Google; well-rounded model with strong reasoning performance
    • phi3:3.8b - Microsoft; built specifically for efficient reasoning in resource-constrained environments
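
The three model tags follow Ollama's naming scheme, so one reasonable reading is that the pipeline drives them through Ollama's local REST API. The endpoint and request shape below are Ollama's documented defaults; the prompt is just an example:

```python
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint
MODELS = ["qwen3:4b", "gemma3:4b", "phi3:3.8b"]

def ask(model: str, question: str) -> dict:
    """Send one problem to a local model and return the full response record."""
    reply = requests.post(
        OLLAMA_URL,
        json={"model": model, "prompt": question, "stream": False},
        timeout=600,  # 4B-class models can be slow on modest hardware
    )
    reply.raise_for_status()
    return reply.json()  # includes "response" plus timing and token counts

for model in MODELS:
    record = ask(model, "What is the derivative of x^2?")
    print(model, "->", record["response"][:80])
```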

SYSTEM ARCHITECTURE

[System architecture diagram]

METRICS

Each run logs five metric groups (latency and token figures are captured as sketched after this list):

  • Correctness: extracted answer matches the expected solution
  • Latency: time breakdown from request to response
  • Token Usage: input/output token counts, plus generation and processing speed
  • Hardware: CPU/GPU utilization, power draw, and RAM/VRAM usage
  • Run Summary: aggregate accuracy, question counts, and total and average processing time
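
When the models are served through Ollama (the assumption made above), most of these metrics fall straight out of each response record: the API reports durations in nanoseconds and token counts per request. Hardware figures would come from a separate monitor such as nvidia-smi or psutil and are not shown here. A sketch:

```python
def metrics_row(record: dict, correct: bool) -> dict:
    """Derive latency and token metrics from one Ollama response record.

    Field names are Ollama's documented response fields; durations are
    reported in nanoseconds.
    """
    eval_s = record["eval_duration"] / 1e9    # time spent generating tokens
    total_s = record["total_duration"] / 1e9  # full request-to-response time
    return {
        "correct": correct,
        "latency_s": total_s,
        "input_tokens": record["prompt_eval_count"],
        "output_tokens": record["eval_count"],
        "tokens_per_s": record["eval_count"] / eval_s if eval_s else 0.0,
    }

def run_summary(rows: list[dict]) -> dict:
    """Aggregate per-question rows into the run-level summary."""
    n = len(rows)
    total_time = sum(r["latency_s"] for r in rows)
    return {
        "questions": n,
        "accuracy": sum(r["correct"] for r in rows) / n,
        "total_time_s": total_time,
        "avg_time_s": total_time / n,
    }
```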

RESULTS & DISCUSSION

  • Qwen3:4b was the most accurate model on 2 of 3 datasets; Gemma3:4b led on Calculus I
  • Phi3:3.8b performed poorly across the board, scoring below 13% on every dataset
  • Gemma3:4b was the most energy-efficient, producing the most correct answers per watt-hour (computed as sketched after this list)
  • All models were highly consistent across repeated runs, with accuracy varying by less than 2%
  • Qwen3:4b generated significantly more output tokens per question, averaging 3-4× more than Gemma3:4b
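
The answers-per-watt-hour figure can be reconstructed from sampled power draw: integrate watts over the run to get watt-hours, then divide correct answers by that energy. The evenly spaced sampling scheme below is an illustrative assumption, not BrainBench's actual logger:

```python
def correct_per_wh(correct_answers: int, power_samples_w: list[float],
                   sample_interval_s: float) -> float:
    """Correct answers per watt-hour from evenly spaced power-draw samples.

    Energy (Wh) = mean power (W) x run time (h).
    """
    run_hours = len(power_samples_w) * sample_interval_s / 3600
    mean_watts = sum(power_samples_w) / len(power_samples_w)
    return correct_answers / (mean_watts * run_hours)

# Example: 42 correct answers over a 30-minute run sampled at 1 Hz, ~35 W average
# print(correct_per_wh(42, [35.0] * 1800, 1.0))  # -> 2.4 answers per Wh
```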
