
BrainBench - LLM Evaluation

Daniella Seum and Orion Powers

Faculty Advisor: Dr. Khaled Slhoub, Dept. of Electrical Engineering and Computer Science, Florida Institute of Technology

MOTIVATION

As large language models become more widely used, there is a growing need for reliable and consistent methods to evaluate their performance. Existing benchmarks often lack transparency, consistency across runs, or the ability to handle varied answer formats. BrainBench was developed to address these gaps by providing an automated, reproducible evaluation framework that enables fair comparison of models and deeper insight into their strengths and limitations.

METHODS

  • Compiled 3 math datasets of varying difficulty: 8th Grade Math, Calculus I, and Advanced Probability and Statistics
  • Evaluated each model on all 3 datasets, running 3 iterations per dataset
  • Problems were automatically fed through the pipeline, with responses parsed, verified, and logged without manual intervention
  • Extracted final answers via hierarchical keyword and LaTeX parsing, then normalized numeric and symbolic formats (see the parsing sketch after this list)
  • Visualized all results in an Angular + Chart.js dashboard with cross-model comparison views
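
The answer-extraction step can be pictured with a short, hedged sketch. The poster specifies only "hierarchical keyword and LaTeX parsing, then normalization", so the function names and regex patterns below are illustrative assumptions, not BrainBench's actual code:

```python
import re
from fractions import Fraction

# Hierarchical extraction sketch: try an explicit answer keyword first,
# then LaTeX \boxed{}, then fall back to the last number in the text.
# The patterns are assumptions, not BrainBench's actual rules.
KEYWORD = re.compile(r"(?:final answer|answer)\s*[:=]\s*(.+)", re.IGNORECASE)
BOXED = re.compile(r"\\boxed\{([^{}]+)\}")
NUMBER = re.compile(r"-?\d+(?:\.\d+)?(?:/\d+)?")

def extract_answer(response: str) -> str | None:
    for pattern in (KEYWORD, BOXED):
        match = pattern.search(response)
        if match:
            return match.group(1).strip()
    numbers = NUMBER.findall(response)
    return numbers[-1] if numbers else None

def normalize(answer: str) -> str:
    """Map numeric and simple symbolic forms to one canonical string."""
    cleaned = answer.replace("$", "").replace(",", "").strip().rstrip(".")
    try:
        return str(Fraction(cleaned))  # "0.5" and "1/2" both become "1/2"
    except (ValueError, ZeroDivisionError):
        return cleaned.lower()         # non-numeric answers compared as text

def is_correct(response: str, expected: str) -> bool:
    extracted = extract_answer(response)
    return extracted is not None and normalize(extracted) == normalize(expected)
```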

LOCAL LLMS

  • Local models eliminate API costs and allow unlimited test runs on personal hardware
  • All 3 models run at comparable parameter scales, isolating architectural differences rather than size
  • Selected Models (run locally; see the query sketch after this list):
    • qwen3:4b - Alibaba; known for strong STEM and mathematical reasoning
    • gemma3:4b - Google; well-rounded model with strong reasoning performance
    • phi3:3.8b - Microsoft; built specifically for efficient reasoning in resource-constrained environments
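
The three model tags follow Ollama's naming scheme, so one reasonable reading is that the pipeline drives them through Ollama's local REST API. The endpoint and request shape below are Ollama's documented defaults; the prompt is just an example:

```python
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint
MODELS = ["qwen3:4b", "gemma3:4b", "phi3:3.8b"]

def ask(model: str, question: str) -> dict:
    """Send one problem to a local model and return the full response record."""
    reply = requests.post(
        OLLAMA_URL,
        json={"model": model, "prompt": question, "stream": False},
        timeout=600,  # 4B-class models can be slow on modest hardware
    )
    reply.raise_for_status()
    return reply.json()  # includes "response" plus timing and token counts

for model in MODELS:
    record = ask(model, "What is the derivative of x^2?")
    print(model, "->", record["response"][:80])
```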

SYSTEM ARCHITECTURE

[System architecture diagram]

METRICS

Each run logs five metric groups (latency and token figures are captured as sketched after this list):

  • Correctness: extracted answer matches the expected solution
  • Latency: time breakdown from request to response
  • Token Usage: input/output token counts, plus generation and processing speed
  • Hardware: CPU/GPU utilization, power draw, and RAM/VRAM usage
  • Run Summary: aggregate accuracy, question counts, and total and average processing time
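
When the models are served through Ollama (the assumption made above), most of these metrics fall straight out of each response record: the API reports durations in nanoseconds and token counts per request. Hardware figures would come from a separate monitor such as nvidia-smi or psutil and are not shown here. A sketch:

```python
def metrics_row(record: dict, correct: bool) -> dict:
    """Derive latency and token metrics from one Ollama response record.

    Field names are Ollama's documented response fields; durations are
    reported in nanoseconds.
    """
    eval_s = record["eval_duration"] / 1e9    # time spent generating tokens
    total_s = record["total_duration"] / 1e9  # full request-to-response time
    return {
        "correct": correct,
        "latency_s": total_s,
        "input_tokens": record["prompt_eval_count"],
        "output_tokens": record["eval_count"],
        "tokens_per_s": record["eval_count"] / eval_s if eval_s else 0.0,
    }

def run_summary(rows: list[dict]) -> dict:
    """Aggregate per-question rows into the run-level summary."""
    n = len(rows)
    total_time = sum(r["latency_s"] for r in rows)
    return {
        "questions": n,
        "accuracy": sum(r["correct"] for r in rows) / n,
        "total_time_s": total_time,
        "avg_time_s": total_time / n,
    }
```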

RESULTS & DISCUSSION

  • Qwen3:4b was the most accurate model on 2 of 3 datasets; Gemma3:4b led on Calculus I
  • Phi3:3.8b performed poorly across the board, scoring below 13% on every dataset
  • Gemma3:4b was the most energy-efficient, producing the most correct answers per watt-hour (computed as sketched after this list)
  • All models were highly consistent across repeated runs, with accuracy varying by less than 2%
  • Qwen3:4b generated significantly more output tokens per question, averaging 3-4× more than Gemma3:4b
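
The answers-per-watt-hour figure can be reconstructed from sampled power draw: integrate watts over the run to get watt-hours, then divide correct answers by that energy. The evenly spaced sampling scheme below is an illustrative assumption, not BrainBench's actual logger:

```python
def correct_per_wh(correct_answers: int, power_samples_w: list[float],
                   sample_interval_s: float) -> float:
    """Correct answers per watt-hour from evenly spaced power-draw samples.

    Energy (Wh) = mean power (W) x run time (h).
    """
    run_hours = len(power_samples_w) * sample_interval_s / 3600
    mean_watts = sum(power_samples_w) / len(power_samples_w)
    return correct_answers / (mean_watts * run_hours)

# Example: 42 correct answers over a 30-minute run sampled at 1 Hz, ~35 W average
# print(correct_per_wh(42, [35.0] * 1800, 1.0))  # -> 2.4 answers per Wh
```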
