MILESTONE 5 PROGRESS EVALUATION

BrainBench

LLM Evaluation Dashboard

Automated framework for evaluating large language models on structured mathematical datasets

Team Members

Daniella Seum & Orion Powers

Faculty Advisor

Dr. Khaled Slhoub

Dept. of Electrical Engineering & Computer Science

Florida Institute of Technology

01

Progress Summary

Milestone 5 Overview

Pipeline Execution

Successfully ran all three local LLMs across all datasets with multiple iterations per model

Completion: 100%

Website Development

Core dashboard views implemented with Chart.js visualizations for accuracy and timing

Completion: 80%

System Evaluation

Conducted analysis of LLM performance and framework effectiveness across datasets

Completion: 70%

Showcase Poster

First draft complete with project overview, methods, and preliminary results

Completion: 70%

02

Task 1

Pipeline Execution Complete

Execution Summary

Local LLMs Evaluated: 3

qwen3:4b, gemma3:4b, phi3:3.8b

Datasets Covered: 3

8th Grade Math, Calculus I, Advanced Probability & Statistics

Iterations per Model: 3

Complete set of runs for consistency measurement
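As a rough illustration of the scale of this execution, the sketch below enumerates the run matrix implied by the numbers above. It assumes every model was run on every dataset three times; the function name `build_run_plan` and the tuple layout are hypothetical and not taken from the BrainBench code.

```python
from itertools import product

# Model and dataset names as listed on this slide.
MODELS = ["qwen3:4b", "gemma3:4b", "phi3:3.8b"]
DATASETS = ["8th Grade Math", "Calculus I", "Advanced Probability & Statistics"]
ITERATIONS = 3


def build_run_plan(models, datasets, iterations):
    """Return one (model, dataset, iteration) tuple per scheduled run.

    Hypothetical helper: assumes each model runs each dataset
    `iterations` times, numbered from 1.
    """
    return [
        (model, dataset, iteration)
        for model, dataset in product(models, datasets)
        for iteration in range(1, iterations + 1)
    ]


run_plan = build_run_plan(MODELS, DATASETS, ITERATIONS)
print(len(run_plan))  # 27 runs: 3 models x 3 datasets x 3 iterations
```

Under that assumption, the pipeline covers 27 model-dataset-iteration combinations in total.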

Validation Results

All results correctly parsed and logged

Edge cases handled successfully

Data ready for analysis and integration

Key Challenges

Execution Time

Slower models required extended execution periods to complete all runs

Stability Monitoring

Continuous oversight needed to maintain consistent performance across all iterations

Status: Successfully Completed

03

Feature

Metrics Recording Capability

Ollama Metrics

total_duration_s

load_duration_s

prompt_eval_count

prompt_eval_duration_s

eval_count

eval_duration_s

generation_speed_tps

prompt_processing_speed_tps

ollama_overhead_s

output_to_input_ratio

Performance Tracking

Comprehensive timing and throughput metrics for each model run
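The sketch below shows one way the derived fields in this list could be computed from an Ollama response. Ollama reports its duration fields (`total_duration`, `load_duration`, `prompt_eval_duration`, `eval_duration`) in nanoseconds. The formulas for `ollama_overhead_s` and `output_to_input_ratio` are assumptions about what those names mean, and `derive_metrics` is a hypothetical helper, not the project's actual code.

```python
NS_PER_S = 1_000_000_000  # Ollama durations are reported in nanoseconds


def derive_metrics(resp: dict) -> dict:
    """Convert raw Ollama timing fields into per-run metrics (hypothetical helper)."""
    total_s = resp["total_duration"] / NS_PER_S
    load_s = resp["load_duration"] / NS_PER_S
    prompt_s = resp["prompt_eval_duration"] / NS_PER_S
    eval_s = resp["eval_duration"] / NS_PER_S
    prompt_tokens = resp["prompt_eval_count"]
    output_tokens = resp["eval_count"]

    return {
        "total_duration_s": total_s,
        "load_duration_s": load_s,
        "prompt_eval_count": prompt_tokens,
        "prompt_eval_duration_s": prompt_s,
        "eval_count": output_tokens,
        "eval_duration_s": eval_s,
        # Tokens per second; guard against zero durations.
        "generation_speed_tps": output_tokens / eval_s if eval_s > 0 else 0.0,
        "prompt_processing_speed_tps": prompt_tokens / prompt_s if prompt_s > 0 else 0.0,
        # Assumption: overhead is the time not attributed to load, prompt, or generation.
        "ollama_overhead_s": total_s - load_s - prompt_s - eval_s,
        # Assumption: ratio of generated tokens to prompt tokens.
        "output_to_input_ratio": output_tokens / prompt_tokens if prompt_tokens > 0 else 0.0,
    }


# Illustrative values only, not measured results.
sample = {
    "total_duration": 6_000_000_000,
    "load_duration": 1_000_000_000,
    "prompt_eval_count": 50,
    "prompt_eval_duration": 500_000_000,
    "eval_count": 200,
    "eval_duration": 4_000_000_000,
}
metrics = derive_metrics(sample)
print(metrics["generation_speed_tps"])  # 50.0 tokens per second
```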

System Metrics

Memory Usage

ram_peak_mb

ram_avg_mb

CPU Metrics

cpu_avg_percent

cpu_peak_percent

GPU Metrics

gpu_util_avg_percent

gpu_util_peak_percent

gpu_vram_peak_mb

gpu_vram_avg_mb

gpu_temp_avg_c

gpu_temp_peak_c

gpu_power_avg_w

gpu_power_peak_w

Energy & Monitoring

energy_estimate_wh

monitoring_duration_s

sample_count

Resource Monitoring

Complete hardware utilization and energy consumption tracking
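To make the peak, average, and energy fields concrete, here is a minimal sketch that reduces a list of periodic samples to summary values. It assumes samples are taken at a fixed interval and that the energy estimate is average GPU power multiplied by monitoring time; the sample keys and `summarize_samples` are hypothetical, and only a subset of the listed metrics is shown.

```python
from statistics import mean


def summarize_samples(samples, interval_s):
    """Reduce fixed-interval resource samples to peak/average metrics.

    Hypothetical helper. Each sample is a dict with the keys
    'ram_mb', 'cpu_percent', and 'gpu_power_w'.
    """
    if not samples:
        raise ValueError("at least one sample is required")

    duration_s = len(samples) * interval_s
    ram = [s["ram_mb"] for s in samples]
    cpu = [s["cpu_percent"] for s in samples]
    power = [s["gpu_power_w"] for s in samples]
    avg_power_w = mean(power)

    return {
        "ram_peak_mb": max(ram),
        "ram_avg_mb": mean(ram),
        "cpu_avg_percent": mean(cpu),
        "cpu_peak_percent": max(cpu),
        "gpu_power_avg_w": avg_power_w,
        "gpu_power_peak_w": max(power),
        # Assumption: energy (Wh) = average GPU power (W) * time (s) / 3600.
        "energy_estimate_wh": avg_power_w * duration_s / 3600,
        "monitoring_duration_s": duration_s,
        "sample_count": len(samples),
    }


# Illustrative samples only, not measured results.
samples = [
    {"ram_mb": 4000.0, "cpu_percent": 20.0, "gpu_power_w": 100.0},
    {"ram_mb": 4200.0, "cpu_percent": 40.0, "gpu_power_w": 120.0},
    {"ram_mb": 4400.0, "cpu_percent": 60.0, "gpu_power_w": 140.0},
    {"ram_mb": 4200.0, "cpu_percent": 40.0, "gpu_power_w": 120.0},
]
summary = summarize_samples(samples, interval_s=1.0)
```

With these four one-second samples the average GPU power is 120 W over 4 s, so the estimate is 120 * 4 / 3600 Wh. An estimate built this way covers GPU power only and would understate whole-system energy.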

04

Task 3

System Evaluation & Analysis

LLM Performance Analysis

Accuracy Comparisons

Analyzed performance across datasets and mathematical domains

Consistency Assessment

Examined reliability across repeated runs

Response Time Trends

Evaluated performance timing patterns
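One possible way to quantify accuracy and consistency across repeated runs is sketched below. It assumes each run is stored as a list of per-question booleans in a fixed question order; the function names and the choice of statistics (mean, population standard deviation, range, per-question agreement) are illustrative assumptions rather than the project's documented method.

```python
from statistics import mean, pstdev


def accuracy(results):
    """Fraction of questions answered correctly in one run."""
    return sum(results) / len(results)


def consistency_summary(runs):
    """Summarize accuracy and agreement across repeated runs (hypothetical helper).

    `runs` is a list of runs; each run is a list of booleans in the
    same question order (True = answered correctly).
    """
    accuracies = [accuracy(run) for run in runs]
    # A question is "stable" when every run gives the same outcome for it.
    stable = [len(set(outcomes)) == 1 for outcomes in zip(*runs)]
    return {
        "mean_accuracy": mean(accuracies),
        "stdev_accuracy": pstdev(accuracies),
        "accuracy_range": max(accuracies) - min(accuracies),
        "agreement_rate": sum(stable) / len(stable),
    }


# Illustrative outcomes only, not project results.
runs = [
    [True, True, False, True],
    [True, False, False, True],
    [True, True, False, True],
]
report = consistency_summary(runs)
```

In this toy input the per-run accuracies are 0.75, 0.5, and 0.75, and three of the four questions receive the same outcome in every run, giving an agreement rate of 0.75.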

Framework Assessment

Pipeline Effectiveness

Evaluated overall system performance

Edge Case Handling

Identified areas for final adjustments

Verification Confidence

Analyzed correctness determination reliability
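The slides do not describe how correctness is determined, so the following is only a sketch of one common approach: compare answers numerically within a tolerance when both parse as numbers, and fall back to a normalized string comparison otherwise. The names `parse_number` and `answers_match` and the tolerance values are hypothetical.

```python
import math
from fractions import Fraction


def parse_number(text):
    """Parse text such as '0.5', '1/2', or '1,000' into a float, or return None."""
    cleaned = text.strip().replace(",", "")
    try:
        return float(Fraction(cleaned))
    except (ValueError, ZeroDivisionError):
        return None


def answers_match(expected, actual, rel_tol=1e-6):
    """Return True when the model answer is judged equal to the expected answer.

    Hypothetical check: numeric comparison with tolerance when both sides
    parse as numbers, otherwise case-insensitive exact text comparison.
    """
    expected_num = parse_number(expected)
    actual_num = parse_number(actual)
    if expected_num is not None and actual_num is not None:
        return math.isclose(expected_num, actual_num, rel_tol=rel_tol, abs_tol=1e-9)
    return expected.strip().lower() == actual.strip().lower()


print(answers_match("1/2", "0.5"))  # True
```

A check like this is a natural place for edge cases to appear, for example answers that are numerically correct but formatted unexpectedly, which is one reason verification reliability is worth analyzing separately from model accuracy.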

Key Findings

Model Strengths

Identified where each model performed reliably across different problem types

Performance Patterns

Discovered trends in model behavior and areas of struggle

System Reliability

Confirmed overall framework effectiveness with minor inconsistencies noted

05

Task 2

Website Development Progress

Technology Stack

Angular Framework

Frontend application structure

Chart.js

Interactive data visualizations

Remaining Work

Refining integration process for seamless data flow

Polishing user interface for clarity and usability

Iterative adjustments to data processing layers

Core Features Implemented

Accuracy Comparisons

Per-category performance visualization across models and datasets

Response Time Views

Timing trend analysis and performance metrics display

Dashboard Views

Interactive interface for exploring evaluation results
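As a sketch of how evaluation results might be shaped for the accuracy view, the snippet below builds a Chart.js bar chart configuration (`type`, `data.labels`, `data.datasets`, `options.scales`) and serializes it to JSON for the frontend. The function name, the input layout, and the idea of exporting JSON from the pipeline are assumptions; the numbers are placeholders, not measured results.

```python
import json


def accuracy_bar_config(accuracy_by_model, categories):
    """Build a Chart.js bar chart config with one dataset per model.

    Hypothetical helper. `accuracy_by_model` maps a model name to a
    dict of {category: accuracy percentage}.
    """
    return {
        "type": "bar",
        "data": {
            "labels": list(categories),
            "datasets": [
                {"label": model, "data": [scores[c] for c in categories]}
                for model, scores in accuracy_by_model.items()
            ],
        },
        # Accuracy is a percentage, so fix the y axis to 0..100.
        "options": {"scales": {"y": {"beginAtZero": True, "max": 100}}},
    }


categories = ["8th Grade Math", "Calculus I"]
# Placeholder values for illustration only.
placeholder_scores = {
    "model-a": {"8th Grade Math": 10.0, "Calculus I": 20.0},
    "model-b": {"8th Grade Math": 30.0, "Calculus I": 40.0},
}
config = accuracy_bar_config(placeholder_scores, categories)
config_json = json.dumps(config, indent=2)
```

On the Angular side, a parsed object of this shape could be passed as the configuration argument when constructing a Chart.js chart.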

06

Task 4

Senior Design Showcase Poster

Poster Preview

[Poster image placeholder; source file: Showcase Poster.pptx]

Poster Sections

1. Abstract: Project overview and objectives

2. Motivation: Problem statement and need

3. Methods: Evaluation approach and datasets

4. Results: Performance findings and analysis

5. Web Application: Dashboard features and interface

Status: First Draft (finalizing text and graphics)

07

Looking Ahead

Next Steps: Milestone 6

System Testing

Conduct a full test and cleanup of the completed BrainBench system to ensure all components work together reliably

Both team members

Final Evaluation

Finalize overall system evaluation by conducting comprehensive analysis and incorporating final adjustments

Both team members

User Manual

Create comprehensive user and developer manual documenting how to use, run, and extend the BrainBench system

Both team members

Demo Video

Produce a demo video presenting the system's functionality and purpose, with a clear walkthrough of key features

Both team members

Research Paper

Develop research paper documenting design, implementation, and evaluation of the BrainBench system

Both team members

Final Milestone

Completing all deliverables for project conclusion and Senior Design Showcase presentation

Questions?

Thank you for your attention.

Daniella Seum

dseum2023@my.fit.edu

Orion Powers

opowers2023@my.fit.edu

Faculty Advisor: Dr. Khaled Slhoub | Florida Institute of Technology