MILESTONE 5 PROGRESS EVALUATION
BrainBench
LLM Evaluation Dashboard
Automated framework for evaluating large language models on structured mathematical datasets
Team Members
Daniella Seum & Orion Powers
Faculty Advisor
Dr. Khaled Slhoub
Dept. of Electrical Engineering & Computer Science
Florida Institute of Technology
01
Progress Summary
Milestone 5 Overview
Pipeline Execution
Ran all three local LLMs across all three datasets, with three iterations per model
Completion: 100%
Website Development
Core dashboard views implemented with Chart.js visualizations for accuracy and timing
Completion: 80%
System Evaluation
Conducted analysis of LLM performance and framework effectiveness across datasets
Completion: 70%
Showcase Poster
First draft complete with project overview, methods, and preliminary results
Completion: 70%
02
Task 1
Pipeline Execution Complete
Execution Summary
3 Local LLMs Evaluated: qwen3:4b, gemma3:4b, phi3:3.8b
3 Datasets Covered: 8th Grade Math, Calculus I, Advanced Probability & Statistics
3 Iterations per Model: Complete set of runs for consistency measurement
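The run matrix implied by the summary above can be sketched as follows. This is a minimal illustration, not the project's actual pipeline code: the function name `build_run_plan` is hypothetical, and it assumes each iteration covers every model and dataset pairing.

```python
from itertools import product

# Model tags and dataset names as listed in the Execution Summary.
MODELS = ["qwen3:4b", "gemma3:4b", "phi3:3.8b"]
DATASETS = ["8th Grade Math", "Calculus I", "Advanced Probability & Statistics"]
ITERATIONS = 3


def build_run_plan(models, datasets, iterations):
    """Return one (model, dataset, iteration) tuple per planned run.

    Assumption: every iteration repeats every model/dataset pairing.
    """
    return [
        (model, dataset, iteration)
        for model, dataset in product(models, datasets)
        for iteration in range(1, iterations + 1)
    ]


run_plan = build_run_plan(MODELS, DATASETS, ITERATIONS)
print(f"{len(run_plan)} planned runs")
```

Under that assumption the plan contains 3 x 3 x 3 = 27 runs, which is one way to sanity-check that the logged results are complete.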
Validation Results
All results correctly parsed and logged
Edge cases handled successfully
Data ready for analysis and integration
Key Challenges
Execution Time
Slower models required extended execution periods to complete all runs
Stability Monitoring
Continuous oversight needed to maintain consistent performance across all iterations
Status
Successfully Completed
03
Feature
Metrics Recording Capability
Ollama Metrics
total_duration_s
load_duration_s
prompt_eval_count
prompt_eval_duration_s
eval_count
eval_duration_s
generation_speed_tps
prompt_processing_speed_tps
ollama_overhead_s
output_to_input_ratio
Performance Tracking
Comprehensive timing and throughput metrics for each model run
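As a rough sketch of how the derived fields above could be computed: Ollama responses report `total_duration`, `load_duration`, `prompt_eval_duration`, and `eval_duration` in nanoseconds, along with `prompt_eval_count` and `eval_count` token counts. The formulas for `ollama_overhead_s` and `output_to_input_ratio` below are assumptions about what those names mean, and `derive_ollama_metrics` is a hypothetical helper, not the project's confirmed implementation.

```python
NS_PER_S = 1_000_000_000  # Ollama reports durations in nanoseconds


def _safe_div(numerator, denominator):
    """Divide, returning None when the denominator is zero."""
    return numerator / denominator if denominator else None


def derive_ollama_metrics(response: dict) -> dict:
    """Convert raw Ollama timing fields to seconds and derive throughput."""
    total_s = response["total_duration"] / NS_PER_S
    load_s = response.get("load_duration", 0) / NS_PER_S
    prompt_s = response.get("prompt_eval_duration", 0) / NS_PER_S
    eval_s = response["eval_duration"] / NS_PER_S
    prompt_tokens = response.get("prompt_eval_count", 0)
    output_tokens = response["eval_count"]

    return {
        "total_duration_s": total_s,
        "load_duration_s": load_s,
        "prompt_eval_count": prompt_tokens,
        "prompt_eval_duration_s": prompt_s,
        "eval_count": output_tokens,
        "eval_duration_s": eval_s,
        "generation_speed_tps": _safe_div(output_tokens, eval_s),
        "prompt_processing_speed_tps": _safe_div(prompt_tokens, prompt_s),
        # Assumption: overhead is the time not attributed to any reported phase.
        "ollama_overhead_s": total_s - load_s - prompt_s - eval_s,
        # Assumption: ratio of generated tokens to prompt tokens.
        "output_to_input_ratio": _safe_div(output_tokens, prompt_tokens),
    }


# Made-up response values, for illustration only.
example = derive_ollama_metrics({
    "total_duration": 5_000_000_000,
    "load_duration": 500_000_000,
    "prompt_eval_count": 50,
    "prompt_eval_duration": 250_000_000,
    "eval_count": 200,
    "eval_duration": 4_000_000_000,
})
print(example["generation_speed_tps"], example["prompt_processing_speed_tps"])
```

Returning None for a zero denominator keeps cached-prompt runs (where the prompt evaluation time may be reported as zero) from raising an error.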
System Metrics
Memory Usage
ram_peak_mb
ram_avg_mb
CPU Metrics
cpu_avg_percent
cpu_peak_percent
GPU Metrics
gpu_util_avg_percent
gpu_util_peak_percent
gpu_vram_peak_mb
gpu_vram_avg_mb
gpu_temp_avg_c
gpu_temp_peak_c
gpu_power_avg_w
gpu_power_peak_w
Energy & Monitoring
energy_estimate_wh
monitoring_duration_s
sample_count
Resource Monitoring
Complete hardware utilization and energy consumption tracking
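One way the peak, average, and energy fields above might be aggregated from periodic samples is sketched below. The sample keys (`ram_mb`, `cpu_percent`, `gpu_power_w`), the fixed sampling interval, and the energy formula (average GPU power times monitoring duration) are all assumptions for illustration. The deck does not state how BrainBench collects or integrates these values.

```python
from statistics import mean


def summarize_samples(samples, interval_s):
    """Aggregate fixed-interval hardware samples into summary metrics.

    Each sample is a dict with hypothetical keys:
    ram_mb, cpu_percent, gpu_power_w.
    """
    if not samples:
        raise ValueError("at least one sample is required")

    ram = [s["ram_mb"] for s in samples]
    cpu = [s["cpu_percent"] for s in samples]
    power = [s["gpu_power_w"] for s in samples]

    # Assumption: samples are evenly spaced, so duration = count * interval.
    duration_s = len(samples) * interval_s

    return {
        "ram_peak_mb": max(ram),
        "ram_avg_mb": mean(ram),
        "cpu_peak_percent": max(cpu),
        "cpu_avg_percent": mean(cpu),
        "gpu_power_peak_w": max(power),
        "gpu_power_avg_w": mean(power),
        # Watts * seconds = joules; 3600 joules = 1 watt-hour.
        "energy_estimate_wh": mean(power) * duration_s / 3600,
        "monitoring_duration_s": duration_s,
        "sample_count": len(samples),
    }


# Made-up samples, for illustration only.
summary = summarize_samples(
    [
        {"ram_mb": 4000, "cpu_percent": 20.0, "gpu_power_w": 100.0},
        {"ram_mb": 4200, "cpu_percent": 40.0, "gpu_power_w": 120.0},
        {"ram_mb": 4400, "cpu_percent": 60.0, "gpu_power_w": 140.0},
        {"ram_mb": 4200, "cpu_percent": 40.0, "gpu_power_w": 120.0},
    ],
    interval_s=0.5,
)
print(summary["gpu_power_avg_w"], summary["sample_count"])
```

The GPU utilization, VRAM, and temperature fields would follow the same max/mean pattern.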
04
Task 3
System Evaluation & Analysis
LLM Performance Analysis
Accuracy Comparisons
Analyzed performance across datasets and mathematical domains
Consistency Assessment
Examined reliability across repeated runs
Response Time Trends
Evaluated performance timing patterns
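The deck does not say how consistency across repeated runs was quantified. One plausible approach, sketched here under that caveat, is to summarize each model's per-iteration accuracy with its mean, standard deviation, and range. The function name and the sample accuracies are hypothetical.

```python
from statistics import mean, pstdev


def consistency_summary(accuracies):
    """Summarize accuracy across repeated runs of one model on one dataset.

    A smaller standard deviation or range suggests more consistent behavior.
    """
    if not accuracies:
        raise ValueError("at least one run is required")
    return {
        "mean_accuracy": mean(accuracies),
        "std_accuracy": pstdev(accuracies),  # population standard deviation
        "range_accuracy": max(accuracies) - min(accuracies),
        "runs": len(accuracies),
    }


# Made-up accuracies for three iterations, for illustration only.
stats = consistency_summary([0.80, 0.85, 0.75])
print(stats)
```

The same pattern applies to response times by passing per-run durations in place of accuracies.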
Framework Assessment
Pipeline Effectiveness
Evaluated overall system performance
Edge Case Handling
Identified areas for final adjustments
Verification Confidence
Analyzed correctness determination reliability
Key Findings
Model Strengths
Identified where each model performed reliably across different problem types
Performance Patterns
Discovered trends in model behavior and areas of struggle
System Reliability
Confirmed overall framework effectiveness with minor inconsistencies noted
05
Task 2
Website Development Progress
Technology Stack
Angular Framework
Frontend application structure
Chart.js
Interactive data visualizations
Remaining Work
Refining integration process for seamless data flow
Polishing user interface for clarity and usability
Iterative adjustments to data processing layers
Core Features Implemented
Accuracy Comparisons
Per-category performance visualization across models and datasets
Response Time Views
Timing trend analysis and performance metrics display
Dashboard Views
Interactive interface for exploring evaluation results
06
Task 4
Senior Design Showcase Poster
Poster Preview
[Figure placeholder: Showcase Poster preview, based on Showcase Poster.pptx]
Poster Sections
1. Abstract: Project overview and objectives
2. Motivation: Problem statement and need
3. Methods: Evaluation approach and datasets
4. Results: Performance findings and analysis
5. Web Application: Dashboard features and interface
Status
First Draft
Finalizing text and graphics
07
Looking Ahead
Next Steps: Milestone 6
System Testing
Conduct full testing and cleanup of the completed BrainBench system to ensure all components function together reliably
Both team members
Final Evaluation
Finalize overall system evaluation by conducting comprehensive analysis and incorporating final adjustments
Both team members
User Manual
Create comprehensive user and developer manual documenting how to use, run, and extend the BrainBench system
Both team members
Demo Video
Produce demo video presenting system functionality and purpose with clear walkthrough of key features
Both team members
Research Paper
Develop research paper documenting design, implementation, and evaluation of the BrainBench system
Both team members
Final Milestone
Completing all deliverables for project conclusion and Senior Design Showcase presentation
Questions?
Thank you for your attention.
Daniella Seum
dseum2023@my.fit.edu
Orion Powers
opowers2023@my.fit.edu
Faculty Advisor: Dr. Khaled Slhoub | Florida Institute of Technology