Evaluating Mathematical Reasoning in LLMs
26 February 2025
Best practices for benchmarking and fine-tuning
Agenda
Mathematical reasoning for LLMs: Key use cases and areas where math reasoning fundamentally enhances language model capabilities.
Auto-formalization in math: AlphaProof, Lean, and the applicability of automated verifiers for LLM training and evaluation.
Expert-curated data collection: Best practices for sourcing high-quality datasets to assess and enhance model performance, including for university-level mathematical reasoning.
Designing effective benchmarks: Developing robust evaluation metrics, assessing model performance across different domains, and extracting actionable insights.
Boosting LLM performance with new data: Optimally leveraging a custom university-level training dataset to enhance the mathematical proficiency of top-performing LLMs.
Practical applications: Applying the results in academic, educational, and product contexts.
2
Team
3
Iuliya Beloshapka
Senior Research Engineer, Google DeepMind
Alexei Miasnikov
Research Professor of Mathematics, Stevens Institute of Technology
Vlad Stepanov
CEO, Gradarius
Konstantin Chernyshev
Machine Learning Researcher, Toloka
Vitaliy Polshkov
Senior Research Engineer, Toloka
Konstantin Chernyshev
Research Scientist, Toloka
Overview
4
Introduction: What is Mathematical Reasoning in LLMs?
9
1: Fine-Tuning Enhances Existing Mechanisms – https://arxiv.org/pdf/2402.14811
Beyond memorized facts:
multi-step logic & symbolic manipulation
Proxy for capabilities:
Core marker of advanced cognitive capabilities in AI¹
Differentiates depth:
surface-level “pattern matching” vs. actual reasoning
Deep subject knowledge:
Tests not only calculations but deep knowledge of methods
Introduction: Why Does It Matter?
Greater reliability: Math tasks expose errors / hallucinations more clearly
Deeper skills: Multi-step proofs promote more robust logic
Stress test: Reveals true reasoning ability
Proxy for “intelligence”: Math success correlates with complex reasoning
10
Introduction: Key Applications
Education Tutoring
Step‑by‑step solutions, personalized feedback, solution explanations
Coding & Algorithms
Complexity analysis, bug detection, correctness proofs, algorithms speedup
Scientific/Engineering
Interpreting formulas, data processing, simulations
Direct Applications
Direct usage in finance, reports, etc.
Math Research / Theorem Proving
Checking and generating proofs (AlphaProof, etc.), incl. Lean/Coq/etc.
Automated Agents
Assisting in research beyond pure math (e.g. data analysis)
11
Introduction: Core Challenges
General:
Training:
Eval:
14
1: GSM-Symbolic – https://arxiv.org/pdf/2410.05229
2: Can Transformers Solve Everything? – https://towardsdatascience.com/can-transformers-solve-everything-0421ebaf9f8e/
3: Towards Effective Process Supervision in Mathematical Reasoning – https://qwenlm.github.io/blog/qwen2.5-math-prm/
Introduction: Core Challenges
Context: Models are given a solution and the correct answer with a CoT prompt to check whether they match (spoiler – they do).
15
Introduction: Current State of the Art
Performance dips on truly novel or visual university-level problems
Even top models reach only ~70–80% accuracy on advanced text tasks, and less on visual tasks
16
Model | U-MATH | MATH-500 | AIME 2025 | MathVista |
o1 | 70.7 | 96.4 | 79.0 | 73.9 |
o3-mini | 68.9 | 97.9 | 80.0 | - |
DeepSeek-R1 | 68.4 | 97.3 | 65.0 | - |
QwQ-32B-Preview | 61.5 | 90.6 | 37.0 | - |
Grok-3 (Beta thinking) | - | - | 77.3 | - |
Gemini-1.5-Pro | 60.1 | - | 27.0 | 52.1 |
Qwen2.5-Math-72B | 50.2 | 85.6 | - | - |
GPT-4o-2024-08-06 | 43.5 | 76.6 | 13.0 | 63.8 |
Qwen2.5-72B | 41.0 | 82.6 | - | - |
Iuliya Beloshapka
Senior Research Engineer,
Google DeepMind
Auto-formalization in math
17
Lean
Lean is an interactive theorem prover that has become popular with mathematicians.
Lean is built on type theory. Any mathematical statement can be formalized in Lean, and any proof can be verified automatically.
Lean is also a programming language, and Lean proofs look like lines of code.
This works because there’s a correspondence between checking logical steps and type-checking.
Lean also allows users to write custom tactics for proof construction.
18
Math in Lean
Type checker tells you if the proof is correct.
Lean is interactive: after every step, the compiler says what is left to prove.
19
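To make the interactive workflow concrete, here is a minimal Lean 4 sketch (a toy example of ours, not taken from the talk); the comments show what the compiler reports as the remaining goal after each step.

-- A toy Lean 4 proof: the type checker accepts it only if every step is valid.
theorem add_comm_example (a b : Nat) : a + b = b + a := by
  -- goal shown interactively here: ⊢ a + b = b + a
  rw [Nat.add_comm]
  -- after the rewrite the goal closes: no goals remain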
Formal Approach
Formalisation of mathematics opens up incredible opportunities for mathematicians, enabling synergies at scale previously unimaginable:
Collaboration between arbitrarily large groups of mathematicians becomes possible (Terence Tao)
20
Formal vs Informal Maths
21
Formal mathematics: ~200k – verifiable, but a small amount of data
Informal mathematics: ~100m – not verifiable, but a large amount of data
Auto-formalization
If done properly, unlocks massive amounts of training data: all of the human-written maths (textbooks, arXiv, etc.)
Gives a perfect reward model for RL
Key Challenges:
Scarcity of formal data (mathlib is ~100k proofs)
No reliable automatic metrics to track progress
22
AlphaProof: IMO
Famous math competition for elite pre-college students
Notoriously difficult problems, designed for novelty
6 best young mathematicians selected from each country
Medal ratio is ~6:3:2:1 for nothing:bronze:silver:gold
Score on IMO 2024 problems
23
AlphaProof: P6 problem
Link to Rishi’s blog
24
Auto-formalization: AlphaProof
25
AlphaProof: solver network
26
AlphaProof training
Process infographic of AlphaProof’s reinforcement learning training loop: Around one million informal math problems are translated into a formal language by a formalizer network. Then a solver network searches for proofs or disproofs of the problems, progressively training itself via the AlphaZero algorithm to solve more challenging problems.
27
Informal Approach
Natural-language verifiers (ORMs, PRMs, BoN, variants of MCTS)
RL from verifiable feedback (e.g., DeepSeek’s GRPO): requires final-answer data (see the sketch after this slide)
No matter how advanced the formal approach becomes, we need harder benchmarks for evaluation, as existing ones are nearly saturated
Toloka U-MATH: A benchmark of hard, unleaked final-answer mathematical problems
28
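To illustrate the “RL from verifiable feedback” point above, here is a minimal Python sketch (our assumption of the general recipe, not DeepSeek’s implementation): final-answer checks yield binary rewards, which are then normalized within each group of samples, GRPO-style.

from statistics import mean, pstdev

def group_relative_advantages(sampled_answers, reference_answer):
    # Verifiable reward: 1.0 if the sampled final answer matches the reference, else 0.0.
    rewards = [1.0 if a == reference_answer else 0.0 for a in sampled_answers]
    mu, sigma = mean(rewards), pstdev(rewards)
    if sigma == 0:
        # All samples agree (all correct or all wrong): no learning signal for this group.
        return [0.0] * len(rewards)
    return [(r - mu) / sigma for r in rewards]

# Four sampled solutions to one problem; two reach the correct final answer "42".
print(group_relative_advantages(["42", "41", "42", "40"], "42"))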
Vlad Stepanov
CEO, Gradarius
Data Collection
29
Who We Are
10+ Years in Math Education
Gradarius: Teaching university mathematics and prerequisites
Stevens Institute of Technology: Long-term academic partnership
Real-world experience with student learning challenges
* And who is the LLM, if not a huge digital student? :)
30
Expert-curated Data Collection
General data collection approach covering: Demonstrations, Preferences, Evaluation datasets, and more.
Human oversight: Dataset design and quality control by domain super-experts and architects.
Hybrid approach: An efficient combination of human signals and automated methods, including synthetic data and auto-checks.
Human oversight
31
Expert data pipeline (diagram):
Dataset design by Data Architects
Quality Assurance by Expert Team Leads
Instructions and Taxonomy
Quality control and Feedback
Generation by Experts
Synthetic data
Auto Verification
Verification by Experts
Editing by Experts
Example: U-MATH dataset
The Questions behind our research => Dataset Design
What's the reality behind LLMs' convincing mathematical responses?
How should this inform educational strategy?
What are the implications for product development and research?
* Two parties:
”LLMs are useless” vs. “AI is here and everything turned 180°”
32
University-level Math Problems as an Input
Real problems from university Calculus courses @ Gradarius
Topics selected by subject matter experts
Converted from our internal format to LaTeX-containing Markdown
33
Generation by Experts
Verifying data – Automatic Pipeline
We want hard problems, but we don’t want unsolvable ones
Student solution statistics – almost a cheat-code!
Autovalidate where possible – we used our Math Engine
Formal characteristics (step count, content length, etc.) – see the filtering sketch after this slide
34
Auto Verification
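A minimal Python sketch of such automatic filtering (field names are hypothetical; the real pipeline relied on Gradarius student statistics and the Math Engine): drop problems that students attempted but never solved, and problems whose reference solutions look trivially short.

def keep_problem(p: dict) -> bool:
    # Hard is fine, unsolvable is not: require some successful student attempts.
    if p["attempts"] >= 20 and p["correct_attempts"] == 0:
        return False
    # Formal characteristics: reject near-trivial reference solutions.
    if p["solution_steps"] < 2 or len(p["solution_text"]) < 80:
        return False
    return True

problems = [
    {"attempts": 120, "correct_attempts": 15, "solution_steps": 6, "solution_text": "..." * 40},
    {"attempts": 45, "correct_attempts": 0, "solution_steps": 8, "solution_text": "..." * 40},
]
filtered = [p for p in problems if keep_problem(p)]  # keeps only the first problem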
Data Edit Pipeline
The problem is unsolvable – why?
Errors in the data vs. genuinely hard problems
Salvaging content vs. throwing it out
It’s smart to check twice :)
35
Verification by Experts
Editing by Experts
Auto Verification
Result: Real-World University Mathematics Data
Our Unique Dataset Advantage:
Table 2: Average number of questions per problem and answers per question in U-MATH.
36
Math Subject | #Textual | #Visual | Avg. Questions | Avg. Answers |
Algebra | 150 | 30 | 1.93 | 1.28 |
Differential Calculus | 150 | 70 | 2.37 | 1.15 |
Integral Calculus | 150 | 58 | 1.09 | 1.01 |
Multivariable Calculus | 150 | 28 | 1.74 | 1.09 |
Precalculus | 150 | 10 | 1.51 | 1.23 |
Sequences and Series | 150 | 4 | 1.36 | 1.00 |
All | 900 | 200 | 1.66 | 1.12 |
Konstantin Chernyshev
Research Scientist, Toloka
Math Skills Evaluation
37
Evals: Why Do We Need Specialized Math Evals?
Math is important in general (discussed before) → need to evaluate
Math tasks demand exactness → reveals reasoning flaws quickly
True pass/fail check – correct vs. incorrect (though some problems require high-level judging)
General “language benchmarks” can mask logic gaps
Tracks real progress → scoreboard for true improvement vs. marketing claims
38
Evals: Benchmark Types
Multiple‑Choice (fast scoring, risk of guessing)
Open‑Ended (step‑by‑step solutions, more realistic but scoring is harder)
Formal Theorem Proving (Lean, Coq – absolute rigor, no partial credit, harder to set up)
Also:
Visual/Multimodal Input (diagrams, charts)
Tables/Schemes Input (harder for LLMs to follow)
Multiple Questions and Partial Credit (real-world setups often require multiple answers)
Each type probes different aspects of math reasoning.
39
Evals: Principles of a Good Math Benchmark (1/2)
Leakage: Minimizing public data exposure (unpublished or lesser-known problems)
Robustness: Variation in wording; no single “trick” solves all
Difficulty Range: Easy (a bit) → advanced tasks to differentiate skill levels
Avoid Oversimplified Formats: Pure multiple-choice can obscure real reasoning
40
Evals: Principles of a Good Math Benchmark (2/2)
Accurate Evaluation: Tools or partial credit? Symbolic equivalence? Judges? (see the sketch after this slide)
Sensitivity: Distinguish small improvements; consistent scoring
Diversity & Realism: Reflect practical or course-based knowledge
Meta-Evaluation: Evaluate the “judge” itself (an LLM or automated grader can be biased)
41
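For the “symbolic equivalence” question above, a minimal sketch using SymPy (our assumption; the deck does not name the tooling): two final answers are graded as equal when their difference simplifies to zero.

import sympy as sp

def answers_equivalent(predicted: str, reference: str) -> bool:
    # Best-effort symbolic check; falls back to a strict string match if parsing fails.
    try:
        return sp.simplify(sp.sympify(predicted) - sp.sympify(reference)) == 0
    except (sp.SympifyError, TypeError):
        return predicted.strip() == reference.strip()

# "2*sin(x)*cos(x)" and "sin(2*x)" should count as the same answer.
print(answers_equivalent("2*sin(x)*cos(x)", "sin(2*x)"))  # True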
Evals: Existing Benchmarks – Quick Survey
GSM8K / MATH and similar: K–12 to competition levels, but many problems are “leaked” online
MMLU and similar: Multiple-choice, broad coverage including math categories
Competition sets: AIME, IMO → extremely challenging but small samples, specific / not applied
“Impossible” problems: like Humanity’s Last Exam – hard, but not focused on math and reasoning
42
Evals: Introducing U-MATH
~1,100 university‑level problems, across 6 undergrad subjects, 20% with visuals
Open‑ended → multi‑step solutions with free-fown answer
Drawn from real course materials (not just contests)
Not widely on the web, so less leaked data.
Average number of questions per problem and answers per question in U-MATH.
43
Math Subject | #Textual | #Visual | Avg. Questions | Avg. Answers |
Algebra | 150 | 30 | 1.93 | 1.28 |
Differential Calculus | 150 | 70 | 2.37 | 1.15 |
Integral Calculus | 150 | 58 | 1.09 | 1.01 |
Multivariable Calculus | 150 | 28 | 1.74 | 1.09 |
Precalculus | 150 | 10 | 1.51 | 1.23 |
Sequences and Series | 150 | 4 | 1.36 | 1.00 |
All | 900 | 200 | 1.66 | 1.12 |
44
Evals: U-MATH Results – Key Insights
Top accuracy rises from ~60% to ~70% when using “reasoning” models
Visual tasks remain much harder (~55%)
Open-source vs. proprietary: only a slight gap, larger on visual tasks
Visual encoders reduce model quality
Continuous fine-tuning and math specialization help significantly
Some subjects are under-represented in training data
45
Evals: µ-MATH – Evaluating the Evaluators
A meta‑evaluation subset: checks if an LLM can judge correct vs. incorrect solutions
Task: “Judge whether a proposed solution is correct” → tests LLM’s evaluation skill
A subset of 25% of U-MATH problems; 4 solutions generated for each problem by Qwen, Gemini, GPT, and LLaMA
Can be used to check judge biases toward particular model families – additional metrics
46
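A small sketch of how such a meta-evaluation score can be computed (scikit-learn is our choice here, not necessarily the one used): judge verdicts are compared against golden correct/incorrect labels with macro-F1.

from sklearn.metrics import f1_score

# Golden labels: is each candidate solution actually correct (1) or not (0)?
golden = [1, 0, 1, 1, 0, 0, 1, 0]
# Verdicts the LLM judge produced for the same candidate solutions.
judge  = [1, 0, 0, 1, 0, 1, 1, 0]

print(f1_score(golden, judge, average="macro"))  # macro-F1 over both classes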
Evals: µ-MATH Results – Key Insights
“Non-reasoning” models do not yield a sufficient judge score (~81%);
Even “reasoning” models are not perfect judges (~90%), yet they are much slower;
Judges are biased by model family: towards LLaMA solutions, against Qwen;
The bias is sensitive to the prompt – changing the prompting changes ratings non-homogeneously.
47
Model | Is Open | #Params (B) | μ-MATH (F1) | GPT-4o solutions (F1) | Gemini-1.5-Pro solutions (F1) | Llama-3.1-70B solutions (F1) | Qwen2.5-72B solutions (F1) |
“Reasoning” models | |||||||
o1 | – | ? | 89.5 | 88.4 | 90.6 | 93.2 | 83.8 |
o1-mini | – | ? | 84.8 | 81.2 | 86.2 | 89.7 | 79.5 |
DeepSeek-R1 | + | 684 | 82.2 | 79.7 | 82.0 | 88.2 | 76.8 |
Gemini-2.0-Flash-Thinking | – | ? | 81.0 | 74.3 | 85.8 | 83.3 | 76.0 |
“Large” models | |||||||
Gemini-1.5-pro | – | ? | 80.7 | 78.2 | 79.5 | 83.6 | 77.7 |
GPT-4o-2024-08-06 | – | ? | 77.4 | 77.5 | 72.6 | 81.8 | 74.2 |
Qwen2.5-72B | + | 73 | 75.6 | 73.7 | 74.2 | 79.3 | 70.5 |
claude-sonnet-3-5 | – | ? | 74.8 | 72.2 | 73.8 | 77.9 | 70.8 |
Qwen2.5-Math-72B | + | 73 | 74.0 | 68.2 | 76.8 | 77.3 | 69.3 |
Llama-3.1-70B | + | 71 | 61.0 | 69.4 | 58.8 | 57.0 | 56.0 |
“Small” models | |||||||
Gemini-1.5-Flash | – | ? | 74.8 | 70.1 | 73.9 | 80.6 | 71.2 |
GPT-4o-mini-2024-07-18 | – | ? | 72.3 | 70.4 | 69.6 | 76.2 | 69.3 |
Qwen2.5-7B | + | 8 | 69.3 | 68.3 | 69.1 | 72.3 | 62.4 |
Qwen2.5-Math-7B | + | 8 | 61.9 | 57.2 | 63.8 | 63.8 | 59.7 |
Vitaliy Polshkov
Research Engineer, Toloka
Evals case-study
48
Applied evals
Compile a diverse set of measurements: different benchmarks, benchmark slices, metrics, behavioral characteristics, …
Characterize and understand each of them: its purpose, the nature of the underlying data, how you gain or lose score, the evaluation methodology, the error margins, …
Gather datapoints: evaluate different models, vary the model version / size, contrast the results across measurements, seek correlations, study patterns, …
Make inferences: draw interpretations, test conjectures, …
A benchmark is only as good as your use of it
53
Hands-on showcase: OpenAI o1 vs DeepSeek R1
The DeepSeek report on popular benchmarks suggests the models are performing equally well, with R1 having a slight edge in math and coding
Since our results on U-MATH & µ-MATH contradict this, we were intrigued. Let’s zoom in on those two models
54
Technique 1: Assembling & characterizing evals
Bringing benchmarks together allows us to infer characterizations by contrasting them, e.g. U-MATH stands out as a niche math subdomain
We can repeat this in a similar vein, in this case reaffirming the finding with coding
We can also “reverse” this by finding evals from an intended characterization, e.g. unusual skills / domains like µ-MATH
55
All of this suggests weaker generalization, not approximate parity with some performance accents
Technique 2: Testing conjectures
Weaker generalization → performance drops on task updates?
56
Technique 3: Studying trends
57
Instead of extending the scope with other benchmarks, we can do it with more models
Meaning that investigating this tradeoff allows us to analyze generalization beyond o1 and R1
(plot axes: “Solving” skill vs. “Judging” skill)
Technique 4: Inspecting & quantifying behavior
Manual model response inspections highlight difficulties in balancing domain-specific reasoning with coherence and instruction-following
(e.g. Qwen-Math excels at problem-solving at the expense of structure, Claude Sonnet exhibits the opposite)
This yields two distinctive judgment styles:
Conservative: precise and structured, but anchored on the exact model answer / golden label form – fewer false positives, more false negatives
Lenient: verbose and inclined toward complex lengthy derivations, but prone to lose track – fewer false negatives, more false positives
A typical conservative judgment example:
58
Technique 5: Decomposing performance
We can depict the behavioral differences by decomposing the F1-score into TPR and TNR
Higher TPR means fewer false negatives, higher TNR means fewer false positives
All the same trends are observed:
o1 is also one of the more balanced models
Imbalances are associated with entire model families
A more balanced training mixture makes for a more balanced model
Losing capability, on the contrary, exacerbates the bias
59
(plot axes: “Conservatism” vs. “Leniency”)
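A minimal sketch of the decomposition (toy labels of ours): TPR is recall on correct solutions, TNR is recall on incorrect ones; high TPR with low TNR reads as a “lenient” judge, the opposite as a “conservative” one.

def tpr_tnr(golden, verdicts):
    # Confusion counts of the judge's verdicts against golden correctness labels.
    tp = sum(g == 1 and v == 1 for g, v in zip(golden, verdicts))
    fn = sum(g == 1 and v == 0 for g, v in zip(golden, verdicts))
    tn = sum(g == 0 and v == 0 for g, v in zip(golden, verdicts))
    fp = sum(g == 0 and v == 1 for g, v in zip(golden, verdicts))
    return tp / (tp + fn), tn / (tn + fp)

# A lenient judge: accepts nearly everything, so TPR is high and TNR is low.
print(tpr_tnr([1, 1, 1, 0, 0, 0], [1, 1, 1, 1, 1, 0]))  # (1.0, 0.33...)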
Takeaways
Not all benchmarks are made equal
Quantified results are only a means to an end – aim for qualitative findings
60
Vitaliy Polshkov
Research Engineer, Toloka
Math Skills Enhancement
61
Delivering performance surplus
We explore the problem of improving upon an already existing model by exploiting a standalone high-quality dataset
In particular, our objective is to enhance problem-solving in top-performing math specialists by utilizing our novel data focused on university-level problems
Obstacles with this setup:
Approach: maximize new data utilization while minimizing interference, then combine with the initial instruct-tune
62
(Diagrams: plotting “general capability” vs. “specific skill” for BASE MODEL, INSTRUCT, and OUR MODEL – first via SFT then rSFT; second via SFT and a “virtual” SFT, combined into a MERGED MODEL)
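As a rough illustration of the merging idea in the second diagram (our simplification; the exact recipe is not specified here), weight-space merging of two checkpoints with the same architecture can be as simple as interpolating their parameters:

import torch

def merge_checkpoints(sd_a, sd_b, alpha=0.5):
    # Linear interpolation of parameters; both state dicts must share keys and shapes.
    return {k: (1 - alpha) * sd_a[k] + alpha * sd_b[k] for k in sd_a}

# Toy usage with two "checkpoints" holding a single weight tensor each.
sd_instruct = {"layer.weight": torch.ones(2, 2)}
sd_ours = {"layer.weight": torch.zeros(2, 2)}
merged = merge_checkpoints(sd_instruct, sd_ours, alpha=0.5)  # every entry equals 0.5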
Reasoning-oriented data preparation
We have a dataset of mathematical problems, supplied with correct answers and reference solutions
1) Enrich reference solutions with more intermediate steps and comprehensive explanations (this also helps to unify the style)
2) Expand the dataset via “rejection sampling” (see the sketch after this slide):
2.1) Generate novel solutions for each problem from scratch
2.2) Keep only the ones arriving at the correct answer
2.3) Deduplicate the generated solutions to ensure diversity of approaches
2.4) Ensure each problem gets an equal number of new solutions (otherwise the problem distribution becomes skewed)
3) Combine the two sets
63
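A compact Python sketch of the rejection-sampling expansion above (the helpers generate_solutions, extract_final_answer, and answers_match are hypothetical stand-ins for model sampling, answer parsing, and answer checking):

def expand_dataset(problems, generate_solutions, extract_final_answer, answers_match,
                   per_problem=4, samples=32):
    expanded = []
    for p in problems:
        candidates = generate_solutions(p["question"], n=samples)           # 2.1: sample from scratch
        correct = [s for s in candidates
                   if answers_match(extract_final_answer(s), p["answer"])]  # 2.2: keep correct ones
        unique = list(dict.fromkeys(correct))                               # 2.3: deduplicate
        expanded += [{"question": p["question"], "solution": s}
                     for s in unique[:per_problem]]                         # 2.4: equal count per problem
    return expanded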
Tools & tricks
Knowledge retention techniques:
Data Utilization techniques:
64
Model | Model size | GSM8k | MATH | MATH lvl-5 | MathOdyssey | U-MATH Text |
GPT-4o | - | 0.950 | 0.758 | 0.550 | 0.481 | 0.462 |
Our fine-tuned Mathstral | 7b | 0.859 | 0.583 | 0.319 | 0.382 | 0.293 |
Mathstral | 7b | 0.832 | 0.486 | 0.224 | 0.336 | 0.189 |
Our fine-tuned DeepSeek-Math | 7b | 0.843 | 0.543 | 0.262 | 0.372 | 0.251 |
Numina fine-tuned DeepSeek-Math | 7b | 0.782 | 0.512 | 0.232 | 0.341 | 0.232 |
DeepSeek-Math | 7b | 0.803 | 0.427 | 0.168 | 0.305 | 0.192 |
Vlad Stepanov
CEO, �Gradarius
Real-world Implications
65
What do we make of the results?
Can we put AI “behind the driver’s wheel”?
Even the best models struggle with university mathematics:
67
No(t yet)
Use of AI in Math Education
Educational Implications:
AI is not suitable as a self-sufficient educational tool in Math
Current optimal role: AI as a translator between strict deterministic math engines and natural human understanding (see the routing sketch after this slide)
68
(Diagram: User Input – a Problem to Solve or a Solution to Judge – goes to the Router; the LLM handles intent, context, …; the Math Engine handles the math substance; together they produce the Output)
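A minimal sketch of this routing pattern (all interfaces here – router.parse, math_engine.solve, llm.explain – are hypothetical): the deterministic engine handles the math substance, while the LLM handles intent, context, and the explanation.

def answer(user_input, router, math_engine, llm):
    intent, math_task = router.parse(user_input)   # problem to solve or solution to judge
    result = math_engine.solve(math_task)          # exact, deterministic math substance
    return llm.explain(intent=intent, task=math_task, result=result)  # natural-language output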
Practical Application Examples
Practical Applications at Gradarius:
SmartCorrect: Contextualizing "almost right" answers
Feedback enhancement: Enriching explanations while ensuring mathematical precision
Content authoring: Supporting problem creation with human verification
69
Alexei Miasnikov
Distinguished Research Professor,
Stevens Institute of Technology, Hoboken, NJ
Real-world Implications: Stevens’ Perspective
70
Big Divide in Mass Math Education
The math community is divided into two large groups:
The first one believes that ChatGPT (LLMs in general) is the solution to most challenges (if it is not ready now, then it will be in a few years). The primary reason: there are only finitely many problems in math teaching, so ChatGPT can learn them all.
The second one, however, believes that ChatGPT is completely useless and even detrimental to math education.
71
Why isn’t AI universally liked?
It makes mistakes
The answer is correct, but the logic is either incorrect or unclear
Everything seems right, but the explanation is too long; if you ask again, it gives different reasoning
And there is something else, something important, missing
72
Zero-tolerance to mistakes in math
Due to the inherent probabilistic nature of AI, it is prone to making mathematical errors.
Suppose in the future an LLM solves 99% of math problems correctly.
73
Zero-tolerance to wrong logic
Learning solution steps without reasoning (logic) is just memorizing
Understanding a solution is understanding its logic
Wrong or confusing explanations destroy understanding
74
Are LLMs/AI + Math Solver a solution?
AI without a Math Solver is very detrimental to math learning.
AI with a Math Solver is, at best, useless.
Results of some experiments:
There is something else missing.
75
AI + Math Solver + Math Tutor
To solve a problem, a student needs to produce a sequence of logical steps.
The learning occurs when students “invent” some steps on their own.
It can be hard and frustrating.
How to help at this very instant is the Moment of Truth.
We recently created the “Lab for AI in Math and Math Education” at Stevens Institute to address all these issues.
76
To see the progress of AI, we need benchmarks that show:
progress in making correct answers
quality of explanations
77
Agenda
Mathematical reasoning for LLMs: Key use cases and areas where math reasoning fundamentally enhances language model capabilities.
Auto-formalization in math: AlphaProof, Lean, and the applicability of automated verifiers for LLM training and evaluation.
Expert-curated data collection: Best practices for sourcing high-quality datasets to assess and enhance model performance, including for university-level mathematical reasoning.
Designing effective benchmarks: Developing robust evaluation metrics, assessing model performance across different domains, and extracting actionable insights.
Boosting LLM performance with new data: Optimally leveraging a custom university-level training dataset to enhance the mathematical proficiency of top-performing LLMs.
Practical applications: Applying the results in academic, educational, and product contexts.
78
Thank You!
Q&A time
79