1 of 79

Evaluating Mathematical Reasoning in LLMs

26 February 2025

Best practices for benchmarking and fine-tuning

2 of 79

Agenda

Mathematical reasoning for LLMs: Key use cases and areas where math reasoning fundamentally enhances language model capabilities.

Auto-formalization in math: AlphaProof, Lean, and the applicability of automated verifiers for LLM training and evaluation.

Expert-curated data collection: Best practices for sourcing high-quality datasets to assess and enhance model performance, including for university-level mathematical reasoning.

Designing effective benchmarks: Developing robust evaluation metrics, assessing model performance across different domains, and extracting actionable insights.

Boosting LLM performance with new data: Optimally leveraging a custom university-level training dataset to enhance the mathematical proficiency of top-performing LLMs.

Practical applications: Applying the results in academic, educational, and product contexts.

2

3 of 79

Team

3

Iuliya Beloshapka

Senior Research Engineer, Google DeepMind

Alexei Miasnikov

Research Professor of Mathematics, Stevens Institute of Technology

Vlad Stepanov

CEO, Gradarius

Konstantin Chernyshev

Machine Learning Researcher, Toloka

Vitaliy Polshkov

Senior Research Engineer, Toloka

4 of 79

Konstantin Chernyshev

Research Scientist, Toloka

Overview

4

5 of 79

Introduction: What is Mathematical Reasoning in LLMs?

5

1: Fine-Tuning Enhances Existing Mechanisms – https://arxiv.org/pdf/2402.14811

6 of 79

Introduction: What is Mathematical Reasoning in LLMs?

6

1: Fine-Tuning Enhances Existing Mechanisms – https://arxiv.org/pdf/2402.14811

Beyond memorized facts:

multi-step logic & symbolic manipulation

7 of 79

Introduction: What is Mathematical Reasoning in LLMs?

7

1: Fine-Tuning Enhances Existing Mechanisms – https://arxiv.org/pdf/2402.14811

Beyond memorized facts:

multi-step logic & symbolic manipulation

Differentiates depth:

surface‑level “pattern matching” vs. actual reasoning

8 of 79

Introduction: What is Mathematical Reasoning in LLMs?

8

1: Fine-Tuning Enhances Existing Mechanisms – https://arxiv.org/pdf/2402.14811

Beyond memorized facts:

multi-step logic & symbolic manipulation

Proxy for capabilities:

Core marker of advanced cognitive capabilities in AI1

Differentiates depth:

surface‑level “pattern matching” vs. actual reasoning

9 of 79

Introduction: What is Mathematical Reasoning in LLMs?

9

1: Fine-Tuning Enhances Existing Mechanisms – https://arxiv.org/pdf/2402.14811

Beyond memorized facts:

multi-step logic & symbolic manipulation

Proxy for capabilities:

Core marker of advanced cognitive capabilities in AI1

Differentiates depth:

surface‑level “pattern matching” vs. actual reasoning

Deep subject knowledge:

Tests not only calculations but also deep knowledge of methods

10 of 79

Introduction: Why Does It Matter?

Greater reliability: Math tasks expose errors / hallucinations more clearly

Deeper skills: Multi‑step proofs promote more robust logic

Stress-test: Reveals true reasoning ability

Proxy for “intelligence”: Math success correlates with complex reasoning

10

11 of 79

Introduction: Key Applications

Education Tutoring

Step‑by‑step solutions, personalized feedback, solution explanations

Coding & Algorithms

Complexity analysis, bug detection, correctness proofs, algorithm speedups

Scientific/Engineering

Interpreting formulas, data processing, simulations

Direct Applications

Direct usage in finance, reports, etc.

Math Research / Theorem Proving

Checking and generating proofs (AlphaProof, etc.), incl. Lean/Coq/etc.

Automated Agents

Assisting in research beyond pure math (e.g. data analysis)

11

12 of 79

Introduction: Core Challenges

General:

  • Multi‑step complexity: small generation errors lead to total failure;
  • Fragility and Sensitivity: rewording or adversarial inputs can break solutions1;
  • LLM Design: Transformers are not built for calculations2;
  • Other question types: lack of “going beyond the solution” to verifying / scoring3;

12

1: GSM-Symbolic – https://arxiv.org/pdf/2410.05229

2: Can Transformers Solve Everything? – https://towardsdatascience.com/can-transformers-solve-everything-0421ebaf9f8e/
3: Towards Effective Process Supervision in Mathematical Reasoning – https://qwenlm.github.io/blog/qwen2.5-math-prm/

13 of 79

Introduction: Core Challenges

General:

  • Multi‑step complexity: small generation errors lead to total failure;
  • Fragility and Sensitivity: rewording or adversarial inputs can break solutions1;
  • LLM Design: Transformers are not built for calculations2;
  • Other question types: lack of “going beyond the solution” to verifying / scoring3;

Training:

  • Advanced gap: LLMs often trained mainly on simpler, K‑12 math, losing out on higher-level subjects;
  • Generalization problems: performance drops when applied to specific domains;

13

1: GSM-Symbolic – https://arxiv.org/pdf/2410.05229

2: Can Transformers Solve Everything? – https://towardsdatascience.com/can-transformers-solve-everything-0421ebaf9f8e/
3: Towards Effective Process Supervision in Mathematical Reasoning – https://qwenlm.github.io/blog/qwen2.5-math-prm/

14 of 79

Introduction: Core Challenges

General:

  • Multi‑step complexity: small generation errors lead to total failure;
  • Fragility and Sensitivity: rewording or adversarial inputs can break solutions1;
  • LLM Design: Transformers are not built for calculations2;
  • Other question types: lack of “going beyond the solution” to verifying / scoring3;

Training:

  • Advanced gap: LLMs often trained mainly on simpler, K‑12 math, losing out on higher-level subjects;
  • Generalization problems: performance drops when applied to specific domains;

Eval:

  • “Practicality” Gap: frontier math / olympiads vs real-world applications
  • Benchmark leakage: common datasets already appear in training;

14

1: GSM-Symbolic – https://arxiv.org/pdf/2410.05229

2: Can Transformers Solve Everything? – https://towardsdatascience.com/can-transformers-solve-everything-0421ebaf9f8e/
3: Towards Effective Process Supervision in Mathematical Reasoning – https://qwenlm.github.io/blog/qwen2.5-math-prm/

15 of 79

Introduction: Core Challenges

Context: Models are given the solution and the correct answer with a CoT prompt to check whether they match (spoiler – they do).

  • Many models solve routine algebra but give up on harder tasks – “cutting corners”
  • Slightly rephrased question / prompt → drastically different answer
  • Real college‑level tasks need deeper conceptual leaps

15

16 of 79

Introduction: Current State of the Art

Performance dips on truly novel or visual university-level problems

Even top models reach only ~70–80% accuracy on advanced text tasks, and less on visual tasks

16

Model                  | U-MATH | MATH-500 | AIME 2025 | MathVista
o1                     | 70.7   | 96.4     | 79.0      | 73.9
o3-mini                | 68.9   | 97.9     | 80.0      | -
DeepSeek-R1            | 68.4   | 97.3     | 65.0      | -
QwQ-32B-Preview        | 61.5   | 90.6     | 37.0      | -
Grok-3 (Beta thinking) | -      | -        | 77.3      | -
Gemini-1.5-Pro         | 60.1   | -        | 27.0      | 52.1
Qwen2.5-Math-72B       | 50.2   | 85.6     | -         | -
GPT-4o-2024-08-06      | 43.5   | 76.6     | 13.0      | 63.8
Qwen2.5-72B            | 41.0   | 82.6     | -         | -

17 of 79

Iuliya Beloshapka

Senior Research Engineer,

Google DeepMind

Auto-formalization in math

17

18 of 79

Lean

Lean is an interactive theorem prover that has become popular with mathematicians.

Lean is built on type theory. Any mathematical statement can be formalized in Lean, and any proof can be verified automatically.

Lean is also a programming language, and Lean proofs look like lines of code.

This works because there’s a correspondence between checking logical steps and type-checking.

Lean also allows users to write custom tactics for proof construction.
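For instance, here is what a tiny Lean proof looks like (a minimal sketch assuming Lean 4 with Mathlib available; the statement is a type, and the tactic proof is checked by Lean's kernel):

```lean
import Mathlib.Tactic  -- assumption: Mathlib is available, providing the `ring` tactic

-- The statement is a type; the proof built by the tactic block is type-checked.
theorem my_add_sq (a b : ℕ) : (a + b) ^ 2 = a ^ 2 + 2 * a * b + b ^ 2 := by
  ring  -- closes goals that are identities in a commutative (semi)ring
```

After each tactic, the interactive goal view shows what remains to be proved; here `ring` closes the goal in one step.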

18

19 of 79

Math in Lean

The type checker tells you if the proof is correct.

Lean is interactive: after every step, the compiler says what is left to prove.

19

20 of 79

Formal Approach

Formalisation of mathematics opens up incredible opportunities for mathematicians, enabling synergies at scale previously unimaginable:

Collaboration between arbitrarily large groups of mathematicians becomes possible (Terence Tao)

20

21 of 79

Formal vs Informal Maths

Formal mathematics: ~200k – verifiable, but a small amount of data

Informal mathematics: ~100m – a large amount of data, but not verifiable

21

22 of 79

Auto-formalization

If done properly, unlocks massive amounts of training data: all of the human-written maths (textbooks, arXiv, etc.)

Gives a perfect reward model for RL

Key Challenges:

Scarcity of formal data (mathlib is ~100k proofs)

No reliable automatic metrics to track progress

22

23 of 79

AlphaProof: IMO

Famous math competition for elite pre-college students

Notoriously difficult problems, designed for novelty

6 best young mathematicians selected from each country

Medal ratio is ~6:3:2:1 for nothing:bronze:silver:gold

Score on IMO 2024 problems

23

24 of 79

AlphaProof: P6 problem

Link to Rishi’s blog

24

25 of 79

Auto-formalization: AlphaProof

25

26 of 79

AlphaProof: solver network

26


27 of 79

AlphaProof training

Process infographic of AlphaProof's reinforcement learning training loop: Around one million informal math problems are translated into a formal language by a formalizer network. Then a solver network searches for proofs or disproofs of the problems, progressively training itself via the AlphaZero algorithm to solve more challenging problems.

27

28 of 79

Informal Approach

Natural-language verifiers (ORMs, PRMs, Best-of-N, versions of MCTS)

RL from verifiable feedback (e.g., DeepSeek's GRPO): needs final-answer data

No matter how advanced the formal approach becomes, we need harder benchmarks for evaluation, as existing ones are nearly saturated

Toloka U-MATH: A benchmark of hard, unleaked final-answer mathematical problems
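To make the verifier idea above concrete, a minimal Best-of-N sketch (the `generate` sampler and the `reward_model` ORM are hypothetical stand-ins, not a specific library API):

```python
from typing import Callable, List

def best_of_n(problem: str,
              generate: Callable[[str], str],             # hypothetical LLM sampler
              reward_model: Callable[[str, str], float],  # hypothetical ORM scoring (problem, solution)
              n: int = 16) -> str:
    """Sample n candidate solutions and return the one the reward model scores highest."""
    candidates: List[str] = [generate(problem) for _ in range(n)]
    return max(candidates, key=lambda solution: reward_model(problem, solution))
```

PRMs replace the single outcome score with per-step scores, and MCTS-style methods interleave generation and scoring.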

28

29 of 79

Vlad Stepanov

CEO, Gradarius

Data Collection

29

30 of 79

Who We Are

10+ Years in Math Education

Gradarius: Teaching university mathematics and prerequisites

Stevens Institute of Technology: Long-term academic partnership

Real-world experience with student learning challenges

* And who is the LLM, if not a huge digital student? :)

30

31 of 79

Expert-curated Data Collection

General data collection approach covering: Demonstrations, Preferences, Evaluation datasets, and more.

Human oversight: Dataset design and quality control by domain super-experts and architects.

Hybrid approach: An efficient combination of human signals and automated methods, including synthetic data and auto-checks.


31

[Pipeline diagram] Human oversight (Dataset design by Data Architects, Quality Assurance by Expert Team Leads, Instructions and Taxonomy, Quality control and Feedback) over the data flow: Generation by Experts / Synthetic data → Auto Verification → Verification by Experts → Editing by Experts.

32 of 79

Example: U-MATH dataset

The Questions behind our research => Dataset Design

What's the reality behind LLMs' convincing mathematical responses?

How should this inform educational strategy?

What are the implications for product development and research?

* Two parties: “LLMs are useless” vs “AI is here and everything turned 180”

32

33 of 79

University-level Math Problems as an Input

Real problems from university Calculus courses @ Gradarius

Topics selected by subject matter experts

Converted from our internal format to LaTeX-containing Markdown

33

Generation by Experts

34 of 79

Verifying data – Automatic Pipeline

We want hard problems, but we don’t want unsolvable ones

Student solution statistics

almost a cheat-code!

Autovalidate where possible

we used our Math Engine

Formal characteristics (steps #, content length, etc.)
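As an illustration of the autovalidation step, a rough sketch of a symbolic answer check with SymPy (our actual pipeline uses the Gradarius Math Engine; this is only a simplified stand-in):

```python
import sympy as sp

def answers_match(candidate: str, reference: str) -> bool:
    """Treat two answer strings as equivalent if their difference simplifies to zero."""
    try:
        diff = sp.simplify(sp.sympify(candidate) - sp.sympify(reference))
        return diff == 0
    except (sp.SympifyError, TypeError):
        return False  # unparsable or non-symbolic answers go to expert review instead

# e.g. answers_match("sin(x)**2 + cos(x)**2", "1") -> True
```

Simplification can miss hard equivalences, so a negative result is a flag for review, not a final verdict.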

34

Auto Verification

35 of 79

Data Edit Pipeline

Problem is unsolvable – why?

errors in data vs hard problems

Salvaging content vs throwing out

It’s smart to check twice :)

35

Verification by Experts

Editing by Experts

Auto Verification

36 of 79

Result: Real-World University Mathematics Data

Our Unique Dataset Advantage:

  • Thousands of real students using Gradarius annually
  • Hence, we have data on human solvability and relevance
  • Step-by-step reasoning pathways (not just final answers)
  • Expert solutions alongside student attempts

Table 2: Average number of questions per problem and answers per question in U-MATH.

36

Math Subject           | #Textual | #Visual | Avg. Questions | Avg. Answers
Algebra                | 150      | 30      | 1.93           | 1.28
Differential Calculus  | 150      | 70      | 2.37           | 1.15
Integral Calculus      | 150      | 58      | 1.09           | 1.01
Multivariable Calculus | 150      | 28      | 1.74           | 1.09
Precalculus            | 150      | 10      | 1.51           | 1.23
Sequences and Series   | 150      | 4       | 1.36           | 1.00
All                    | 900      | 200     | 1.66           | 1.12

37 of 79

Konstantin Chernyshev

Research Scientist, Toloka

Math Skills Evaluation

37

38 of 79

Evals: Why we need Specialized Math Evals?

Math is important in general (discussed before), so we need to evaluate it

Math tasks demand exactness, which reveals reasoning flaws quickly

True pass/fail check – correct vs. incorrect (with some difficulties for high-level judging)

General “language benchmarks” can mask logic gaps

Tracks real progress – a scoreboard for true improvement vs. marketing claims

38

39 of 79

Evals: Benchmark Types

Multiple‑Choice (fast scoring, risk of guessing)

Open‑Ended (step‑by‑step solutions, more realistic but scoring is harder)

Formal Theorem Proving (Lean, Coq – absolute rigor, no partial credit, harder to setup)

Also:

Visual/Multimodal Input (diagrams, charts)

Tables/Schemes Input (harder for LLMs to follow)

Multiple Questions and Partial Credit (real-world setups often require multiple answers; a scoring sketch follows below)

Each type probes different aspects of math reasoning.
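For the multi-question setup, one simple partial-credit scheme is the fraction of sub-answers judged correct (a minimal sketch; `is_correct` stands in for whatever checker the benchmark uses, symbolic or LLM-judge):

```python
from typing import Callable, Sequence

def partial_credit(predictions: Sequence[str],
                   references: Sequence[str],
                   is_correct: Callable[[str, str], bool]) -> float:
    """Score a multi-question problem as the fraction of sub-answers judged correct."""
    assert len(predictions) == len(references), "one prediction per reference answer"
    if not references:
        return 0.0
    hits = sum(is_correct(pred, ref) for pred, ref in zip(predictions, references))
    return hits / len(references)
```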

39

40 of 79

Evals: Principles of a Good Math Benchmark (1/2)

Leakage: Minimizing public data exposure (unpublished or lesser‑known problems)

Robustness: Variation in wording; no single “trick” solves all

Difficulty Range: Easy (a bit) → advanced tasks to differentiate skill levels

Avoid Oversimplified Formats: Pure multiple‑choice can obscure real reasoning

40

41 of 79

Evals: Principles of a Good Math Benchmark (2/2)

Accurate Evaluation: Tools or partial credit? Symbolic equivalence? Judges?

Sensitivity: Distinguish small improvements; consistent scoring

Diversity & Realism: Reflect practical or course‑based knowledge

Meta‑Evaluation: Evaluate the “judge” itself (an LLM or automated grader can be biased)

41

42 of 79

Evals: Existing Benchmarks – Quick Survey

GSM8K / MATH and similar: K–12 to competition levels, but many problems are “leaked” online

MMLU and similar: Multiple‑choice, broad coverage including math categories

Competition sets: AIME, IMO → extremely challenging but small samples, narrow / not applied

“Impossible” problems: like Humanity's Last Exam – hard, but not focused on math and reasoning

42

43 of 79

Evals: Introducing U-MATH

~1,100 university‑level problems, across 6 undergrad subjects, 20% with visuals

Open‑ended multi‑step solutions with free-form answers

Drawn from real course materials (not just contests)

Not widely available on the web, so less data leakage.

Average number of questions per problem and answers per question in U-MATH.

43

Math Subject           | #Textual | #Visual | Avg. Questions | Avg. Answers
Algebra                | 150      | 30      | 1.93           | 1.28
Differential Calculus  | 150      | 70      | 2.37           | 1.15
Integral Calculus      | 150      | 58      | 1.09           | 1.01
Multivariable Calculus | 150      | 28      | 1.74           | 1.09
Precalculus            | 150      | 10      | 1.51           | 1.23
Sequences and Series   | 150      | 4       | 1.36           | 1.00
All                    | 900      | 200     | 1.66           | 1.12

44 of 79

44

45 of 79

Evals: U-MATH Results – Key Insights

~60% top accuracy, rising to ~70% when using “reasoning” models

Visual tasks remain much harder (~55%)

Open-source vs proprietary: only a slight gap, larger on visual tasks

Visual encoders reduce model quality.

Continuous fine-tuning and math specialization help significantly

Some subjects are under-represented in training data

45

46 of 79

Evals: µ-MATH – Evaluating the Evaluators

A meta‑evaluation subset: checks if an LLM can judge correct vs. incorrect solutions

Task: “judge whether a proposed solution is correct” – tests the LLM’s evaluation skill

Subset: 25% of U-MATH problems, with 4 solutions generated for each problem by Qwen, Gemini, GPT, and LLaMA

Can be used to check a judge’s biases toward particular model families – additional metrics
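Scoring a judge then amounts to comparing its binary verdicts against expert gold labels; a minimal sketch of a macro-averaged F1 over the two classes (assuming verdicts and labels are plain booleans):

```python
from typing import Sequence

def judge_macro_f1(verdicts: Sequence[bool], gold: Sequence[bool]) -> float:
    """Macro-average the F1 of the 'correct' and 'incorrect' classes for a solution judge."""
    def f1_for(cls: bool) -> float:
        tp = sum(v == cls and g == cls for v, g in zip(verdicts, gold))
        fp = sum(v == cls and g != cls for v, g in zip(verdicts, gold))
        fn = sum(v != cls and g == cls for v, g in zip(verdicts, gold))
        return 0.0 if 2 * tp + fp + fn == 0 else 2 * tp / (2 * tp + fp + fn)
    return (f1_for(True) + f1_for(False)) / 2
```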

46

47 of 79

Evals: µ-MATH Results – Key Insights

“Non-reasoning” models yield insufficient judge scores (~81% at best);

Even “reasoning” models are not perfect judges (~90%), and they are much slower;

Biases toward model families: in favor of LLaMA solutions, against Qwen;

Biases are prompt-sensitive – changing the prompt changes ratings non-homogeneously.

47

Model                     | Is Open | #Params (B) | μ-MATH (F1) | GPT-4o (F1) | Gemini-1.5-Pro (F1) | Llama-3.1-70B (F1) | Qwen2.5-72B (F1)
(the last four columns are F1 scores on solutions generated by the named model)

“Reasoning” models
o1                        | ?       | -           | 89.5        | 88.4        | 90.6                | 93.2               | 83.8
o1-mini                   | ?       | -           | 84.8        | 81.2        | 86.2                | 89.7               | 79.5
DeepSeek-R1               | +       | 684         | 82.2        | 79.7        | 82.0                | 88.2               | 76.8
Gemini-2.0-Flash-Thinking | ?       | -           | 81.0        | 74.3        | 85.8                | 83.3               | 76.0

“Large” models
Gemini-1.5-pro            | ?       | -           | 80.7        | 78.2        | 79.5                | 83.6               | 77.7
GPT-4o-2024-08-06         | ?       | -           | 77.4        | 77.5        | 72.6                | 81.8               | 74.2
Qwen2.5-72B               | +       | 73          | 75.6        | 73.7        | 74.2                | 79.3               | 70.5
claude-sonnet-3-5         | ?       | -           | 74.8        | 72.2        | 73.8                | 77.9               | 70.8
Qwen2.5-Math-72B          | +       | 73          | 74.0        | 68.2        | 76.8                | 77.3               | 69.3
Llama-3.1-70B             | +       | 71          | 61.0        | 69.4        | 58.8                | 57.0               | 56.0

“Small” models
Gemini-1.5-Flash          | ?       | -           | 74.8        | 70.1        | 73.9                | 80.6               | 71.2
GPT-4o-mini-2024-07-18    | ?       | -           | 72.3        | 70.4        | 69.6                | 76.2               | 69.3
Qwen2.5-7B                | +       | 8           | 69.3        | 68.3        | 69.1                | 72.3               | 62.4
Qwen2.5-Math-7B           | +       | 8           | 61.9        | 57.2        | 63.8                | 63.8               | 59.7

48 of 79

Vitaliy Polshkov

Research Engineer, Toloka

Evals case-study

48

49 of 79

Applied evals

A benchmark is only as good as your use of it

49

50 of 79

Applied evals

Compile a diverse set of measurements: different benchmarks, benchmark slices, metrics, behavioral characteristics, …

A benchmark is only as good as your use of it

50

51 of 79

Applied evals

Compile a diverse set of measurements: different benchmarks, benchmark slices, metrics, behavioral characteristics, …

Characterize and understand each of them: its purpose, the nature of the underlying data, how you gain or lose score, the evaluation methodology, the error margins, …

A benchmark is only as good as your use of it

51

52 of 79

Applied evals

Compile a diverse set of measurements: different benchmarks, benchmark slices, metrics, behavioral characteristics, …

Characterize and understand each of them: its purpose, the nature of the underlying data, how you gain or lose score, the evaluation methodology, the error margins, …

Gather datapoints: evaluate different models, vary the model version / size, contrast the results across measurements, seek correlations, study patterns, …

A benchmark is only as good as your use of it

52

53 of 79

Applied evals

Compile a diverse set of measurements: different benchmarks, benchmark slices, metrics, behavioral characteristics, …

Characterize and understand each of them: its purpose, the nature of the underlying data, how you gain or lose score, the evaluation methodology, the error margins, …

Gather datapoints: evaluate different models, vary the model version / size, contrast the results across measurements, seek correlations, study patterns, …

Make inferences: draw interpretations, test conjectures, …

A benchmark is only as good as your use of it

53

54 of 79

Hands-on showcase: OpenAI o1 vs DeepSeek R1

The DeepSeek report on popular benchmarks suggests the models are performing equally well, with R1 having a slight edge in math and coding

Since our results on U-MATH & µ-MATH contradict this, we got intrigued. Let’s zoom in on those two

54

55 of 79

Technique 1: Assembling & characterizing evals

Bringing benchmarks together allows us to infer characterizations by contrasting them, e.g. U-MATH stands out as a niche math subdomain

We can repeat this in a similar vein, in this case reaffirming the finding with coding

We can also “reverse” this by finding evals from an intended characterization, e.g. unusual skills / domains like µ-MATH

55

All of this suggests lesser generalization, not approximate parity with some performance accents

56 of 79

Technique 2: Testing conjectures

Lesser generalization → performance drops on task updates?

56

57 of 79

Technique 3: Studying trends

  • A tradeoff emerges between problem-solving and judgment at some point
  • “Spiky” values indicate tradeoff extremities: �Qwen-Math is higher than expected, Sonnet is lower
  • Reasoners break out of the tradeoff, demonstrating an increase in generalization
  • The o1 model repeats this, proving superior not only to R1, but to all the recent reasoners

57

Instead of extending the scope with other benchmarks, we can do it with more models

Meaning that investigating this tradeoff allows us to analyze generalization beyond o1 and R1

[Plot: “solving” skill vs. “judging” skill across models]

58 of 79

Technique 4: Inspecting & quantifying behavior

Manual model response inspections highlight difficulties in balancing domain-specific reasoning with coherence and instruction-following

(e.g. Qwen-Math excels at problem-solving at the expense of structure, Claude Sonnet exhibits the opposite)

This yields two distinctive judgment styles:

Conservative: precise and structured, but anchored on the exact model answer / golden label form – fewer false positives, more false negatives

Lenient: verbose and inclined toward complex lengthy derivations, but prone to losing track – fewer false negatives, more false positives

A typical conservative judgment example:

58

59 of 79

Technique 5: Decomposing performance

We can depict the behavioral differences by decomposing the F1-score into TPR and TNR

Higher TPR means fewer false negatives, higher TNR means fewer false positives

All the same trends are observed:

  • tradeoff emerging at some point
  • similar extremities (e.g. Qwen, Claude)
  • reasoners pushing the frontier away
  • o1 pushing a step further yet

o1 is also one of the more balanced models

Imbalances are associated with entire model families

A more balanced training mixture makes for a more balanced model

Losing capability, on the contrary, exacerbates the bias
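The decomposition itself is straightforward (a minimal sketch; “positive” means the gold label says the solution is correct):

```python
from typing import Sequence, Tuple

def tpr_tnr(verdicts: Sequence[bool], gold: Sequence[bool]) -> Tuple[float, float]:
    """TPR: recall on correct solutions (high TPR = few false negatives, 'lenient').
       TNR: recall on incorrect solutions (high TNR = few false positives, 'conservative')."""
    tp = sum(v and g for v, g in zip(verdicts, gold))
    fn = sum((not v) and g for v, g in zip(verdicts, gold))
    tn = sum((not v) and (not g) for v, g in zip(verdicts, gold))
    fp = sum(v and (not g) for v, g in zip(verdicts, gold))
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    tnr = tn / (tn + fp) if (tn + fp) else 0.0
    return tpr, tnr
```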

59

[Plot axes: “conservatism” vs. “leniency”]

60 of 79

Takeaways

Not all benchmarks are made equal

  • Evals diversity is essential for comprehensive performance assessments
  • Task novelty is key to track real progress
  • Understanding the evaluations themselves is required to draw reliable conclusions based on them

Quantified results are only a means to an end – aim for qualitative findings

  • Expand your “analytical landscape”: assemble different benchmarks and models, identify meaningful subsets, decompose metrics, etc.
  • Interpret the numbers’ meaning, analyze the behaviors behind them
  • Study trends and tendencies — dynamics, correlations, clusters, etc. — to arrive at actionable insights
  • Make concrete tangible statements, then test them rigorously

60

61 of 79

Vitaliy Polshkov

Research Engineer, Toloka

Math Skills Enhancement

61

62 of 79

Delivering performance surplus

We explore the problem of improving upon an already existing model by exploiting a standalone high-quality dataset

In particular, our objective is to enhance problem-solving in top-performing math specialists by utilizing our novel data focused on university-level problems

Obstacles with this setup:

  • Training from a base model proves insufficient
  • Training from an instruct-tuned model leads to degradation (“catastrophic forgetting”)

Approach: maximize new data utilization while minimizing interference, then combine with the initial instruct-tune
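One simple way to realize the “combine” step is delta (task-vector) merging over state dicts, sketched below; this is a rough approximation of the idea rather than our exact recipe, and TIES-style merging (mentioned later) adds sign resolution and trimming on top:

```python
import torch

def merge_virtual_sft(base: dict, instruct: dict, sft: dict, alpha: float = 1.0) -> dict:
    """Graft the (SFT - base) task vector onto the instruct model's weights."""
    merged = {}
    for name, w_base in base.items():
        delta = sft[name] - w_base                     # what the fine-tune learned on top of base
        merged[name] = instruct[name] + alpha * delta  # add it to the instruct-tuned weights
    return merged

# usage (assuming three state_dicts with identical keys and shapes):
# merged_sd = merge_virtual_sft(base_model.state_dict(),
#                               instruct_model.state_dict(),
#                               our_sft_model.state_dict(), alpha=0.8)
```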

62

[Diagrams: general capability vs. specific skill. Left: BASE MODEL, INSTRUCT, SFT and rSFT positioned relative to OUR MODEL. Right: BASE MODEL, INSTRUCT, SFT, a “virtual” SFT and the resulting MERGED MODEL positioned relative to OUR MODEL.]

63 of 79

Reasoning-oriented data preparation

We have a dataset of mathematical problems, supplied with correct answers and reference solutions

1) Supply reference solutions with more intermediate steps and comprehensive explanations (this also helps to unify the style)

2) Expand the dataset via “rejection sampling” (sketched below):

2.1) Generate novel solutions for each problem from scratch

2.2) Leave only the ones arriving at the correct answer

2.3) Deduplicate the generated solutions to ensure diversity of approaches

2.4) Ensure each problem gets an equal number of new solutions (otherwise the problem distribution becomes skewed)

3) Combine the two sets
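A compressed sketch of steps 2.1–2.4 (the `sample_solution` generator and `extract_answer` parser are hypothetical stand-ins for the actual generation and answer-extraction code):

```python
import random
from typing import Callable, Dict, List

def rejection_sample(problems: Dict[str, str],               # problem statement -> correct final answer
                     sample_solution: Callable[[str], str],  # hypothetical solution generator
                     extract_answer: Callable[[str], str],   # hypothetical final-answer parser
                     n_samples: int = 32,
                     keep_per_problem: int = 4) -> Dict[str, List[str]]:
    kept: Dict[str, List[str]] = {}
    for problem, gold_answer in problems.items():
        solutions = [sample_solution(problem) for _ in range(n_samples)]         # 2.1 generate from scratch
        correct = [s for s in solutions if extract_answer(s) == gold_answer]     # 2.2 keep correct answers
        unique = list(dict.fromkeys(correct))                                    # 2.3 deduplicate
        kept[problem] = random.sample(unique, min(keep_per_problem, len(unique)))  # 2.4 equalize counts
    return kept
```

Problems that yield fewer correct solutions than keep_per_problem would need extra generation rounds (or get dropped) to keep the distribution balanced.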

63

64 of 79

Tools & tricks

Knowledge retention techniques:

  • Mix the obtained dataset with data similar to the original SFT mix
  • Supply each sample with info on its data source
  • Interleave “blocks” of data sources instead of shuffling
  • Sweep the warmup period of your lr-curve

Data Utilization techniques:

  • Arrange a training curriculum so that problem difficulty increases
  • Decrease the problem statement (“prompt”) loss (see the sketch below)
  • Average the checkpoints from runs with different seeds
  • Use fine-tuning-aware merging strategies (such as TIES)
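As an illustration of the prompt-loss item above, a minimal sketch of down-weighting prompt tokens in a standard causal-LM cross-entropy (assumes labels are already shifted to next-token targets):

```python
import torch
import torch.nn.functional as F

def weighted_lm_loss(logits: torch.Tensor,       # (batch, seq, vocab)
                     labels: torch.Tensor,       # (batch, seq) next-token targets
                     prompt_mask: torch.Tensor,  # (batch, seq), True where the token is part of the prompt
                     prompt_weight: float = 0.1) -> torch.Tensor:
    """Cross-entropy where prompt tokens contribute with a reduced weight."""
    per_token = F.cross_entropy(logits.transpose(1, 2), labels, reduction="none")  # (batch, seq)
    weights = 1.0 - prompt_mask.float() * (1.0 - prompt_weight)  # 1.0 on solution tokens, prompt_weight on prompt tokens
    return (per_token * weights).sum() / weights.sum()
```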

64

Model                           | Model size | GSM8k | MATH  | MATH lvl-5 | MathOdyssey | U-MATH Text
GPT-4o                          | -          | 0.950 | 0.758 | 0.550      | 0.481       | 0.462
Our fine-tuned Mathstral        | 7b         | 0.859 | 0.583 | 0.319      | 0.382       | 0.293
Mathstral                       | 7b         | 0.832 | 0.486 | 0.224      | 0.336       | 0.189
Our fine-tuned DeepSeek-Math    | 7b         | 0.843 | 0.543 | 0.262      | 0.372       | 0.251
Numina fine-tuned DeepSeek-Math | 7b         | 0.782 | 0.512 | 0.232      | 0.341       | 0.232
DeepSeek-Math                   | 7b         | 0.803 | 0.427 | 0.168      | 0.305       | 0.192

65 of 79

Vlad Stepanov

CEO, Gradarius

Real-world Implications

65

66 of 79

What do we make of the results?

Can we put AI “behind the driver’s wheel”?

Even the best models struggle with university mathematics:

  • Algebra: Highest accuracy (70-87%), but still not 100%
  • Differential Calculus: Moderate performance (40-62%)
  • Integral Calculus: Lowest accuracy (11-55%)

66

67 of 79

What do we make of the results?

Can we put AI “behind the driver’s wheel”?

Even the best models struggle with university mathematics:

  • Algebra: Highest accuracy (70-87%), but still not 100%
  • Differential Calculus: Moderate performance (40-62%)
  • Integral Calculus: Lowest accuracy (11-55%)

67

No(t yet)

68 of 79

Use of AI in Math Education

Educational Implications:

AI is not suitable as a self-sufficient educational tool in Math

Current optimal role: AI as a translator between strict, deterministic math engines and natural human understanding

68

[Diagram: User Input (Problem to Solve / Solution to Judge) → Router → LLM (intent, context, …) + Math Engine (math substance) → Output]

69 of 79

Practical Application Examples

Practical Applications at Gradarius:

SmartCorrect: Contextualizing "almost right" answers

Feedback enhancement: Enriching explanations while ensuring mathematical precision

Content authoring: Supporting problem creation with human verification

69

70 of 79

Alexei Miasnikov

Distinguished Research Professor,

Stevens Institute of Technology, Hoboken, NJ

Real-world Implications�Stevens’ Perspective

70

71 of 79

Big Divide in Mass Math Education

The math community is divided into two large groups:

The first one believes that ChatGPT (LLMs in general) is the solution to most challenges (if it is not ready now, then it will be in a few years). The primary reason: there are only finitely many problems in math teaching, so ChatGPT can learn them all.

The second one, however, believes that ChatGPT is completely useless and even detrimental to math education.


71

72 of 79

Why isn’t AI universally liked?

It makes mistakes

The answer is correct, but the logic is either incorrect or unclear

Everything seems right, but the explanation is too long; if you ask again, it gives different reasoning

And there is something else, something important, missing

72

73 of 79

Zero-tolerance to mistakes in math

Due to the inherent probabilistic nature of AI, it is prone to making mathematical errors.

Suppose in the future an LLM solves 99% of math problems correctly.

  • A class of 100 students, each doing 10 homework problems weekly.
  • 1% wrong solutions ≈ 10 angry students per week.
  • Look at the number of frustrated students in your class by the end of the semester!
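The arithmetic behind that estimate:

```latex
\underbrace{100}_{\text{students}} \times \underbrace{10}_{\text{problems/week}} \times \underbrace{0.01}_{\text{error rate}} = 10 \ \text{wrong solutions per week}
```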

73

74 of 79

Zero-tolerance to wrong logic

Learning solution steps without reasoning (logic) is just memorizing

Understanding a solution is understanding its logic

Wrong or confusing explanations destroy understanding

74

75 of 79

Are LLMs/AI + Math Solver a solution?

Results of some experiments:

AI without a Math Solver is very detrimental to math learning.

AI with a Math Solver is, at best, useless.

There is something else missing.

75

76 of 79

AI + Math Solver + Math Tutor

To solve a problem, a student needs to produce a sequence of logical steps.

The learning occurs when students “invent” some steps on their own.

It could be hard and frustrating.

How to help at this very instant is the Moment of Truth.

We recently created the “Lab for AI in Math and Math Education” at Stevens Institute to address all these issues.

76

77 of 79

To see the progress of AI

We need benchmarks to see:

progress in making correct answers

quality of explanations

77

78 of 79

Agenda

Mathematical reasoning for LLMs: Key use cases and areas where math reasoning fundamentally enhances language model capabilities.

Auto-formalization in math: AlphaProof, Lean, and the applicability of automated verifiers for LLM training and evaluation.

Expert-curated data collection: Best practices for sourcing high-quality datasets to assess and enhance model performance, including for university-level mathematical reasoning.

Designing effective benchmarks: Developing robust evaluation metrics, assessing model performance across different domains, and extracting actionable insights.

Boosting LLM performance with new data: Optimally leveraging a custom university-level training dataset to enhance the mathematical proficiency of top-performing LLMs.

Practical applications: Applying the results in academic, educational, and product contexts.

78

79 of 79

Thank You!

Q&A time

79