Formal mathematical reasoning remains a critical challenge for artificial intelligence, hindered by the limited scope and scale of existing benchmarks. To address this, we present FormalMATH, a large-scale Lean4 benchmark comprising 5,560 formally verified problems spanning high-school Olympiad challenges to undergraduate-level theorems across diverse domains (e.g., algebra, applied mathematics, calculus, number theory, and discrete mathematics). To mitigate the inefficiency of manual formalization, we introduce a novel human-in-the-loop autoformalization pipeline that integrates: (1) specialized large language models (LLMs) for statement autoformalization, (2) multi-LLM semantic verification, and (3) negation-based disproof filtering using off-the-shelf LLM-based provers. This approach reduces expert annotation costs by retaining 72.09% of statements before manual verification while ensuring fidelity to the original natural-language problems. Our evaluation of state-of-the-art LLM-based theorem provers reveals significant limitations: even the strongest models achieve only a 16.46% success rate under practical sampling budgets, exhibit pronounced domain bias (e.g., excelling in algebra but failing in calculus), and over-rely on simplified automation tactics. Notably, we identify a counterintuitive inverse relationship between natural-language solution guidance and proof success in chain-of-thought reasoning scenarios, suggesting that human-written informal reasoning introduces noise rather than clarity in formal reasoning settings. We believe FormalMATH provides a robust benchmark for evaluating formal mathematical reasoning.
| Model | Size | Date | Sampling Budget | Overall |
|---|---|---|---|---|
| Kimina-Prover | 7B | 2025-04-14 | 32 | 16.46 |
| STP | 7B | 2025-03-20 | 32 | 13.87 |
| Goedel-Prover-SFT | 7B | 2025-02-11 | 32 | 13.53 |
| DeepSeek-V1.5-RL | 7B | 2024-08-15 | 32 | 10.18 |
| DeepSeek-V1.5-SFT | 7B | 2024-08-15 | 32 | 8.97 |
| InternLM-Prover | 7B | 2024-10-21 | 1×32×100 | 11.13 |
| BFS-Prover | 7B | 2025-02-25 | 1×32×100 | 1.16 |

Overall results (success rate, %) of different models on FormalMATH-All.
Formal mathematical reasoning (FMR) provides a rigorous axiomatic framework essential for automated proof validation. However, due to its inherent complexity, even expert mathematicians find it challenging, with projects like the Liquid Tensor Experiment taking years to formalize. While recent advances in large language models (LLMs) have shown promise in FMR, existing benchmarks like MiniF2F and ProofNet remain limited in both scale (containing only 244 and 186 test problems respectively) and scope, making it difficult to comprehensively evaluate LLMs' formal mathematical capabilities.
To address these limitations, we introduce FormalMATH—a large-scale Lean4-based benchmark designed to advance formal mathematical reasoning. Our project offers three main contributions:
Large-scale, Multi-domain Benchmark: We present a benchmark of 5,560 formally verified mathematical statements spanning diverse domains (algebra, geometry, calculus, number theory, and discrete mathematics) and difficulty levels (from high-school olympiads to undergraduate theorems). This dataset is 22.8× larger than the widely-used MiniF2F benchmark.
Efficient Human-in-the-Loop Autoformalization Pipeline: We develop a novel framework integrating multi-LLM autoformalization, semantic verification, and negation-based disproof strategies, automatically retaining 72.09% of candidate statements before manual review and significantly reducing expert annotation costs.
Comprehensive Evaluation of LLM-based Theorem Provers: Our systematic evaluation reveals fundamental limitations in current systems: even the best models achieve only 16.46% success rate, showing significant domain bias and over-reliance on simplified tactics. Notably, we find that natural language solution guidance can paradoxically decrease proof success rates, suggesting important directions for future research.
FormalMATH covers a broad spectrum of mathematics from high school competition problems to undergraduate-level topics, comprising 5,560 rigorously validated statements. The benchmark includes over seven core disciplines at the high school level—such as algebra, number theory, discrete mathematics, and Euclidean geometry—and more than seven specialized domains at the undergraduate level, including calculus, linear algebra, abstract algebra, and series. Detailed domain distribution is shown in the figure below.
FormalMATH implements a human-in-the-loop pipeline for efficient dataset creation, addressing the scarcity of mature autoformalization tools. We leverage coding-specialized and theorem-proving LLMs (such as Qwen2.5-7B-Coder and DeepSeek-Prover-Base) to generate formal Lean4 statements through best-of-N sampling. The pipeline then applies multiple validation strategies. First, a compile-and-filter step discards candidates that fail Lean4 compilation and yields high-quality training examples for fine-tuning. Next, semantic alignment with the original natural-language problem is verified by powerful general-purpose LLMs (e.g., o1-mini, Claude-3.5-Sonnet) through chain-of-thought reasoning, filtering out 60.7% of semantically misaligned statements. We further integrate negation-based disproof strategies, attempting to disprove each candidate with off-the-shelf provers to flag unprovable statements. Finally, 12 International Mathematical Olympiad medalist-level experts conduct manual verification. This pipeline is highly efficient, retaining 72.1% of statements ahead of manual verification and significantly reducing expert effort.
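To make the stages concrete, the following is a minimal Python sketch of the filtering pipeline described above. It is only an illustration: the helpers `autoformalize`, `lean_compiles`, `check_semantics`, and `disprovable` are hypothetical placeholders for the LLM samplers, the Lean4 compiler, the chain-of-thought semantic judges, and the off-the-shelf provers used in the actual pipeline.

```python
# Minimal sketch of a FormalMATH-style human-in-the-loop filtering pipeline.
# All helpers below are hypothetical stand-ins for external services.
from dataclasses import dataclass, field

@dataclass
class Candidate:
    informal: str                          # original natural-language problem
    formal: str                            # Lean4 statement from an autoformalizer
    compiles: bool = False
    judge_votes: list = field(default_factory=list)

def autoformalize(problem: str, n_samples: int = 8) -> list[str]:
    """Best-of-N statement autoformalization (placeholder for the LLM call)."""
    return [f"theorem candidate_{i} : True := by trivial" for i in range(n_samples)]

def lean_compiles(statement: str) -> bool:
    """Placeholder for a Lean4 compile check."""
    return True

def check_semantics(problem: str, statement: str, judge: str) -> bool:
    """Placeholder for chain-of-thought semantic alignment judged by one LLM."""
    return True

def disprovable(statement: str) -> bool:
    """Negation-based filter: try to prove the negation of the statement with an
    off-the-shelf prover; success would mean the original statement is false."""
    return False

def filter_problem(problem: str, judges=("judge_a", "judge_b")) -> list[Candidate]:
    kept = []
    for formal in autoformalize(problem):
        cand = Candidate(informal=problem, formal=formal)
        # Stage 1: compile-and-filter.
        cand.compiles = lean_compiles(formal)
        if not cand.compiles:
            continue
        # Stage 2: multi-LLM semantic verification (all judges must agree).
        cand.judge_votes = [check_semantics(problem, formal, j) for j in judges]
        if not all(cand.judge_votes):
            continue
        # Stage 3: negation-based disproof filtering.
        if disprovable(formal):
            continue
        kept.append(cand)  # Stage 4: forwarded to human expert review.
    return kept

if __name__ == "__main__":
    queue = filter_problem("Show that the sum of two even integers is even.")
    print(f"{len(queue)} candidate statement(s) forwarded to expert review")
```

Only candidates that pass all three automated stages reach the expert review queue, which is what keeps the manual annotation cost low.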
We analyze and define common error patterns found in existing theorem provers, using advanced general-purpose LLMs (e.g., o4-mini) to automate error diagnosis and classification. The method first prompts the LLM to extract verbalized features from a failed proof, then assigns predefined error categories in a second prompt-driven step; a sketch of this two-step classification follows the table. We analyze 100 randomly selected failed proof attempts from each prover, and domain experts manually inspect and correct the resulting annotations. The table below reports the frequency of the most common error patterns and provides insight into the provers' failure modes.
| Error Pattern | DeepSeek-SFT | DeepSeek-RL | Goedel | STP |
|---|---|---|---|---|
| Redundant Hypothesis | 18.0% | 34.0% | 27.0% | 24.0% |
| Incomplete Proof | 77.0% | 62.0% | 86.0% | 44.0% |
| Inadequate Handling of Inequalities | 8.0% | 13.0% | 20.0% | 1.0% |
| Misuse of Automation Tactics | 62.0% | 65.0% | 78.0% | 74.0% |
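To make the two-step procedure concrete, here is a minimal Python sketch of the classification flow. It is an illustration under stated assumptions, not the actual FormalMATH tooling: `query_llm` is a hypothetical stand-in for the general-purpose LLM (e.g., o4-mini), and the category names mirror the table above; expert inspection and correction of the labels happens outside this sketch.

```python
# Sketch of the two-step error diagnosis: step 1 verbalizes why a proof failed,
# step 2 maps that description onto the predefined error categories.
# `query_llm` is a hypothetical placeholder for an LLM API call (e.g., o4-mini).

ERROR_CATEGORIES = [
    "Redundant Hypothesis",
    "Incomplete Proof",
    "Inadequate Handling of Inequalities",
    "Misuse of Automation Tactics",
]

def query_llm(prompt: str) -> str:
    """Placeholder for an LLM API call; returns a canned answer here."""
    return "Misuse of Automation Tactics"

def describe_failure(statement: str, proof: str, error_log: str) -> str:
    """Step 1: ask the LLM to verbalize what went wrong in the failed proof."""
    prompt = (
        "Describe, in a few sentences, why the following Lean4 proof fails.\n"
        f"Statement:\n{statement}\n\nProof attempt:\n{proof}\n\nLean error:\n{error_log}"
    )
    return query_llm(prompt)

def classify_failure(description: str) -> list[str]:
    """Step 2: map the verbalized description onto predefined categories."""
    prompt = (
        "Given this failure description, list every matching category from "
        f"{ERROR_CATEGORIES} (a single proof may exhibit several):\n{description}"
    )
    answer = query_llm(prompt)
    return [c for c in ERROR_CATEGORIES if c.lower() in answer.lower()]

if __name__ == "__main__":
    desc = describe_failure(
        statement="theorem t : 1 + 1 = 2 := by",
        proof="simp",
        error_log="unsolved goals",
    )
    print(classify_failure(desc))  # e.g., ['Misuse of Automation Tactics']
```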
We highlight several recurring error patterns, beginning with the improper use of automation tactics in Lean4 provers: over-reliance on tactics such as aesop, simp, and linarith can expand the search space and lead to non-termination or unsolvable goals. The provers also struggle with complex inequalities, which frequently causes proof failures, and they introduce redundant hypotheses that clutter the proof context. Incomplete proofs are another common issue, where provers generate unfinished proofs without properly justifying intermediate reasoning steps, leading to shallow or incomplete conclusions. Representative examples of each pattern follow, each showing the Lean4 statement, the prover output, and an explanation.
Improper Use of Automation Tactics.

Lean4 statement:

```lean
theorem omni_theorem_4000 (f : ℝ → ℝ) :
    (∀ x y, (x + y ^ 2) * f (y * f x) = x * y * f (y ^ 2 + f x)) ↔
    ∀ x, f x = x ∨ f x = 0 ∨ ∃ a, a ≤ -1 ∨ a > 0 ∧ f x = if x ≠ -a ^ 2 then 0 else a := by
```

Prover output:

```lean
-- Proof Before
-- If f(x) ≠ 0, we need to show that there exists an a such that the conditions hold.
right
right
use 0
-- We choose a = 0 to satisfy the conditions.
simp
-- Proof After
```

Explanation: The prover selects the witness a = 0 without checking the original side-condition and then calls simp to close off the goal. This bypasses the essential case-by-case analysis of the core functional equation, resulting in an invalid proof that never constructs the required nonzero witness when needed.
Incomplete Proof.

Lean4 statement:

```lean
theorem DEMIMathAnalysis_50 :
    Tendsto (λ n => sqrt n * ∫ x in univ, 1 / (1 + x ^ 2) ^ n) atTop (𝓝 (sqrt π)) := by
```

Prover output:

```lean
rw [show sqrt = fun x => sqrt x by funext; rfl]
apply tendsto_atTop_mono (fun n => sqrt_le_sqrt <|
  integral_mono_on _ _ <| fun x _ => by
    gcongr
    all_goals norm_num)
refine' tendsto_atTop_of_eventually_const (by simp)
all_goals
  filter_upwards with _
  <;> simp
```

Explanation: The proof invokes tendsto_atTop_mono without establishing any of the analytical prerequisites. The final simp steps trivially handle eventual constancy but leave the main asymptotic argument unresolved.
Redundant Hypothesis Introduction.

Lean4 statement:

```lean
theorem aime_all_2005_II_1 (n : ℕ) (hn : n ≥ 6)
    (h : {s : Finset (Fin n) | s.card = 6}.ncard = 6 * {s : Finset (Fin n) | s.card = 3}.ncard) :
    n = 13 := by
```

Prover output:

```lean
norm_num [Nat.choose] at h
revert n hn h
rintro n hn h
ring_nf at h
nlinarith
```

Explanation: After norm_num already rewrites the binomial coefficients, the revert; rintro sequence merely re-introduces n, hn, and h in the same form, adding no new information and cluttering the context.
Inadequate Handling of Inequalities.

Lean4 statement:

```lean
theorem algebra_528739 (a b c : ℝ) (ha : 0 < a) (hb : 0 < b) (hc : 0 < c) :
    1 / (a^3 + b^3 + a * b * c) + 1 / (b^3 + c^3 + a * b * c) +
    1 / (c^3 + a^3 + a * b * c) ≤ 1 / (a * b * c) := by
```

Prover output:

```lean
-- We start by simplifying the left-hand side using the fact that each term is non-negative.
have h₀ : 0 < a * b * c := by positivity
...
-- We then apply the inequality to each term.
rw [div_le_div_iff] <;>
  nlinarith [sq_nonneg (a - b), sq_nonneg (b - c), sq_nonneg (c - a)]
...
```

Explanation: The solver attempts to apply nlinarith after a single div_le_div_iff, but the cyclic, high-degree fractional structure exceeds its linear-and-quadratic reasoning scope.
@misc{yu2025formalmathbenchmarkingformalmathematical,
title={FormalMATH: Benchmarking Formal Mathematical Reasoning of Large Language Models},
author={Zhouliang Yu and Ruotian Peng and Keyi Ding and Yizhe Li and Zhongyuan Peng and Minghao Liu and Yifan Zhang and Zheng Yuan and Huajian Xin and Wenhao Huang and Yandong Wen and Ge Zhang and Weiyang Liu},
year={2025},
eprint={2505.02735},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2505.02735},
}