Formal mathematical reasoning remains a critical challenge for artificial intelligence, hindered by the limited scope and scale of existing benchmarks. To address this, we present FormalMATH, a large-scale Lean4 benchmark comprising 5,560 formally verified problems spanning high-school Olympiad challenges to undergraduate-level theorems across diverse domains (e.g., algebra, applied mathematics, calculus, number theory, and discrete mathematics). To mitigate the inefficiency of manual formalization, we introduce a novel human-in-the-loop autoformalization pipeline that integrates: (1) specialized large language models (LLMs) for statement autoformalization, (2) multi-LLM semantic verification, and (3) negation-based disproof filtering using off-the-shelf LLM-based provers. This approach reduces expert annotation costs by retaining 72.09% of statements before manual verification while ensuring fidelity to the original natural-language problems. Our evaluation of state-of-the-art LLM-based theorem provers reveals significant limitations: even the strongest models achieve only a 16.46% success rate under practical sampling budgets, exhibit pronounced domain bias (e.g., excelling in algebra but failing in calculus), and over-rely on simplified automation tactics. Notably, we identify a counterintuitive inverse relationship between natural-language solution guidance and proof success in chain-of-thought reasoning scenarios, suggesting that human-written informal reasoning introduces noise rather than clarity in formal reasoning settings. We believe FormalMATH provides a robust benchmark for evaluating formal mathematical reasoning.
| Model | Size | Date | Sampling Budget | Overall |
|---|---|---|---|---|
| Kimina-Prover | 7B | 2025-04-14 | 32 | 16.46 |
| STP | 7B | 2025-03-20 | 32 | 13.87 |
| Goedel-Prover-SFT | 7B | 2025-02-11 | 32 | 13.53 |
| DeepSeek-V1.5-RL | 7B | 2024-08-15 | 32 | 10.18 |
| DeepSeek-V1.5-SFT | 7B | 2024-08-15 | 32 | 8.97 |
| InternLM-Prover | 7B | 2024-10-21 | 1×32×100 | 11.13 |
| BFS-Prover | 7B | 2025-02-25 | 1×32×100 | 1.16 |

Overall results (success rate, %) of different models on FormalMATH-All.
Formal mathematical reasoning (FMR) provides a rigorous axiomatic framework essential for automated proof validation. However, due to its inherent complexity, even expert mathematicians find it challenging, with projects like the Liquid Tensor Experiment taking years to formalize. While recent advances in large language models (LLMs) have shown promise in FMR, existing benchmarks like MiniF2F and ProofNet remain limited in both scale (containing only 244 and 186 test problems respectively) and scope, making it difficult to comprehensively evaluate LLMs' formal mathematical capabilities.
To address these limitations, we introduce FormalMATH—a large-scale Lean4-based benchmark designed to advance formal mathematical reasoning. Our project offers three main contributions:
Large-scale, Multi-domain Benchmark: We present a benchmark of 5,560 formally verified mathematical statements spanning diverse domains (algebra, geometry, calculus, number theory, and discrete mathematics) and difficulty levels (from high-school olympiads to undergraduate theorems). This dataset is 22.8× larger than the widely-used MiniF2F benchmark.
Efficient Human-in-the-Loop Autoformalization Pipeline: We develop a novel framework integrating multi-LLM autoformalization, semantic verification, and negation-based disproof strategies, automatically retaining 72.09% of candidate statements before manual review and significantly reducing expert annotation costs.
Comprehensive Evaluation of LLM-based Theorem Provers: Our systematic evaluation reveals fundamental limitations in current systems: even the best models achieve only 16.46% success rate, showing significant domain bias and over-reliance on simplified tactics. Notably, we find that natural language solution guidance can paradoxically decrease proof success rates, suggesting important directions for future research.
FormalMATH covers a broad spectrum of mathematics from high school competition problems to undergraduate-level topics, comprising 5,560 rigorously validated statements. The benchmark includes over seven core disciplines at the high school level—such as algebra, number theory, discrete mathematics, and Euclidean geometry—and more than seven specialized domains at the undergraduate level, including calculus, linear algebra, abstract algebra, and series. Detailed domain distribution is shown in the figure below.
FormalMATH implements a human-in-the-loop pipeline for efficient dataset creation, addressing the scarcity of mature autoformalization tools. We leverage coding-specialized and theorem-proving LLMs (such as Qwen2.5-7B-Coder and DeepSeek-Prover-Base) to generate formal Lean4 statements through best-of-N sampling. The pipeline then applies multiple validation strategies. First, a compile-and-filter step discards candidates that fail Lean4 compilation and yields high-quality training examples for fine-tuning. Next, semantic alignment with the original natural-language problem is verified by powerful general-purpose LLMs (e.g., o1-mini, Claude-3.5-Sonnet) through chain-of-thought reasoning, filtering out 60.7% of semantically misaligned statements. We further integrate negation-based disproof strategies, attempting to disprove each candidate with off-the-shelf provers to flag unprovable statements. Finally, 12 International Mathematical Olympiad medalist-level experts conduct manual verification. This pipeline is highly efficient, retaining 72.1% of statements ahead of manual verification and significantly reducing expert effort.
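To make the stages concrete, the following is a minimal Python sketch of the filtering pipeline described above. It is only an illustration: the helpers `autoformalize`, `lean_compiles`, `check_semantics`, and `disprovable` are hypothetical placeholders for the LLM samplers, the Lean4 compiler, the chain-of-thought semantic judges, and the off-the-shelf provers used in the actual pipeline.

```python
# Minimal sketch of a FormalMATH-style human-in-the-loop filtering pipeline.
# All helpers below are hypothetical stand-ins for external services.
from dataclasses import dataclass, field

@dataclass
class Candidate:
    informal: str                          # original natural-language problem
    formal: str                            # Lean4 statement from an autoformalizer
    compiles: bool = False
    judge_votes: list = field(default_factory=list)

def autoformalize(problem: str, n_samples: int = 8) -> list[str]:
    """Best-of-N statement autoformalization (placeholder for the LLM call)."""
    return [f"theorem candidate_{i} : True := by trivial" for i in range(n_samples)]

def lean_compiles(statement: str) -> bool:
    """Placeholder for a Lean4 compile check."""
    return True

def check_semantics(problem: str, statement: str, judge: str) -> bool:
    """Placeholder for chain-of-thought semantic alignment judged by one LLM."""
    return True

def disprovable(statement: str) -> bool:
    """Negation-based filter: try to prove the negation of the statement with an
    off-the-shelf prover; success would mean the original statement is false."""
    return False

def filter_problem(problem: str, judges=("judge_a", "judge_b")) -> list[Candidate]:
    kept = []
    for formal in autoformalize(problem):
        cand = Candidate(informal=problem, formal=formal)
        # Stage 1: compile-and-filter.
        cand.compiles = lean_compiles(formal)
        if not cand.compiles:
            continue
        # Stage 2: multi-LLM semantic verification (all judges must agree).
        cand.judge_votes = [check_semantics(problem, formal, j) for j in judges]
        if not all(cand.judge_votes):
            continue
        # Stage 3: negation-based disproof filtering.
        if disprovable(formal):
            continue
        kept.append(cand)  # Stage 4: forwarded to human expert review.
    return kept

if __name__ == "__main__":
    queue = filter_problem("Show that the sum of two even integers is even.")
    print(f"{len(queue)} candidate statement(s) forwarded to expert review")
```

Only candidates that pass all three automated stages reach the expert review queue, which is what keeps the manual annotation cost low.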
We analyze and define common error patterns found in existing theorem provers, using advanced general-purpose LLMs (e.g., o4-mini) to automate error diagnosis and classification. The method first prompts the LLM to extract verbalized features from a failed proof, then assigns predefined error categories in a second prompt-driven step; a sketch of this two-step classification follows the table. We analyze 100 randomly selected failed proof attempts from each prover, and domain experts manually inspect and correct the resulting annotations. The table below reports the frequency of the most common error patterns and provides insight into the provers' failure modes.
| Error Pattern | DeepSeek-SFT | DeepSeek-RL | Goedel | STP |
|---|---|---|---|---|
| Redundant Hypothesis | 18.0% | 34.0% | 27.0% | 24.0% |
| Incomplete Proof | 77.0% | 62.0% | 86.0% | 44.0% |
| Inadequate Handling of Inequalities | 8.0% | 13.0% | 20.0% | 1.0% |
| Misuse of Automation Tactics | 62.0% | 65.0% | 78.0% | 74.0% |
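To make the two-step procedure concrete, here is a minimal Python sketch of the classification flow. It is an illustration under stated assumptions, not the actual FormalMATH tooling: `query_llm` is a hypothetical stand-in for the general-purpose LLM (e.g., o4-mini), and the category names mirror the table above; expert inspection and correction of the labels happens outside this sketch.

```python
# Sketch of the two-step error diagnosis: step 1 verbalizes why a proof failed,
# step 2 maps that description onto the predefined error categories.
# `query_llm` is a hypothetical placeholder for an LLM API call (e.g., o4-mini).

ERROR_CATEGORIES = [
    "Redundant Hypothesis",
    "Incomplete Proof",
    "Inadequate Handling of Inequalities",
    "Misuse of Automation Tactics",
]

def query_llm(prompt: str) -> str:
    """Placeholder for an LLM API call; returns a canned answer here."""
    return "Misuse of Automation Tactics"

def describe_failure(statement: str, proof: str, error_log: str) -> str:
    """Step 1: ask the LLM to verbalize what went wrong in the failed proof."""
    prompt = (
        "Describe, in a few sentences, why the following Lean4 proof fails.\n"
        f"Statement:\n{statement}\n\nProof attempt:\n{proof}\n\nLean error:\n{error_log}"
    )
    return query_llm(prompt)

def classify_failure(description: str) -> list[str]:
    """Step 2: map the verbalized description onto predefined categories."""
    prompt = (
        "Given this failure description, list every matching category from "
        f"{ERROR_CATEGORIES} (a single proof may exhibit several):\n{description}"
    )
    answer = query_llm(prompt)
    return [c for c in ERROR_CATEGORIES if c.lower() in answer.lower()]

if __name__ == "__main__":
    desc = describe_failure(
        statement="theorem t : 1 + 1 = 2 := by",
        proof="simp",
        error_log="unsolved goals",
    )
    print(classify_failure(desc))  # e.g., ['Misuse of Automation Tactics']
```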
We highlight several recurring error patterns, beginning with the improper use of automation tactics in Lean4 provers: over-reliance on tactics such as aesop, simp, and linarith can expand the search space and lead to non-termination or unsolvable goals. The provers also struggle with complex inequalities, which frequently causes proof failures, and they introduce redundant hypotheses that clutter the proof context. Incomplete proofs are another common issue, where provers generate unfinished proofs without properly justifying intermediate reasoning steps, leading to shallow or incomplete conclusions. Representative examples of each pattern follow, each showing the Lean4 statement, the prover output, and an explanation.
Improper Use of Automation Tactics.

Lean4 statement:

```lean
theorem omni_theorem_4000 (f : ℝ → ℝ) :
    (∀ x y, (x + y ^ 2) * f (y * f x) = x * y * f (y ^ 2 + f x)) ↔
    ∀ x, f x = x ∨ f x = 0 ∨ ∃ a, a ≤ -1 ∨ a > 0 ∧ f x = if x ≠ -a ^ 2 then 0 else a := by
```

Prover output:

```lean
-- Proof Before
-- If f(x) ≠ 0, we need to show that there exists an a such that the conditions hold.
right
right
use 0
-- We choose a = 0 to satisfy the conditions.
simp
-- Proof After
```

Explanation: The prover selects the witness a = 0 without checking the original side-condition and then calls simp to close off the goal. This bypasses the essential case-by-case analysis of the core functional equation, resulting in an invalid proof that never constructs the required nonzero witness when needed.
Incomplete Proof.

Lean4 statement:

```lean
theorem DEMIMathAnalysis_50 :
    Tendsto (λ n => sqrt n * ∫ x in univ, 1 / (1 + x ^ 2) ^ n) atTop (𝓝 (sqrt π)) := by
```

Prover output:

```lean
rw [show sqrt = fun x => sqrt x by funext; rfl]
apply tendsto_atTop_mono (fun n => sqrt_le_sqrt <|
  integral_mono_on _ _ <| fun x _ => by
    gcongr
    all_goals norm_num)
refine' tendsto_atTop_of_eventually_const (by simp)
all_goals
  filter_upwards with _
  <;> simp
```

Explanation: The proof invokes tendsto_atTop_mono without establishing any of the analytical prerequisites. The final simp steps trivially handle eventual constancy but leave the main asymptotic argument unresolved.
Redundant Hypothesis Introduction.

Lean4 statement:

```lean
theorem aime_all_2005_II_1 (n : ℕ) (hn : n ≥ 6)
    (h : {s : Finset (Fin n) | s.card = 6}.ncard = 6 * {s : Finset (Fin n) | s.card = 3}.ncard) :
    n = 13 := by
```

Prover output:

```lean
norm_num [Nat.choose] at h
revert n hn h
rintro n hn h
ring_nf at h
nlinarith
```

Explanation: After norm_num already rewrites the binomial coefficients, the revert; rintro sequence merely re-introduces n, hn, and h in the same form, adding no new information and cluttering the context.
Inadequate Handling of Inequalities.

Lean4 statement:

```lean
theorem algebra_528739 (a b c : ℝ) (ha : 0 < a) (hb : 0 < b) (hc : 0 < c) :
    1 / (a^3 + b^3 + a * b * c) + 1 / (b^3 + c^3 + a * b * c) +
    1 / (c^3 + a^3 + a * b * c) ≤ 1 / (a * b * c) := by
```

Prover output:

```lean
-- We start by simplifying the left-hand side using the fact that each term is non-negative.
have h₀ : 0 < a * b * c := by positivity
...
-- We then apply the inequality to each term.
rw [div_le_div_iff] <;>
  nlinarith [sq_nonneg (a - b), sq_nonneg (b - c), sq_nonneg (c - a)]
...
```

Explanation: The solver attempts to apply nlinarith after a single div_le_div_iff, but the cyclic, high-degree fractional structure exceeds its linear-and-quadratic reasoning scope.
@misc{yu2025formalmathbenchmarkingformalmathematical,
title={FormalMATH: Benchmarking Formal Mathematical Reasoning of Large Language Models},
author={Zhouliang Yu and Ruotian Peng and Keyi Ding and Yizhe Li and Zhongyuan Peng and Minghao Liu and Yifan Zhang and Zheng Yuan and Huajian Xin and Wenhao Huang and Yandong Wen and Ge Zhang and Weiyang Liu},
year={2025},
eprint={2505.02735},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2505.02735},
}