Introducing FATE: A Multi-level Formal Benchmark for Frontier Algebraic Problems

We are pleased to announce FATE (Formal Algebra Theorem Evaluation), a collection of formal benchmarks in algebra, progressing from undergraduate-level to research-level problems in Lean. The FATE benchmarks contain three levels of difficulty: FATE-M (Medium), FATE-H (Hard), and FATE-X (Expert), consisting of 150, 100, and 100 problems in abstract and commutative algebra. FATE-X represents the first formal benchmark surpassing Mathlib’s mathematical level, requiring models to understand and handle deep mathematical definitions being formalized for the first time. These benchmarks form a progressively challenging series leading to frontier mathematics, with each level demonstrating significantly increased mathematical and formalization difficulty. The problems were carefully curated by a unique team combining leading mathematicians and formalization experts. The FATE benchmark is now open source and available at the FATE repository.

To evaluate the capabilities of reasoning LLMs and specialized theorem provers on research-level mathematical problems, we measured their success rate in generating correct formal proofs. These results reveal that while models achieve a reasonable success rate on the medium-difficulty problems, their performance drops sharply on the harder benchmarks.

Key results are visualized below. Further details will be discussed in the subsequent sections.

Formalization accuracy across FATE-M/H/X benchmarks.
Accuracy drops dramatically from FATE-M to -H and -X.

Scope of the FATE Benckmarks

As LLMs continue to advance in reasoning capabilities, a natural question arises: Can these models contribute to modern mathematics—the honor of the human mind? Modern mathematics is primarily conducted in natural language, a flexible and intuitive medium that fosters creativity and rapid idea development. However, this informal approach can sometimes hide subtle errors or ambiguities. To address this, formalization translates these ideas into a rigorous, symbolic language that a computer can verify, providing an unparalleled level of certainty in the correctness of a proof. While informal mathematics excels in discovery and communication, formal mathematics offers indisputable soundness. Formalization thus provides a reliable method to ensure the correctness of LLM-generated mathematical reasoning, bridging the gap between human intuition and machine-verified certainty.

Evaluating research-level mathematics requires accurate verification of mathematical arguments. Yet, human evaluation of informal mathematical proofs faces major challenges: the more advanced the proofs, not only are there fewer experts who can reliably judge them, but the evaluation process also becomes significantly slower and more prone to error. This creates a critical bottleneck for iterative development. Formal verification addresses this by enabling rapid, precise assessment of mathematical correctness using computer software.

Recently, LLMs specialized in formal theorem proving have demonstrated impressive results in mathematical contests, introductory university-level problems, and even IMO-level problems. These achievements have been evaluated primarily through benchmarks that test competition-style problem solving (MiniF2F) and elementary undergraduate mathematics (ProofNet, FormalMATH, PutnamBench), which cover foundational but relatively shallow undergraduate topics compared to modern mathematics. But how far are they from solving research-level problems?

Mathematical contests often prioritize clever tricks over the systematic application of deep theoretical frameworks. Introductory university courses, by design, operate at a lower level of abstraction. Advanced mathematics is more open-ended, requiring not only the comprehension and application of broad and deeply nested abstract concepts but also the creation of new tools and theoretical frameworks. To address these gaps between mathematical contest and the modern mathematics, we build the FATE benckmarks. Unlike existing benchmarks that focus on contest problems or introductory undergraduate material, FATE is specifically designed to reflect the characteristics of modern mathematical research and to present problems involving profound, cutting-edge mathematical topics. Most significantly, FATE-X represents the first formal benchmark to surpass the mathematical level of Mathlib’s existing formalizations, requiring models to work with unfamiliar definitions and spontaneously formulate new mathematical concepts—capabilities essential for research contribution at the frontier of mathematics.

FATE benchmarks focus on abstract and commutative algebra. This area is particularly well-suited for testing research-level reasoning because its abstract and relatively self-contained nature allows for deep problem-solving without requiring extensive formalizations from other mathematical fields. Lean’s existing formalizations in algebra already provide a good foundation for research-level questions, and with minimal mathematical input from other topics, one can attack advanced problems in this area.

To best reflect the characteristics of modern mathematics and to build a difficulty progression from undergraduate-level to frontier mathematics, the FATE benchmarks are carefully curated:

The mathematical problems are selected from established sources: while FATE-M questions are drawn from undergraduate textbooks, FATE-H and FATE-X questions are sourced from graduate textbooks, Ph.D. qualifying exams, and research literature.
The benchmark questions are classified according to their difficulty. Broadly speaking, the three benchmarks have progressively increasing difficulty:

FATE-M consists primarily of basic level textbook exercises,
FATE-H contains challenging problems from abstract/commutative algebra final exams,
FATE-X includes problems at Ph.D. qualifying exam level and beyond.

All problems, especially those in FATE-H and FATE-X, were carefully chosen by a team led by mathematics professors from Peking University, working with a group of nearly twenty PhDs and postdocs. The selection process by mathematical experts guarantees that every problem is of high quality, mathematically profound, and presents considerable challenges that test the limits of both human and machine reasoning capabilities.

For a detailed distribution of problem types in FATE-H and FATE-X, please refer to the following charts:

FATE-H

FATE-X

Problem distributions by mathematical topics in FATE-H/X.

The charts show the focus shifting from abstract algebra in FATE-H to advanced commutative algebra topics in FATE-X.

Due to its scope and difficulty, FATE-X uses mathematical definitions not yet formalized in Mathlib. This introduces a distinctive formalization feature: new definitions are presented before the problem statement. Consequently, LLMs must derive necessary lemmas to use these definitions effectively. Moreover, to solve these challenging problems successfully, we expect models to discover mathematical phenomena, abstract them into useful lemmas, and when needed, spontaneously formulate new definitions. This capability is critical for research-level mathematical formalization and problem-solving.

The curation process is particularly challenging due to the rarity of experts worldwide who possess both the deep mathematical knowledge to understand these problems and the specialized skillset required for formalization in Lean. The creation of FATE-X itself was made possible by assembling a unique team with specialists in each domain as well as exceptional individuals with expertise in both areas.

Experiment Results

With this context of the FATE benchmarks, we revisit the experimental results in detail. The complete evaluation metrics are summarized below:

Formal accuracy results on FATE benchmarks

These results reveal a stark contrast in performance: while models achieve reasonable success rate on FATE-M, the best-performing model solves only 3 out of 100 problems in FATE-H, and no model produces any valid Lean proofs in FATE-X.

Benchmark and Evaluation Details

To support researchers working with FATE, we provide these implementation details:

Formalization Features

Each Lean file contains exactly one sorry after the final statement.
All statements begin with proper namespace declarations and natural language translations in Latex appear as comments before formal statements.
Files are self-contained (only depending on Mathlib).
Universe levels are fixed to avoid universe issues related to category theory.

Creation Process

FATE-M

Problems selected and formalized by the same team of students.

FATE-H/X

Collection: We sourced 400 candidate problems from graduate-level textbooks, PhD qualifying exams, and research literature. This initial collection was then rigorously curated through a collaborative process involving nearly twenty experts (PhDs and postdocs specializing in pure algebra) in a series of intensive sessions during four days. The entire selection process for the high-difficulty FATE-H and FATE-X tiers was supervised by leading mathematicians, who ensured that the final chosen problems are not only highly challenging but also mathematically profound, sound, and well-balanced across key topics.
Formalization: A subset of 200 problems was formalized in Lean by five experienced Mathlib contributors. This effort was organized into five dedicated workshops, each consisting of over five hours of focused work.
Review: The formalizations were checked and corrected by two Mathlib contributors with relevant mathematical expertise, who dedicated over 10 hours per person per benchmark. Additionally, external experts further checked the accuracy of formal statements.

Example Problems

FATE-M

import Mathlib
/--
Suppose $D$ is integral domain, $m$ and $n$ are coprime positive integers.
Prove that for any $a, b \in D$, if $a^{m}=b^{m}$ and $a^{n}=b^{n}$, we have $a=b$
-/
theorem eq_of_pow_eq_of_coPrime {R : Type*} [Ring R] [IsDomain R] (a b : R) 
    (m n : ℕ) (hm : m > 0) (hn : n > 0) (hmn : m.Coprime n) 
    (h₁ : a ^ m = b ^ m) (h₂ : a ^ n = b ^ n) : a = b := by
  sorry

FATE-H

import Mathlib

open IntermediateField RatFunc

/-- Let $\mathbb{F}_4$ be the field with $4$ elements, $t$ a transcendental over $\mathbb{F}_4$,
and $F =\mathbb{F}_4(t^4 + t)$ and $K =\mathbb{F}_4(t)$. Show that $K$ is Galois over $F$. -/
theorem isGalois_galoisField_adjoin_X_pow_four_add_X :
    IsGalois (GaloisField 2 2)⟮(X ^ 4 + X : RatFunc (GaloisField 2 2))⟯
    (RatFunc (GaloisField 2 2)) := by
  sorry

FATE-X

import Mathlib

/--
Let \( A \) be a domain and \( K \) its field of fractions.
\( x \in K \) is called almost integral if there exists an element 
\( r\in A, r \ne 0 \) such that \( rx^n \in A \) for all \( n \ge 0 \).
-/
def IsAlmostIntegral {A : Type} [CommRing A] [IsDomain A] 
    (x : FractionRing A) : Prop :=
  ∃ r : A, r ≠ 0 ∧ ∀ n : ℕ, ∃ y : A, 
    r • (x ^ n) = algebraMap A (FractionRing A) y

/--
\( A \) is called completely integrally closed if 
every almost integral element of \( K \) is contained in \( A \).
-/
def IsCompletelyIntegrallyClosed (A : Type) [CommRing A] 
    [IsDomain A] : Prop :=
  ∀ x : FractionRing A, IsAlmostIntegral x → 
    ∃ y : A, x = algebraMap A (FractionRing A) y

/--
Let \( A \) be a domain. Show that if \( A \) is completely 
integrally closed, so is \( A[X] \). 
-/
theorem completely_integrally_closed_polynomial_ring {A : Type} [CommRing A]
    [IsDomain A] (h : IsCompletelyIntegrallyClosed A) : 
    IsCompletelyIntegrallyClosed (Polynomial A) := by
  sorry

Evaluation Methodology

Formalization accuracy is pass@64 with a maximal generation length of 64k tokens. We use Lean’s native command to verify models’ outputs. The Lean toolchain version is kept consistent with each benchmark: v4.16.0 for FATE-M and FATE-H, while v4.19.0 for FATE-X.

Finale

The FATE benchmark represents our commitment to advancing large language models’ role in modern mathematical research. By creating formal benchmarks that mirror real research challenges, we aim to push LLM capabilities beyond competition-style problems toward genuine mathematical discovery—from graduate-level exercises to unsolved problems at mathematics’ frontiers. Given recent progress in the field, we expect the research community would gradually shift its focus towards modern mathematics. We hope this benchmark will serve as a resource for researchers working to enhance LLMs’ capabilities in advanced mathematical reasoning.

We’ve always been driven by one core goal: to empower large language models to assist with cutting-edge mathematical research. To us, this means creating an AI that can help mathematicians prove theorems and rigorously verify their proofs, but also one that can help them gain interesting insights, draw connections between seemingly unrelated concepts, and even find relevant results from other branches of mathematics or other scientific fields. As William Thurston noted in his influential essay “On Proof and Progress in Mathematics,” the ultimate goal is not just to prove theorems, but to understand mathematics. This deeper understanding is the main thing we hope AI can help augment.

Nonetheless, the act of proving theorems remains essential. It is not merely about verifying the correctness of a statement; the process itself is a powerful mechanism for building deeper understanding. For both humans and AI, working through a proof forces a detailed engagement with concepts, tools, and theories, solidifying comprehension in a way that passive learning cannot. As we mentioned earlier, natural language, for all its flexibility, struggles to provide this level of rigor, especially as mathematics becomes more advanced and informal arguments become harder to reliably evaluate. This is where formal systems like Lean become crucial tools, providing the necessary framework to ensure soundness and build a verifiable path toward true mathematical understanding. Thanks to the significant contributions of mathematicians like Kevin Buzzard, the Lean community has grown and Mathlib has become a powerful resource. However, this journey has been long and difficult, and significant challenges remain. As Terence Tao once noted, the learning curve for formalization in Lean is steep, and working with it can be roughly 25 times slower than with natural language.

To lay the groundwork for this ambitious goal, we’ve spent the past two years building a strong foundation. We realized that a truly collaborative effort required more than just technology; it also demanded a new generation of formalization experts. We organized summer schools to teach students how to build and work with formal infrastructure.

In parallel, driven by the goal of accelerating formalization and making it painless for mathematicians, we developed a series of tools to address the core challenge itself. We predict that in the future, conducting mathematics in a formal language will become increasingly common, making such AI tools indispensable. Our approach was guided by a fundamental question: what capabilities does an LLM need to become a genuine mathematical collaborator using formal language?

First, we realized a model couldn’t just “reason” in a vacuum. It needed to interact with the vast formal library of existing knowledge. To formalize new mathematical ideas, a model must be able to find and apply relevant theorems from Mathlib. This requires more than simple keyword matching; it requires semantic understanding. This led us to develop Leansearch, a semantic search engine that helps models and humans alike navigate the formal library by understanding the meaning behind a query.

Second, to train a model to master natural language mathematical ideas and translate them into formal proofs, we needed a massive amount of high-quality aligned data. Manually creating this data is slow and difficult. To address this, we first built Herald, a dataset of aligned informal and formal mathematical statements. Using this data, we then trained a model, the Herald Translator, to automatically translate natural language mathematical statements into their formal counterparts. This was a critical step in scaling our data pipeline and bootstrapping the training process.

Finally, we brought all the pieces together. By integrating our two main components—Leansearch and the Herald Translator—and fine-tuning them alongside other open-source provers on our in-house dataset, we developed a comprehensive theorem-proving system. This effort, which also incorporates Jixia, an interactive tool for Lean, culminated in REAL-Prover—a specialized model designed to generate formal proofs.

Each of these projects was a deliberate step in our exploratory journey, built to address a specific need we encountered. While developed to advance our own research, we shared them with the community as foundational tools. We believe that by building infrastructure that serves everyone, we can accelerate progress across the entire field. Every step of our journey and every open-source tool we’ve created is a testament to our ongoing commitment to this shared vision: empowering large language models to assist with cutting-edge mathematical research.

Acknowledgments

This work is supported in part by the New Cornerstone Investigator Program and in part by Ubiquant.

We are deeply grateful to the following individuals for their contributions to the selection of mathematical problems for the FATE benchmarks: Kaiyi Chen, Haocheng Fan, Yiqin He, Yongle Hu, Shanxiao Huang, Yudong Liu, Tian Qiu, Yinchong Song, Peihang Wu, Zhenhua Wu, Tianyi Xu, Zhehan Xu, Huanhuan Yu, Huishi Yu, Jiahong Yu, Zhanhao Yu and Xiao Yuan.

We also thank Nailin Guan, Zixun Guo, Yongle Hu, and Jingting Wang for their dedicated work in formalizing mathematical problems.

We extend our sincere thanks to Johan Commelin and Christian Merten for their careful review and corrections to the benchmark problems. We are also indebted to Siddhartha Gadgil, Filippo A. E. Nuccio and Floris van Doorn for their valuable discussions and suggestions.

Finally, we express our deepest appreciation to the Lean community for developing and maintaining Mathlib—without this foundational infrastructure, this work would not be possible.