Today, we announce the FATE (Formal Algebra Theorem Evaluation) benchmarks, a collection of formal algebra benchmarks in Lean that progress from undergraduate-level to research-level problems.
Recently, LLMs specialized in formal theorem proving have demonstrated impressive results on mathematical contests and even IMO-level problems. However, such contests differ from modern mathematics in key ways. For instance, while contests often emphasize clever tricks, advanced mathematics is more open-ended, requiring not only the comprehension and application of broad and deeply nested abstract concepts but also the creation of new tools and theoretical frameworks.
To address these gaps and provide a progressive path from undergraduate-level to frontier mathematics, we created the FATE benchmarks, which test model performance on exercises reflecting modern mathematical research. Such a benchmark is crucial for guiding the development of models capable of higher-level mathematical reasoning.
The FATE benchmarks contain three levels of difficulty: FATE-M (Medium), FATE-H (Hard), and FATE-X (Extra Hard), consisting of 150, 100, and 100 problems, respectively, in abstract and commutative algebra. FATE-X is the first formal benchmark surpassing Mathlib’s mathematical level, requiring models to understand and handle deep mathematical definitions being formalized for the first time. These benchmarks form a progressively challenging series leading to frontier mathematics, with each level demonstrating significantly increased mathematical and formalization difficulty. The problems were carefully curated by a unique team combining leading mathematicians and formalization experts. The FATE benchmark is now open source and available at the FATE repository.
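To give a sense of the format, here is an illustrative toy example in the spirit of an undergraduate-level FATE-M exercise. It is not an actual benchmark problem; the statement and proof are our own illustrative choices, and they assume a Mathlib-based Lean project.

```lean
import Mathlib

-- Illustrative only (not an actual FATE problem): an undergraduate-style
-- abstract algebra exercise. A model is given the formal statement and must
-- produce a proof; the proof below is one possible answer.
-- Statement: a left identity in a monoid is the identity element.
example {M : Type*} [Monoid M] (e : M) (h : ∀ a : M, e * a = a) : e = 1 := by
  simpa using h 1
```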
To evaluate reasoning LLMs and theorem provers on research-level problems, we measured their success rate in generating correct formal proofs. These results reveal that while models achieve a reasonable success rate on the medium-difficulty problems, their performance drops sharply on the harder benchmarks.
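Concretely, this kind of evaluation reduces to checking whether a model-generated proof type-checks. Below is a minimal, hypothetical sketch of such a loop; it is not the harness used for FATE, and it assumes an illustrative `generate_proof` model call plus a Lake project in which candidate files can be checked with `lake env lean`.

```python
import subprocess
from pathlib import Path

def generate_proof(statement: str) -> str:
    """Hypothetical model call: return Lean source with the proof filled in."""
    raise NotImplementedError

def lean_check(source: str, project_dir: Path) -> bool:
    """Return True iff the candidate Lean file elaborates without errors."""
    if "sorry" in source:  # reject placeholder proofs up front
        return False
    candidate = project_dir / "Candidate.lean"
    candidate.write_text(source)
    try:
        result = subprocess.run(
            ["lake", "env", "lean", str(candidate)],
            cwd=project_dir, capture_output=True, text=True, timeout=600,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0

def success_rate(problems: list[str], project_dir: Path) -> float:
    """Fraction of problems whose generated proof type-checks."""
    solved = sum(lean_check(generate_proof(p), project_dir) for p in problems)
    return solved / len(problems)
```

In practice, such harnesses typically sample several attempts per problem (pass@k) and also verify that no extra axioms are introduced, but the core measurement is the same.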
Key results are visualized below: accuracy drops dramatically from FATE-M to FATE-H and FATE-X.
For more details, visit our blog post about FATE.