For a brief summary of what we have done, see News.
Introduction
Recent progress in automated theorem proving has begun to close the gap between competition-level benchmarks and research-level formalization. For a long time, the community focused on isolated benchmarks such as miniF2F, which test whether automated systems can produce a few hundred lines of Lean code to solve high-school or competition-level problems. Such benchmarks primarily assess foundational proving skills: synthesizing relatively short, self-contained proofs against a fixed set of library dependencies. But the real challenge lies elsewhere. Formalizing research-level mathematics at repository scale demands an entirely different set of capabilities: managing cross-file dependencies, designing and integrating new definitions that extend existing libraries, decomposing long and intricate mathematical arguments into chains of reusable lemmas, and maintaining a coherent codebase architecture across an entire project.
Before describing recent progress, it is worth clarifying two concepts that are often conflated in this area: autoformalization and automated theorem proving. Autoformalization, broadly speaking, refers to the translation of mathematical content from informal natural language into a formal language such as Lean—converting human-written proofs, definitions, and statements into machine-verifiable form. Automated theorem proving, by contrast, focuses on generating proofs—whether informal or formal—for mathematical claims. In principle, these are distinct tasks: one is a cross-modal translation problem, the other a search or reasoning problem. In practice, however, the boundary between them blurs significantly as mathematical difficulty increases. An informal proof of a research-level theorem typically contains substantial gaps—steps that are obvious to an expert but far from trivial to formalize. Filling these gaps requires genuine theorem proving, not mere syntactic translation. Moreover, the structural differences between informal and formal mathematics mean that even “straightforward” translation is rarely straightforward: definitions may not align with library conventions, proof strategies that are natural in one modality may be infeasible in the other, and the granularity of reasoning steps differs fundamentally between the two representations (see, e.g., Dong and Ju, 2026 for a detailed discussion). This deep entanglement explains why most agent-based formalization systems in practice—including ours—do not sharply separate the two tasks, but instead treat formalization as an integrated process that interleaves translation with proving.
We have seen staggering progress recently. On competition-level benchmarks such as Putnam 2025, systems including AxiomProver, Harmonic Aristotle, Seed-Prover, and Numina-Lean-Agent have all delivered exceptional results, each scoring above 10/12, with AxiomProver and Numina-Lean-Agent notably achieving perfect scores. Beyond these competition results, each system brings distinct strengths to the frontier: Seed-Prover pioneers whole-proof reinforcement learning with a lemma-based generation style; both AxiomProver and Aristotle have contributed to the formalization of open problems posed by Erdős; Harmonic provides a public API for Aristotle as a step-level prover; and Numina-Lean-Agent illustrates that augmenting a general coding agent with the right tool-use extensions can match purpose-trained provers. Meanwhile, Math, Inc.’s Gauss recently completed the formal verification of Viazovska’s sphere-packing results in dimensions 8 and 24, and many formalization experts have embraced frontier AI tools to tackle research-level tasks with impressive results. Inspired by this progress—and by the observation that the integration of formalization and proving is where the deepest challenges remain—we built Archon. Crucially, none of these advances would be possible without Mathlib, the actively maintained mathematical library for Lean 4, which now contains over 263,000 theorems and 126,000 definitions contributed by more than 770 community members. This sustained collective effort provides the formal infrastructure on which all current formalization systems depend.
We believe that by further integrating an AI-centric understanding of how LLMs reason and behave, we can more fully unleash the power of existing frontier models and agent systems, creating a synergy between mathematical expertise and agentic orchestration that neither discipline achieves alone. By treating a frontier model not as a simple code generator but as a strategic agent capable of planning, reflecting, and adapting, we arrive at a level of autonomy that feels less like operating a tool and more like collaborating with an expert colleague.
In our evaluations, we observe that with current techniques, frontier models have become capable of independently generating the overall architecture for repository-level formalization, then incrementally populating and refining it until the entire formal project compiles. Archon demonstrated this potential by fully automating the formalization of FirstProof Problem 6 and completing Problem 4 in a nearly automated setting, requiring only a single natural-language hint for one local lemma.
System Architecture and Technical Design
Our initial methodology was built on providing guidance and support while interfering as little as possible with the decision-making of powerful commercial models. A standalone Claude Code agent is already capable of complex analysis and planning before execution.
However, in practice we encountered two critical limitations of the single-agent approach. First, when faced with a full repository, the agent tends to spend excessive time on holistic analysis, causing its context window to fill with exploratory notes before meaningful proving begins. Second, and more damaging to overall progress, agents exhibit pronounced task-aversion when confronting difficult lemmas. When facing a challenging obligation that we estimate would require roughly a thousand lines of Lean code involving infrastructure not yet present in Mathlib, Claude Opus 4.6 will sometimes disregard explicit instructions to continue working and to recursively expand natural-language sub-proofs for missing results. Instead, it insists that the lemma demands prohibitive infrastructure, leaves a sorry, and ceases further attempts, even when our formalization experts assess the obligation to be well within the model’s capability.
We attribute both failure modes to the outsized influence that context exerts on agent behavior. A large, sprawling repository makes the task appear intractable; an agent’s own failed attempts and pessimistic annotations further erode its willingness to persevere in subsequent sessions. Based on this diagnosis, we introduced a second agent, the Plan Agent, operating in a fresh context to decompose the task into focused subproblems and to provide each Lean Agent session with a clear work plan, targeted guidance, and a constrained scope. This improved context environment substantially mitigated both context explosion and task-aversion, and became the core rationale for our dual-agent cyclic architecture. In practice, using OpenClaw as a dispatcher to orchestrate this loop yields marked improvements.
Archon divides the formalization process into three phases.
In Phase 1 (Scaffolding), the Lean Agent analyzes the informal proof and constructs the initial file structure: splitting the proof into modules, defining theorem signatures, and placing sorry placeholders at each proof obligation. The quality of this decomposition directly affects the efficiency of all subsequent work; however, the decomposition is not fixed and can be revised as the proof develops.
In Phase 2 (Proving), the Lean Agent and Plan Agent enter an iterative cycle. The Lean Agent attempts to fill sorry placeholders; when it encounters a difficulty it cannot resolve independently, it returns diagnostic information to the Plan Agent. The Plan Agent, operating from a global perspective, identifies the underlying bottleneck, re-evaluates the proof strategy, and proposes a concrete approach for resolution before dispatching a revised task to the Lean Agent. This cycle continues until all proof obligations are closed. When Phase 1 produces a sufficiently modular decomposition where proofs are independent of one another, Phase 2 can deploy multiple Lean Agents in parallel, filling independent sorry placeholders concurrently and substantially accelerating the overall process.
In Phase 3 (Verification and Polish), the completed proof undergoes a final pass: the system verifies compilation, confirms the absence of sorry, axiom, and other escape hatches, and performs a dedicated code quality improvement pass to extract reusable lemmas, reduce proof term complexity, and remove unnecessary set_option maxHeartbeats overrides.
The effectiveness of Phase 2 depends on the Plan Agent’s ability to select the right intervention when a Lean Agent reports failure. We equip the Plan Agent with three primary strategies:
- Detailed Informal Support. The Plan Agent invokes the Informal Agent to generate a step-by-step natural-language proof for the specific lemma that blocked the Lean Agent. This is the most common intervention: the Lean Agent receives not just a task reassignment but a complete mathematical roadmap for the problem.
- Decomposition. When a single proof is too large or complex for the Lean Agent to handle in one pass, the Plan Agent instructs the Lean Agent to break it into smaller, independently provable sub-lemmas. The Lean Agent then performs the actual decomposition in code, with the Plan Agent’s analysis reducing the risk of unproductive splits.
- Informal Re-routing. When the Lean Agent’s failure stems not from a missing proof step but from a formalization route that cannot be completed—for example, a dependency on Mathlib infrastructure that does not exist—the Plan Agent invokes the Informal Agent to propose an entirely alternative proof strategy that avoids the blocked path. This is the strategy that resolved the Rouché’s theorem obstacle in Problem 4.
Because these three strategies address structurally different failure modes—insufficient mathematical detail, excessive proof complexity, and infeasible formalization routes—the Plan Agent can match its intervention to the nature of the obstacle rather than repeating a single generic response, making the overall loop substantially more robust.
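The dispatch logic above can be sketched as a simple mapping from failure mode to intervention. The enum names and return values below are our illustration, not Archon's internal API.

```python
from enum import Enum, auto

# Hypothetical sketch of the Plan Agent's intervention dispatch.

class FailureMode(Enum):
    MISSING_DETAIL = auto()     # the informal argument skips steps
    TOO_COMPLEX = auto()        # the proof is too large for one pass
    INFEASIBLE_ROUTE = auto()   # the route needs infrastructure Mathlib lacks

def choose_intervention(failure: FailureMode) -> str:
    if failure is FailureMode.MISSING_DETAIL:
        return "detailed informal support"   # ask the Informal Agent for a sub-proof
    if failure is FailureMode.TOO_COMPLEX:
        return "decomposition"               # split into independently provable lemmas
    return "informal re-routing"             # request an alternative proof strategy
```

The value of the enum is diagnostic discipline: the Plan Agent must first classify the failure before acting, rather than defaulting to a single generic retry.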
Our agents are provided with a specialized suite of tools, implemented via terminal CLI and Model Context Protocols (MCPs), that enable interaction with the formal environment and retrieval of external mathematical knowledge:
- Ask Informal Agent. To compensate for gaps in Opus’s mathematical reasoning and to operate in a cleaner context, we provide an Informal Agent with basic retrieval and self-reflection capabilities, powered by Gemini or GPT. Empirically, Gemini tends to deliver the strongest mathematical performance among the three models. When the Lean Agent requires mathematical guidance beyond what the provided informal proof covers and cannot resolve the gap on its own, the Informal Agent generates detailed natural-language sub-proofs or proposes alternative proof strategies. This idea of delegating mathematical reasoning to a separate model draws on a similar design in Numina-Lean-Agent.
- LeanSearch. Our previously published tool LeanSearch is widely adopted in the Lean community as a practical accelerator for formalization work. For Archon, we further strengthened its retrieval accuracy across all entry types, particularly definitions, and updated it to the latest Lean and Mathlib versions. The enhanced LeanSearch equips our agents with fuzzy search capabilities over large-scale repositories. The improved version is now publicly available.
- Lean LSP MCP. A widely adopted MCP component that accelerates compilation feedback and provides well-formatted Lean diagnostics to the agent. Its built-in retrieval component includes a LeanSearch integration; we adjusted the tool-usage prompts to steer the agent toward our upgraded LeanSearch.
- Web Search. Allows agents to retrieve published papers and standard mathematical results from the web, enabling them to automatically expand the proofs of referenced theorems not yet available in local libraries.
- Memory Management. Before each context compression or automatic session restart—inevitable in long-running tasks that repeatedly approach the context limit—the agent summarizes and persists its architectural reasoning about the overall proof structure, failed approaches it has already attempted, and techniques it has learned during the run. This memory file serves as a distilled record of the agent’s key judgments, making it straightforward for us to trace its decisions, diagnose failures, and improve the system for subsequent runs.
Providing tools alone is not sufficient; clear descriptions of each tool’s purpose and explicit guidance on when and how to use them are essential for the model to deploy them effectively. This behavioral guidance is encoded in what we call skills—structured instruction files, built upon the Lean 4 Skills framework, that shape the agent’s working patterns, code conventions, and corner-case handling strategies to mirror authentic formalization practice.
Selecting the Problems
To evaluate Archon’s capability on research-level, repository-scale formalization, we needed a benchmark consisting of genuine mathematical results with detailed informal proofs—problems where the challenge lies in the formalization process itself rather than in discovering new mathematics. FirstProof provides exactly this: a set of ten research-level problems, each accompanied by a complete informal proof, originally designed to test whether frontier AI systems can produce correct solutions to genuine research mathematics. For our purposes, the availability of detailed informal proofs makes it an ideal testbed for evaluating autoformalization systems.
Our selection from this set was guided by two exclusion criteria: a problem was set aside if the required infrastructure is missing from Mathlib, or if its core challenge does not lend itself well to formalization. As noted in the fourth paragraph of our News post, we focused on problems where the existing Mathlib ecosystem provided sufficient foundational support, allowing us to isolate and test the system’s formalization abilities rather than its capacity to build missing library infrastructure from scratch.
Below are the detailed reasons for each excluded problem.
| Problem | Main area | Why it was not selected |
|---|---|---|
| 1 | Stochastic PDE / infinite-dimensional analysis | Concerns quasi-invariance of the $\Phi^4_3$ measure under smooth shifts, and requires substantial infrastructure in stochastic PDE, renormalization, and infinite-dimensional measure theory. |
| 2 | Representation theory over $p$-adic fields | Concerns Whittaker models and Rankin–Selberg integrals over non-archimedean local fields, and requires substantial infrastructure in local representation theory, including Howe vectors, Whittaker models, and local zeta integrals. |
| 3 | Probability / Markov chains / Macdonald theory | Concerns the construction of a nontrivial Markov chain with a prescribed stationary distribution arising from interpolation ASEP and Macdonald polynomials, and requires substantial infrastructure in Markov chains, stationary distributions, and interpolation Macdonald theory. |
| 5 | Equivariant stable homotopy theory | Concerns equivariant stable homotopy theory and $N_\infty$-operads, and requires substantial infrastructure in equivariant spectra, geometric fixed points, transfer systems, and slice filtrations. |
| 7 | High-dimensional geometric topology | Requires substantial infrastructure in high-dimensional geometric topology, especially lattices, torsion phenomena, and $\mathbb{Q}$-acyclic universal covers. |
| 8 | Symplectic geometry | Concerns a Lagrangian smoothing statement and requires substantial infrastructure in Hamiltonian isotopies, Lagrangian topology, and symplectic-topological obstructions. |
| 9 | Algebraic / structural characterization | Asks for an explicit polynomial map $F$ with bounded-degree coordinate functions and an if-and-only-if characterization of rank-one scaling structure, so a substantial part of the difficulty lies in discovering the right construction rather than in proof-heavy symbolic reasoning. |
| 10 | Numerical linear algebra / algorithm design | Asks for a matrix-free preconditioned conjugate-gradient method, including a concrete preconditioner, efficient matrix-vector products, and a complexity analysis avoiding any $O(N)$ computation, so much of the difficulty lies in algorithmic design rather than in proof-heavy symbolic reasoning. |
What We Have Learnt from the Process
Before presenting and analyzing our results on FirstProof Problems 4 and 6, we share several observations from the process of building and iterating on Archon. These insights shaped the system’s final design and may be useful to others working on agent-based formalization. Among them, we found that workflow design has a decisive impact on agent performance, exceeding that of model choice or prompt engineering.
During early runs on Problem 6, the agent spent disproportionate time reading files and reasoning about strategy while producing very little code. Shortening each task instruction yielded immediate improvement, which led us to separate strategic reasoning from proof execution entirely. The Plan Agent decomposes goals and provides supporting analysis; the Lean Agent receives a focused task and works in a clean context. This separation also addressed a subtler issue: when facing a difficult proof, a single agent would read extensively and generate analyses while avoiding actual proof attempts. A fresh context with concrete backing from the Plan Agent reduced this avoidance behavior markedly.
Current LLMs optimize for whichever objective offers the clearest binary signal, at the expense of all others. Even when explicitly instructed to write clean, reusable code, Claude Code consistently prioritized the one criterion it could verify autonomously: compilation. In Problem 6, several trace inequalities for positive semidefinite matrices were natural candidates for extraction into standalone lemmas contributable to Mathlib; the agent kept them inlined because extraction would not advance the immediate compilation goal. In Problem 4, the agent defined its own Laplacian matrix and wrote a proof of its equivalence with the Mathlib definition, but never leveraged that equivalence to rewrite subsequent proofs in terms of the library type; as long as compilation passed with its custom definition, the agent saw no reason to pursue compatibility with the broader ecosystem. Similarly, when proof terms grew to over a thousand lines with heavy reliance on automated tactics, the compiler approached its heartbeat limit—the bound on reduction steps the kernel performs during type-checking. Hitting this limit normally signals that a proof term is unnecessarily complex: a practitioner would introduce intermediate steps, replace brute-force automation with targeted tactics, or factor out lemmas. The agent instead treated it as a resource constraint and increased set_option maxHeartbeats, leaving the underlying complexity untouched. This bias motivated a dedicated Polish phase, where the agent revisits its output with the sole objective of improving code quality. In our experiments, the Polish phase eliminated all set_option maxHeartbeats overrides and extracted an average of four reusable lemmas per problem.
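The heartbeat pattern can be sketched schematically in Lean. The snippet below is illustrative only; `mainBound`, `MainStatement`, and the helper lemma are placeholders, not code from Archon's output:

```lean
-- Before polish: the agent raises the kernel's reduction budget
-- instead of simplifying the proof term.
set_option maxHeartbeats 1600000 in
theorem mainBound : MainStatement := by
  sorry -- stands in for a thousand-line term built from heavy automation

-- After polish: factored lemmas keep every obligation within the
-- default budget, and the override disappears.
lemma intermediateStep : HelperStatement := by
  sorry -- small, targeted proof
theorem mainBound' : MainStatement := by
  sorry -- short assembly of the intermediate lemmas
```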
Pre-generating a complete informal proof before formalization eliminates a major source of wasted computation. Without one, the agent repeatedly re-derived natural-language arguments, questioned intermediate steps, and explored alternative paths before reverting. Once we provided a complete informal proof as a reference file, this hesitation disappeared. This finding motivated the Informal Agent: a dedicated component, operating in its own context and equipped with retrieval, that supplies such proofs on demand.
Some formalization obstacles arise not from mathematical difficulty but from misalignment between standard proofs and available formal infrastructure. In Problem 4, the system stalled on a lemma about continuous variation of polynomial roots. The natural proof uses Rouché’s theorem, which requires path integrals—a well-known gap in Mathlib. Previous systems rarely encountered this kind of obstacle: they failed at syntax and tactic selection long before formalization route choice became relevant.
We tested the frontier models directly: unconstrained, they invariably produced the complex analysis proof. This preference likely reflects the training distribution: proofs via complex analysis are widespread in textbooks and online resources, whereas the real analysis alternative via Vieta’s formulas, though equally straightforward, is rarely documented. When explicitly instructed to avoid Rouché’s theorem, the models did switch to a real analysis approach, but the resulting natural-language proofs consistently left certain local arguments vague and underspecified, even when asked for a gap-free derivation. This forced human intervention: a PhD student in pure mathematics on our team identified the Vieta’s formulas route. We injected one sentence into the agent’s context: “Use Vieta’s formulas.” The Informal Agent generated a complete proof; the Lean Agent formalized it within the hour. No Lean code, no detailed instructions—one natural-language hint was enough to reroute the entire proof past a structural gap in the library.
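For reference, Vieta's formulas express a monic polynomial's coefficients as elementary symmetric functions of its roots:

```latex
p(z) \;=\; \prod_{i=1}^{n} (z - r_i)
      \;=\; z^{n} + \sum_{k=1}^{n} (-1)^{k}\, e_k(r_1,\dots,r_n)\, z^{\,n-k},
\qquad
e_k(r_1,\dots,r_n) \;=\; \sum_{1 \le i_1 < \cdots < i_k \le n} r_{i_1}\cdots r_{i_k}.
```

Because each coefficient is a polynomial, hence continuous, function of the roots, the root-coefficient relationship can be studied with real-variable tools alone, which is why this single hint sufficed to route the proof around the missing path-integral infrastructure.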
With the rapid improvement in LLM capabilities, the central difficulty of autonomous formalization is no longer syntax or tactic selection—it is workflow design, context management, and formalization route planning. Each component of our pipeline addresses a concrete failure mode we observed: plan-execution separation cured avoidance behavior, the Informal Agent eliminated redundant reasoning, formalization route guidance resolved infrastructure misalignment, and the Polish phase corrected the agent’s tendency to ignore code quality in favor of compilation.
What the Results Tell Us
Archon produced fully compilable, verifiable formal proofs of research-level theorems, with internal structure clean enough for rapid human review; the complete code is available in our GitHub repository. The system organized each formalization into well-separated modules: auxiliary lemma libraries and main theorem files, with the proof chain following the mathematical argument faithfully. In Problem 4, where the informal proof proceeds through several intermediate results of increasing strength, the formal code preserved this layered structure with each variant of the main theorem clearly delineated. In Problem 6, a trace inequality used twice in the proof, $\operatorname{tr}(XC) \le \operatorname{tr}(YC)$ whenever $0 \preceq X \preceq Y$ and $C \succeq 0$, was correctly identified as a reusable result and extracted into a standalone lemma, trace_mul_le_of_loewner. The system also performs its own semantic check phase, verifying that formal definitions and theorem statements faithfully capture the intended mathematics. When a human expert subsequently reviews the output, the system can quickly walk them through the directory structure and identify the key definitions and statements whose correctness implies that the entire formalization is semantically sound. In our experience, this review takes roughly five minutes per problem.
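Written in Mathlib style, the extracted lemma's statement looks roughly like the following. The signature is our reconstruction from the inequality described above; the repository's actual statement and import path may differ.

```lean
import Mathlib.LinearAlgebra.Matrix.PosDef

/-- If `0 ≤ X ≤ Y` in the Loewner order and `C` is positive semidefinite,
then `tr (X * C) ≤ tr (Y * C)`. Illustrative signature only. -/
theorem trace_mul_le_of_loewner {n : Type*} [Fintype n] [DecidableEq n]
    {X Y C : Matrix n n ℝ} (hX : X.PosSemidef) (hYX : (Y - X).PosSemidef)
    (hC : C.PosSemidef) :
    (X * C).trace ≤ (Y * C).trace := by
  sorry -- proof omitted in this sketch
```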
The system’s output also demonstrates strong technical command of Lean and Mathlib. Its code consistently follows Mathlib style in naming, module structure, and documentation, and its proofs are written in a form that a human formalizer can readily inspect and extend. More importantly, the agent handled conventions and edge cases that require real understanding of the library rather than superficial pattern matching. In Problem 4, for example, the standard mathematical presentation involves division, but in Mathlib division by zero is defined to return zero. The agent recognized that this would distort the intended object and introduced invPhiN_poly as a working definition that avoided the pitfall while preserving the proof strategy. When technical errors did arise—whether from incorrect API assumptions or failed compilations—the system used diagnostics to identify and repair them instead of compounding the mistake. This is the level of library awareness required for large formalizations to succeed in practice.
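The division convention mentioned above is a one-liner to observe in Lean:

```lean
import Mathlib.Data.Real.Basic

-- In Mathlib, division is total: dividing by zero returns zero.
example : (1 : ℝ) / 0 = 0 := div_zero 1
```

This totality keeps definitions simple but silently changes the meaning of expressions that a paper proof would guard with a nonzero hypothesis, which is exactly the pitfall the agent avoided with invPhiN_poly.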
Beyond full autonomy, the system supports productive human interaction at a manageable cost of attention. A persistent challenge in earlier formalization systems was that Lean’s serial compilation interleaves fundamental errors (wrong proof strategy, missing infrastructure) with superficial ones (tactic misuse, Mathlib API mismatches), making it difficult for a human observer to assess whether the system is on the right mathematical track. Archon’s architecture resolves this: the Plan Agent’s summaries and the semantic check phase surface strategic issues while the Lean Agent handles technical errors autonomously. In practice, this meant our mathematician could interact once every 24 hours, spending roughly 20 minutes reviewing the agent’s progress reports and Lean code, and provide effective guidance. When the agent stalled on Problem 4 due to a Rouché’s theorem dependency that could not be formalized, a single sentence—“Use Vieta’s formulas”—was enough to redirect the entire proof. Structural feedback works similarly: if a reviewer identifies a section that should be refactored, they can say so in natural language and the agent executes the change. That said, the final proofs we report were produced by the fully autonomous pipeline; human interaction was used during development to study the system’s responsiveness, not to produce the delivered output.
Lean’s design makes these results consequential beyond the specific problems we solved. In Lean 4, the type checker is the final arbiter of correctness: if a proof compiles, introduces no new axioms, and contains no sorry placeholders, then the claimed theorem is proven—regardless of how the proof was constructed. A reviewer does not need to read or understand the proof body; only the definitions and theorem statements matter. This is qualitatively different from reviewing an informal paper proof, where every intermediate step requires human scrutiny. Our system produces output that meets exactly this standard. Combined with the semantic check that confirms the formal statements match the intended mathematics, the verification task reduces from “check the entire proof” to “check the statement”—minutes rather than hours.
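Lean makes this audit concrete via the `#print axioms` command:

```lean
theorem toy : 2 + 2 = 4 := by decide

-- Lists every axiom the proof depends on. A proof containing `sorry`
-- would surface here as a dependency on `sorryAx`.
#print axioms toy
```

A reviewer who runs this on the main theorem and checks the statement against the intended mathematics has verified everything; the proof body itself never needs to be read.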
We believe these results point toward a practical shift in how mathematical theorems are validated. A mathematician could submit a detailed natural-language proof and, within days, receive a machine-checked formal verification whose correctness can be independently confirmed by anyone with a Lean installation. The implication for peer review is direct: the correctness of a new result would no longer rest solely on the reviewer’s ability to follow an informal argument, but on a mechanically verified proof whose validity can be checked independently. This would increase the credibility of published results and substantially reduce the time and expertise required for review.
Looking Forward
For mathematicians working in Lean, the main bottleneck has always been labor. As Terence Tao has estimated, formalizing a proof currently takes roughly ten times the effort of writing it out informally. This ratio means that formalization has remained a specialized craft: most mathematicians cannot justify spending weeks translating a result that took days to prove. Our results show that this ratio is changing. For Problems 4 and 6, Archon completed in days what would have taken a human expert weeks, at a cost of under $2,000 total. We are approaching a regime where formalization is no longer the bottleneck.
However, significant challenges remain before autonomous formalization becomes routine. The most instructive is the class of problems where formalization cannot simply follow the informal proof step by step. In Problem 4, the natural proof of continuous variation of polynomial roots goes through Rouché’s theorem, which requires path integrals—a well-known gap in Mathlib. This is not an isolated edge case; it is a common situation in formalization practice: the standard textbook proof relies on infrastructure that does not yet exist in the formal library, and the agent must either build that infrastructure from scratch or find an alternative route that avoids it. Both options are genuinely difficult. Our agent showed early signs of this capability—when told to avoid complex analysis, it pivoted to a real-variable approach—but it still required a one-sentence human hint to identify the specific alternative (Vieta’s formulas). For harder formalizations, where such detours are more frequent and more consequential, the agent’s current ability to independently plan formalization routes is not yet sufficient. Separately, there is substantial work to do on code reusability. The agent’s output compiles and proves the claimed theorems, but many intermediate lemmas remain tightly coupled to the specific proof context rather than generalized for contribution to Mathlib.
Despite these gaps, we are optimistic about the trajectory. As frontier models improve in mathematical reasoning and libraries like Mathlib continue to grow, the time and cost of formalization will keep falling. We expect the near-term future to look like this: a mathematician writes a detailed natural-language proof, an autonomous system formalizes it within days, and a human expert confirms semantic alignment in minutes. The proof’s correctness then rests not on peer review of informal arguments but on a mechanically verified formal proof that anyone with a Lean installation can check. This would make peer review faster, more reliable, and more accessible. In mathematics, as in software engineering, machine-verified correctness is shifting from an aspiration to an expectation.
Community and Next Steps
An earlier version of Archon was first tested in collaboration with Bas Spitters and Ravi Vakil at the Aarhus Mathematics & AI Workshop in late January 2026. In that trial, the system did not yet follow the blueprint’s prescribed proof strategy strictly—it deviated from the pointless scheme approach—but still demonstrated promising performance on a research-level formalization task. The experience gave us concrete feedback that directly informed the architectural improvements described in this post.
Since then, we have continued to apply Archon in collaboration with the broader community. Working with Prof. Ingo Althöfer, we used Archon to formalize two improved variants of the Problem 6 constant, based on his new informal proofs that strengthen the lower bound from ε/256 to ε/20 and (3/40)·ε, at a total cost of roughly $50.
In the near term, we are actively using Archon on selected formalization projects and working to stabilize the run configuration so that others can use it with their own API keys. We plan to integrate Archon into our Mozi Lean Copilot system, making it accessible to the community as a hosted service. Once the system is stable enough for reproducible use, we intend to open-source it. We look forward to close collaboration with mathematicians and formalization practitioners, contributing what we have learned about LLM-driven automation back to the community to advance the state of formalization.