We are excited to share our latest work: Archon, an early-stage automated formalization system capable of translating cutting-edge mathematical proofs, written at a typical human level of detail, into Lean 4. We tested Archon on the informal proofs of FirstProof Problems 4 and 6 provided by OpenAI, and it produced complete formalizations for both. Problem 6 was formalized fully autonomously, while Problem 4 required only a single natural-language hint about the proof direction for one specific lemma. In total, formalizing Problem 4 took 50 hours and cost $1,200; Problem 6 took 30 hours and cost $750. To put these figures in perspective, a human formalization expert typically writes around 250 lines of Lean per day, whereas our solutions contain ~8,800 lines for Problem 4 and ~3,200 lines for Problem 6.
Recent advances in Large Language Models (LLMs) have significantly accelerated progress in automated theorem proving, yet robustly formalizing frontier research at human-level proof granularity remains a formidable challenge. Building on pioneering work in the field, our research explores integrating state-of-the-art proprietary LLMs into comprehensive agentic systems to address these persistent gaps.
Archon is an agentic system built on Claude Code and powered by Claude Opus 4.6. We packaged an updated version of our in-house mathematical information retrieval tool, LeanSearch, as a Model Context Protocol (MCP) tool and provided it to the agent. By assigning this tool a high weight in the agent’s retrieval and decision-making pipeline, we significantly improved the system’s accuracy in locating target theorems and tactics within Mathlib. Additionally, following a design philosophy similar to the Numina Agent, we integrated an MCP tool that lets Archon consult external proprietary models (e.g., Gemini, GPT) for natural-language “informal guides” on complex reasoning. Furthermore, building on existing open-source Lean 4 skills, we explicitly injected tacit expert knowledge into the system’s action space. This careful skill design reflects a dual expertise. Our formalization experts contributed the structural working patterns, code conventions, and corner-case handling strategies that mirror authentic mathematical practice. Our LLM experts, in turn, crafted skills that shape the agent’s behavior to match how mathematicians naturally approach formalization, channeling the model’s strengths and compensating for its weaknesses, particularly around context-window management. Together, these skills unlock the full performance of Opus 4.6.
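To make the retrieval step concrete, here is a toy illustration of our own (not taken from Archon’s logs): a natural-language query such as “the sum of two even integers is even” should resolve to an existing Mathlib lemma rather than being reproved from scratch.

```lean
import Mathlib

-- Toy retrieval target (illustrative only): the natural-language query
-- "the sum of two even integers is even" should surface `Even.add`,
-- letting the agent close the goal with one existing lemma.
example (a b : ℤ) (ha : Even a) (hb : Even b) : Even (a + b) :=
  Even.add ha hb
```

Reliably surfacing such lemmas is what keeps the agent from re-deriving standard facts and burning its context window on them.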
We selected FirstProof for our evaluation because it represents research-level mathematics and is recent enough to guarantee zero data contamination, ensuring the originality of our system’s reasoning. Out of the 10 problems, our formalization experts identified Problems 4, 6, 9, and 10 as the most viable candidates because they possessed sufficient foundational infrastructure within Mathlib, making their formalization feasible under a reasonable workload (unlike the others, which lacked basic definitions). Further analysis revealed that Problem 9 is largely a constructive problem, where the core difficulty lies in natural language construction rather than formalization. Additionally, Problem 10 focuses on algorithmic explanation (such as explaining how an iterative linear solver works), making it difficult to define a clear, objective standard for successful formalization. This left Problems 4 and 6 as our primary targets. We provided Archon with OpenAI’s informal proofs—which were pre-verified by our math experts—and let the system run.
The formalization of Problem 4 highlighted a common challenge: certain areas in Mathlib are notoriously difficult to navigate. The standard analytical approach to Problem 4 relies on Rouché’s theorem, which is currently absent from Mathlib and would require a massive detour into algebraic topology to implement—a well-known hurdle among formalization practitioners. To address this, we injected expert rules instructing the agent to avoid known Mathlib roadblocks. Consequently, the agent intelligently pivoted away from complex analysis, opting to work within the real numbers instead.
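As a toy illustration of this kind of pivot (ours, not the agent’s actual argument for Problem 4): a root that a complex-analytic argument would locate via Rouché’s theorem can often be produced over ℝ directly with the intermediate value theorem, which is well supported in Mathlib.

```lean
import Mathlib

-- Toy example (not the actual Problem 4 route): find a real root of
-- x^3 - x - 1 on [0, 2] via the intermediate value theorem, avoiding
-- any complex-analytic machinery.
example : ∃ x ∈ Set.Icc (0:ℝ) 2, x ^ 3 - x - 1 = 0 := by
  have hcont : ContinuousOn (fun x : ℝ => x ^ 3 - x - 1) (Set.Icc 0 2) := by
    fun_prop
  -- `intermediate_value_Icc` gives Icc (f 0) (f 2) ⊆ f '' Icc 0 2.
  have key := intermediate_value_Icc (by norm_num : (0:ℝ) ≤ 2) hcont
  -- f 0 = -1 and f 2 = 5, so 0 lies between the endpoint values.
  obtain ⟨x, hx, hfx⟩ := key (by norm_num [Set.mem_Icc])
  exact ⟨x, hx, by simpa using hfx⟩
```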
During this process, Archon called upon Gemini to generate a natural language sub-proof for one of the intermediate steps. However, the system stalled on a specific lemma regarding the continuous variation of polynomial roots. The LLMs tended to gloss over this proof, failing to provide the explicit, step-by-step logic required to guide Lean. This lemma, requiring only fundamental calculus, is straightforward for an undergraduate math student. At this juncture, a supervising math PhD student intervened, providing a single informal hint: use Vieta’s formulas. Guided by this insight, Gemini generated a rigorous proof, and the agent promptly completed the formalization. Aside from this single informal mathematical nudge, the entire process was fully autonomous.
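To illustrate the flavor of the hint (a hypothetical miniature, not the actual stuck lemma): for a monic quadratic, Vieta’s formulas express the coefficients in terms of the roots, so statements about roots varying continuously can be restated in terms of the obviously continuous coefficient maps b = -(r + s) and c = r·s.

```lean
import Mathlib

-- Hypothetical miniature (not the lemma from Problem 4): Vieta's
-- formulas for a monic quadratic.  If r and s are distinct roots of
-- x^2 + b*x + c, the coefficients are determined by the roots.
example (b c r s : ℝ) (hr : r ^ 2 + b * r + c = 0)
    (hs : s ^ 2 + b * s + c = 0) (hne : r ≠ s) :
    b = -(r + s) ∧ c = r * s := by
  have hb : b = -(r + s) := by
    -- Subtracting the two root equations factors as (r - s)(r + s + b) = 0.
    have hrs : (r - s) * (r + s + b) = 0 := by linear_combination hr - hs
    rcases mul_eq_zero.mp hrs with h | h
    · exact absurd (sub_eq_zero.mp h) hne
    · linarith
  -- Substituting b back into the equation for r yields c = r * s.
  have hc : c = r * s := by linear_combination hr - r * hb
  exact ⟨hb, hc⟩
```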
As for Problem 6, the formalization was completed entirely autonomously, without any human intervention. The provided informal proof aligned well with existing Mathlib infrastructure, allowing the system to proceed smoothly from start to finish. We are releasing the formalized proofs in our GitHub repo, along with the detailed interaction log showing the single informal hint provided for the stuck lemma in Problem 4.
A follow-up technical blog post in the coming week will detail our agent pipeline, the intricate problem-solving process, and our insights from leveraging Claude Opus 4.6.
To summarize: this case demonstrates that as long as the foundational definitions exist in Mathlib, papers built upon them can be moved from prose to code with significant efficiency. The interaction is also low-maintenance: you can check in periodically, perhaps once every day or two, to verify progress and provide the high-level guidance needed to maintain momentum.
By automating much of the tedious syntactic debugging that once stalled progress, human experts can increasingly focus on strategic route selection and conceptual mathematical innovation. We look forward to continuing this work to further streamline the formalization of new mathematical discoveries.