We are excited to share our latest progress on the reap tactic. We have extended reap into an end-to-end Monte Carlo Tree Search (MCTS) implementation. It now serves as a unified framework for tactic writing and reinforcement learning (RL) rollout. Based on this framework, we updated the backend generator model to a lightweight 1.7B-parameter model, reaper-1, achieving SOTA performance on minif2f-test among models of similar size. During training, we observed that reaper-1 demonstrated strong tactic-level reasoning capabilities and underwent drastic changes in proof style. This led us to rethink the boundaries of formal reasoning in small LLMs.
We will release the unified framework in the reap repository, and we hope that this framework and our experience can inspire more research at the intersection of interactive theorem proving and reinforcement learning.
Demo
Here is a demo of our MCTS implementation in Lean 4. The current version of reap can prove quite a few complex theorems from Mathematics in Lean.
Motivation
Since AlphaZero, MCTS-based reinforcement learning has been one of the clearest examples of search and learning amplifying each other. In the last few years, the same idea has become compelling in formal theorem proving. Systems such as HTPS, AlphaProof, ABEL, Aristotle, and BFS-Prover series have shown, in different ways, that model-guided search over proof states can scale from benchmark proving to IMO-level formal reasoning. We find these results deeply inspiring. At the same time, they are still hard for many researchers to inspect, reproduce, and modify. Unlike Go, Lean proof search is not a closed simulator with a small, fixed action space: every edge of the tree is a tactic execution, and each transition depends on Lean’s elaborator, metavariables, goal management, timeouts, error reporting, replay, and final checking. Small mistakes in this interface can make rollouts noisy or silently unstable.
Another barrier is visibility. Some impressive systems do not release the full inference and training stack, while others, such as ABEL, have valuable public code but still require substantial adaptation around model serving, training infrastructure, and the Lean interaction layer. Even when models or code are available, the subtle details of Lean interaction are often spread across services rather than presented as a small, inspectable proof-search loop. This makes it difficult for researchers who are excited by these phenomena to run experiments, debug failures, or study the dynamics of formal tree search by hand.
Our motivation for Reap came from the opposite direction. While building the original reap tactic, we noticed that a large part of an RL rollout worker is simply proof search inside Lean. If that search lives inside the Lean tactic system, it can operate directly on native proof states and Lean’s Meta APIs, avoiding much of the overhead and fragility of an external Lean execution loop. Aesop and ABEL’s Aesop-based TreeRepl interface both show how powerful this direction can be. What we wanted for our own RL research, however, was a smaller tactic-level interface with explicit tactic evaluation, logging, profiling, raw tree export, replay construction, and enough control over search hyperparameters to understand what the rollout is doing. This led us to reimplement MCTS in Lean 4 as a reap tactic: not as a replacement for the large-scale systems above, but as a lightweight and transparent bridge between everyday tactic use and reinforcement-learning rollout.
Reap keeps rollout, model serving, and training cleanly separated: rollout workers produce replay samples, the trainer emits new weights, and inference workers expose an OpenAI-compatible API back to the reap tactic.
Framework design
Reap implements the MCTS rollout loop directly as a Lean tactic. Each node stores a native Tactic.SavedState; expansion calls the policy, value, and premise-selection services on the current pretty-printed proof state; every candidate edge is checked by Lean before it enters the tree. The search parameters are ordinary Lean options, making the rollout behavior easy to configure from theorem files.
AND-OR proof states: Reap follows the AND-OR view used by AlphaProof and ABEL, but the split is decided from Lean states rather than from a separate environment abstraction. An OR node represents tactic choice: one solved child is enough. An AND node represents goal decomposition: all focused subgoals must be solved. In the implementation, Reap promotes a state to an AND node only when a tactic leaves multiple independent goals and none of them contains expression metavariables. This distinction matters. After constructor on P ∧ Q, Reap creates focused children and can reconstruct constructor\n· exact hP\n· exact hQ. After constructor on ∃ n : Nat, n = n, however, Lean introduces a witness metavariable, so Reap keeps the state as an OR node instead of pretending that the generated goals are independent proof obligations.
Tactic evaluation and budgets: Reap parses a tactic string with Lean’s tactic parser, rejects unsafe or non-training-friendly forms such as sorry, admit, repeat', and simp?, collects explicit global constants for premise accounting, then runs it inside TacticM with both a heartbeat limit and a wall-clock timeout. The resulting failures are structured as parse errors, forbidden tactics, Lean exceptions, timeouts, or error messages reported by Lean. This makes rollout data much easier to inspect: a branch that failed because nlinarith timed out is different from one that failed because the tactic was parsed incorrectly or produced an error message.
Replay and final checking: Reap treats a solved search path as a candidate proof, not as a proof until it can be replayed and checked. During search, states are restored from SavedStates and duplicate states are merged by a pretty-printed state key, so self-loops such as skip can be dropped and repeated transitions can share prior mass. When a solution is found, Reap reconstructs a Lean script from OR choices and AND bullets, replays it from the root state, and validates the original goals through temporary declarations. This catches cases where a tactic appears to close a goal but leaves hidden metavariables, sorry, invalid auxiliary declarations, or a proof that fails only at final declaration checking.
Tracing and profiling: Reap can export the full search tree, including states, tactics, priors, visits, values, used premises, and PUCT scores. It also writes wall-clock traces for premise selection, policy generation, value calls, and tactic evaluation, so failed or surprising rollouts can be inspected after the run instead of treated as opaque training noise.
Model training
Model architecture
In the RL stage, we use a shared-backbone policy-value architecture. The language-model head generates tactic candidates for the current proof state, while an additional categorical value head predicts a binned estimate of the remaining proof-search cost. Reap uses these two outputs to run Lean-native MCTS rollouts; successful trajectories are then converted into replay samples for the next learner update.
Reaper uses the LLM policy path for tactic generation and adds a value head that branches from the final hidden layer.
Training stages
We will describe the full training recipe in later technical reports. At a high level, reaper is initialized from Qwen3-1.7B-Base and trained in three stages:
- Continued pre-training on a corpus of mathematical texts and curated proof data.
- Supervised fine-tuning on theorem-proving data from mathlib4 and in-house human annotations. We use
jixiato extract tactic-state-premise triplets from Lean files and derive value labels for the corresponding states. This gives us roughly 1.5M supervised examples. - Reinforcement learning on high-quality formal statements, including curated textbook problems and AoPS problems. We use our previous autoformalization pipeline to translate informal statements and obtain around 25K formal statements for rollout.
For the first four RL iterations, we run one rollout pass over the full problem set and collect replay samples from successful trajectories. After this warm-up stage, we split the problems into smaller batches. At each RL step, we run four rollout passes on one batch, filter samples by proof length and root-node value, and use the selected trajectories to update both the policy and value heads. We still rehearse the mathlib4 data in each RL step to avoid catastrophic forgetting.
The model is still being trained. The curve below shows its early-stage reinforcement-learning progress. Currently we achieve 77.5% accuracy on minif2f-test pass@32, and 78.7% accumulated during RL steps. We will continue to improve the model and report the final results in future updates.
Performance
We evaluate the model on minif2f-test. The following chart compares reaper-1 with open-source step-prover models of similar scale. Closed-source systems are still far ahead: Specially, AlphaProof reports 99.6% accuracy on minif2f-test with only 3B parameters, so there remains a large gap to close.
Interesting observations
During training, we observed drastic changes in proof style. Here are some interesting proofs that emerged during training.
Early checkpoints already show some mathematical insight, but they still rely on very greedy tactic usage.
theorem numbertheory_x5neqy2p4 (x y : ℤ) : x ^ 5 ≠ y ^ 2 + 4 := by
by_contra! h10'
have h10 : x ^ 5 % 11 = (y ^ 2 + 4) % 11 := by rw [← h10']
have h11 : x % 11 = 0 ∨ x % 11 = 1 ∨ x % 11 = 2 ∨ x % 11 = 3 ∨ x % 11 = 4 ∨ x % 11 = 5 ∨ x % 11 = 6 ∨ x % 11 = 7 ∨ x % 11 = 8 ∨ x % 11 = 9 ∨ x % 11 = 10 := by omega
rcases h11 with (h11 | h11 | h11 | h11 | h11 | h11 | h11 | h11 | h11 | h11 | h11) <;>
simp [h11, pow_two, pow_succ, Int.mul_emod, Int.add_emod] at h10' h10 <;>
try (first | revert h10' | revert h10 | omega) <;>
(
{
have h100 : y % 11 = 0 ∨ y % 11 = 1 ∨ y % 11 = 2 ∨ y % 11 = 3 ∨ y % 11 = 4 ∨ y % 11 = 5 ∨ y % 11 = 6 ∨ y % 11 = 7 ∨ y % 11 = 8 ∨ y % 11 = 9 ∨ y % 11 = 10 := by omega
rcases h100 with (h100 | h100 | h100 | h100 | h100 | h100 | h100 | h100 | h100 | h100 | h100) <;>
simp [h100, pow_two, Int.mul_emod, Int.add_emod] at h10 <;>
omega
}
)
With more training, the model learns to write more structural and elegant proofs and to use tactics more flexibly.
theorem induction_nfactltnexpnm1ngt3_rl4_09 (n : ℕ) (h₀ : 3 ≤ n) : n ! < n ^ (n - 1) := by
induction' n, h₀ using Nat.le_induction with n h₀ h₁
<;> simp [factorial_succ, pow_succ, mul_comm, mul_left_comm, mul_assoc, *]
have h₂ : n ^ (n - 1) < (n + 1) ^ (n - 1) := ?_
· rcases n with (_ | _ | _ | _ | n) <;> simp_all [pow_succ]
· try simp [factorial] at h₁ ⊢ <;> try omega
· linarith [factorial_succ (n + 1 + 1 + 1 + 1), h₁, h₂]
· exact Nat.pow_lt_pow_left (by cutsat) (by cutsat)
theorem aime_1997_p9 (a : ℝ) (h₀ : 0 < a)
(h₁ : 1 / a - Int.floor (1 / a) = a ^ 2 - Int.floor (a ^ 2)) (h₂ : 2 < a ^ 2) (h₃ : a ^ 2 < 3) :
a ^ 12 - 144 * (1 / a) = 233 := by
field_simp [h₀.ne'] at h₁ ⊢
have h₄ : ⌊a^2⌋ = 2 := by
rw [Int.floor_eq_iff]
norm_num
constructor
· nlinarith
· nlinarith [sq_nonneg (a - Real.sqrt 3), Real.sq_sqrt (by positivity : (0 : ℝ) ≤ 3)]
norm_num [h₄] at h₁ <;> simp_all [sq]
rcases lt_trichotomy (a⁻¹) 1 with h₅|h₅|h₅
· have h₅' : (0 : ℝ) < a⁻¹ := inv_pos.mpr h₀
have h₅'' : a⁻¹ - 1 < 0 := by linarith [h₅]
have : ⌊a⁻¹⌋ = 0 := by
apply Int.floor_eq_zero_iff.mpr
exact ⟨by positivity, by linarith⟩
simp_all [zpow_ofNat, Real.rpow_def_of_pos, zpow_zero, mul_one]
grind [zpow_ofNat, zpow_neg, div_eq_mul_inv,
mul_inv_cancel_right₀, inv_nonneg, sq_nonneg, sub_nonneg]
· field_simp at h₅ <;> nlinarith [h₀, h₂, h₃]
· have h₆ : a⁻¹ < 2 := by
by_contra h₆
have h₇ : (2 : ℝ) ≤ a⁻¹ := by linarith [h₆]
have h₈ : (2 : ℝ) * a ≤ 1 := by
calc
(2 : ℝ) * a ≤ a⁻¹ * a := by gcongr
_ = 1 := by field_simp; nlinarith [h₃]
have h₇ : ⌊a⁻¹⌋ = 1 := Int.floor_eq_iff.mpr ⟨by linarith, by linarith⟩
simp_all [h₇, Int.cast_one, pow_one, mul_one, sub_left_inj]
nlinarith [Int.floor_eq_iff.mp h₄, Int.floor_eq_iff.mp h₇]
Next steps
This project is still in an experimental stage. We have seen the model gradually learn to use more complex tactics during RL, but we are more interested in whether small models can also develop deeper mathematical insight through search and self-improvement. One of our main goals is to better understand the boundary of small-model reasoning in Lean, and to see how far this line of work can move toward the kind of high-level problem-solving behavior demonstrated by systems such as AlphaProof and Aristotle.
At the same time, Reap is also meant to be a practical tactic for the Lean community. We plan to continue adding data from broader areas of mathlib, strengthen the model’s coverage across different mathematical domains, and keep improving the tooling around rollout, debugging, and deployment. We hope Reap can become both a useful proof-search assistant and a transparent platform for studying reinforcement learning in formal theorem proving.
Acknowledgements
This project is a collaborative effort from PKU AI4Math team and IQuest Research. We also thank the Lean community for their constant feedback and support for the Reap tactic.
Citation
If you find this work useful, please cite this blog post as:
@misc{wang2026reap,
title = {Reap: Decoding Monte Carlo Tree Search in Lean 4},
author = {Wang, Yutong and Ren, Xuanzhi and Xu, Tianyi and Liu, Ruitong and Dong, Bin and Wu, Peihao and Dai, Bryan},
year = {2026},
month = jun,
note = {Blog post},
url = {https://frenzymath.com/blog/reaper-1/}
}