The web service is available at matlas.ai, and the API documentation is available at matlas.ai/docs.
Abstract
Retrieving mathematical knowledge is a central task both in human-driven research, where it supports determining whether a result already exists, finding related results, and identifying historical origins, and in emerging AI systems for mathematics, where reliable grounding is essential. However, the scale and structure of the mathematical literature pose significant challenges: results are distributed across millions of documents, and individual statements are often difficult to interpret in isolation due to their dependence on prior definitions and theorems.
We introduce Matlas, a semantic search engine for mathematical statements. Matlas is built on a large-scale corpus of 8.07 million statements extracted from 435K peer-reviewed papers spanning 1826 to 2025, drawn from a curated set of 180 journals selected using an ICM citation-based criterion, together with 1.9K textbooks. From these sources, we extract mathematical statements together with their dependencies, construct document-level dependency graphs, and recursively unfold statements in topological order to produce more self-contained representations. On top of this corpus, we develop a semantic retrieval system that enables efficient search for mathematical results using natural language queries.
We hope that Matlas can improve the efficiency of theorem retrieval for mathematicians, provide a structured source of grounding for AI systems tackling research-level mathematical problems, and serve as part of the infrastructure for mathematical knowledge retrieval.
Introduction
Retrieving mathematical results is a fundamental activity in mathematical research. Discovering related results and variants is essential for advancing new work, while determining whether a result already exists helps avoid redundant effort and overlooked prior contributions. Identifying the original source and historical origin of a theorem is also crucial for proper attribution. However, the mathematical literature has grown dramatically over time and now contains millions of articles and books, making it increasingly challenging to locate relevant results efficiently.
Beyond its importance for human researchers, mathematical retrieval is also a key component for AI systems aiming to perform research-level mathematics. Recent advances in large language models (LLMs) have enabled them to tackle certain research-level problems (Feng et al., 2026; Feng, 2026; Patel, 2026; Lee and Seo, 2026). Even so, grounding such systems in existing mathematical knowledge is likely to be beneficial. In particular, two aspects are critical: reliability and self-containment. For reliability, peer-reviewed publications are generally more trustworthy than arXiv preprints, despite the possibility of errors. For self-containment, mathematical statements inherently depend on prior definitions and results; as a consequence, an isolated statement is often difficult to interpret without its dependencies. This motivates the need to recursively unfold statements so that they become self-contained and interpretable.
Motivated by these considerations, namely supporting human mathematicians and providing stronger grounding for AI systems, we develop Matlas, a semantic search engine for mathematics. Matlas is built on a large-scale corpus comprising 8.07 million mathematical statements extracted from 435K published papers and 1.9K textbooks. The papers are drawn from a curated set of 180 journals selected via an ICM citation-based criterion, spanning the years 1826 to 2025. From these sources, we extract statements together with their dependencies, construct directed dependency graphs for each document, and unfold statements in topological order to obtain more self-contained representations. On top of this corpus, we build a semantic search system to enable efficient retrieval of mathematical results.
An early version of Matlas was used in our work on Automated Conjecture Resolution with Formal Verification, where it helped identify a key technical result that supported a reasoning agent in resolving an open conjecture, suggesting its potential value in mathematical research.
Related Work
In this section, we review prior work on semantic search engines for both formal and natural language mathematics. For Lean’s mathematical library mathlib4, several semantic search systems have been developed, including Moogle, LeanSearch, LeanExplore, LeanFinder, and LeanDex. Most of these systems construct an aligned informal–formal corpus derived from mathlib4 and perform retrieval over either the informal corpus alone or a combination of informal and formal representations. For natural language mathematics, Alexander et al. (2026) construct a semantic search engine for mathematical statements using a corpus drawn primarily from arXiv papers, with a smaller portion from sources such as ProofWiki, the Stacks Project, and the Open Logic Project. For arXiv papers, they extract theorem statements from LaTeX sources and use an LLM to generate a short natural language description for each statement based on the statement itself and the introduction of the paper. This description, rather than the original statement, serves as the representation of the theorem and is used for embedding. Our work differs from Alexander et al. (2026) in two main aspects. First, our corpus consists of published papers from a curated set of 180 journals (spanning 1826 to 2025) together with textbooks, rather than arXiv papers, whose coverage begins in 1991. Second, we aim to make statements self-contained: we extract dependencies between statements, construct a directed graph for each document, and unfold the statements hierarchically based on this dependency structure.
Methodology
In this section, we describe our methodology for building Matlas. Figure 2 provides an overview of the framework. To construct a theorem search engine based on relatively reliable sources, we select a set of 180 journals using an ICM citation-based criterion and collect 435K published papers spanning 1826 to 2025 from these journals, together with 1.9K textbooks. While peer review does not guarantee absolute correctness, it generally provides a higher level of reliability than preprints such as those on arXiv. We then apply OCR to the PDFs to obtain raw text.
From this corpus, we extract 8.07 million mathematical statements together with their dependencies, since many statements rely on preceding definitions or results. For each paper or book, we construct a directed graph capturing these dependencies, and then unfold the statements in topological order to produce self-contained statements. Next, we build the semantic search engine by embedding the processed statements into vector representations and indexing them in a vector database. Given a query, we embed it into a vector and retrieve relevant statements using cosine similarity.
Corpus Construction
We construct a set of 180 journals through an ICM-based filtering process. Specifically, we first gather the full collection of ICM proceedings and extract the references cited in each paper. For every journal, we then count how many times its papers published between 2007 and 2021 are cited by these proceedings. We keep only those journals that receive more than 50 such citations and that have published at least 100 papers over the same period. We obtain metadata for papers in these journals from MathSciNet and collect 606K PDFs of such papers, which represent a subset of all papers published in these journals throughout their history. Later, we find that only 435K of them contain mathematical statements. Some papers indexed by MathSciNet, such as certain applied mathematics, computational physics, and computational chemistry papers, do not contain mathematical statements. For textbooks, we compile a book list from five publishers: Springer, Princeton, Cambridge, Oxford, and AMS. This list contains 3.3K textbooks, of which we are able to collect 1.9K. After collecting the papers and textbooks, we apply OCR to the PDFs to prepare them for further processing.
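The journal selection rule above can be sketched as a simple filter. The function below is a minimal, hypothetical illustration: `citation_counts` and `paper_counts` stand in for the aggregates over 2007 to 2021, and the toy journal names and numbers are invented; only the thresholds (more than 50 ICM citations, at least 100 papers) come from the text.

```python
# Hypothetical sketch of the ICM citation-based journal filter described above.
# `citation_counts`: how often each journal's 2007-2021 papers are cited by
# ICM proceedings; `paper_counts`: papers published in the same period.

def select_journals(citation_counts: dict[str, int],
                    paper_counts: dict[str, int],
                    min_citations: int = 50,
                    min_papers: int = 100) -> list[str]:
    """Keep journals cited more than `min_citations` times by ICM
    proceedings and publishing at least `min_papers` papers."""
    return sorted(
        j for j, c in citation_counts.items()
        if c > min_citations and paper_counts.get(j, 0) >= min_papers
    )

# Toy example with made-up numbers:
citations = {"Ann. of Math.": 900, "J. Obscure Results": 12}
papers = {"Ann. of Math.": 1200, "J. Obscure Results": 300}
print(select_journals(citations, papers))  # ['Ann. of Math.']
```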
Statement and Dependency Extraction
In this subsection, we focus on extracting statement-level units from mathematical texts, such as definitions, theorems, and lemmas, together with their dependency relations. A straightforward approach would be to either provide the entire document to a large language model or divide the document into smaller chunks and process them independently. However, neither approach is well-suited for mathematical content: the full document often exceeds model context limits, while naive chunking can disrupt the internal structure of statements by splitting them across boundaries.
To overcome these issues, we design a two-stage pipeline (Figure 3). In the first stage, an LLM processes the markdown text produced by PDF OCR and generates document-specific regular expressions that reflect local formatting patterns of statements. These patterns are then applied to the text to identify candidate statement spans, each corresponding to a single statement unit. The objective of this stage is to reliably localize such spans and produce stable, statement-centered inputs for subsequent processing.
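To illustrate this first stage, the snippet below applies a document-specific pattern to OCR markdown to locate candidate statement spans. The regular expression and the toy document are illustrative assumptions, not actual outputs of the pipeline's LLM.

```python
import re

# Hypothetical sketch: applying an LLM-generated, document-specific pattern
# to the OCR markdown to locate candidate statement spans. This pattern is
# an illustrative example, not an actual output of the pipeline.
pattern = re.compile(
    r"^(Theorem|Lemma|Definition|Proposition)\s+\d+(?:\.\d+)*\.?",
    re.MULTILINE,
)

doc = """Some introductory text.
Theorem 2.1. Every finite group of prime order is cyclic.
Proof sketch omitted.
Definition 3.4. A group is simple if ...
"""

# Each match start marks the beginning of one candidate statement unit;
# here the span simply runs until the next match (or end of document).
starts = [m.start() for m in pattern.finditer(doc)]
spans = [doc[s:e].strip() for s, e in zip(starts, starts[1:] + [len(doc)])]
for s in spans:
    print(s.splitlines()[0])
```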
In the second stage, the extracted candidates are handled in small, partially overlapping batches, with a default batch size of five. For each batch, we construct forward context windows around the candidate positions in the OCR markdown text, using a default window length of 4000 characters. Overlapping windows are merged before being passed to the model. Based on this merged context, the LLM structurer transforms each candidate into a structured representation that includes its type, textual content, and local dependency links. We use DeepSeek-V3.2 as the LLM in this pipeline.
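The window-merging step can be sketched as follows. This is a minimal illustration under the stated defaults (windows of 4000 characters around candidate positions); the helper name and the forward-only window shape are assumptions for the sketch.

```python
# Hypothetical sketch of the forward context windows described above: for
# each candidate position we take a fixed-length window, then merge any
# overlapping windows so the model sees each region of text only once.

def merge_windows(positions: list[int], doc_len: int,
                  window: int = 4000) -> list[tuple[int, int]]:
    """Build [pos, pos + window) spans and merge those that overlap."""
    spans = sorted((p, min(p + window, doc_len)) for p in positions)
    merged: list[tuple[int, int]] = []
    for start, end in spans:
        if merged and start <= merged[-1][1]:
            # Overlaps (or touches) the previous window: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

print(merge_windows([0, 2500, 9000], doc_len=12000))
# windows at 0 and 2500 overlap and merge; the one at 9000 stays separate
```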
Extracting dependencies is a key component of the pipeline. Mathematical statements are often not self-contained, as they depend on earlier definitions, notation, or intermediate results. Therefore, the system also identifies local dependency relationships among statements. This information is valuable both for understanding the structure of the text and for supporting the subsequent unfolding stage.
Statement Unfolding
After extracting statements and their dependencies from each paper and textbook, we construct a directed graph for each document. Each node represents a mathematical statement, and there is an edge from node A to node B if and only if B depends on A. An example is shown in Figure 4.
To make statements self-contained, we then unfold this graph in topological order. Specifically, we first partition the graph into layers, where each layer consists of nodes with zero in-degree after removing all nodes from the preceding layers. We then process the layers sequentially, starting from layer 0 to the final layer, recursively expanding each statement using the already processed statements from earlier layers.
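The layering described above is a layered variant of Kahn's topological sort. The sketch below illustrates it on a toy dependency graph; the graph and statement names are invented, and the edge convention (A → B means "B depends on A") follows the text.

```python
# Minimal sketch of the layering step: nodes with zero in-degree form
# layer 0, and each subsequent layer consists of nodes whose in-degree
# drops to zero once earlier layers are removed.

def topological_layers(edges: dict[str, list[str]]) -> list[list[str]]:
    nodes = set(edges) | {v for vs in edges.values() for v in vs}
    indeg = {n: 0 for n in nodes}
    for u in edges:
        for v in edges[u]:
            indeg[v] += 1
    layers = []
    current = sorted(n for n in nodes if indeg[n] == 0)  # layer 0
    while current:
        layers.append(current)
        nxt = []
        for u in current:
            for v in edges.get(u, []):
                indeg[v] -= 1
                if indeg[v] == 0:
                    nxt.append(v)
        current = sorted(nxt)
    return layers

# Edge A -> B means "B depends on A", matching the convention above.
deps = {"Def 1": ["Lemma 2", "Thm 3"], "Lemma 2": ["Thm 3"]}
print(topological_layers(deps))  # [['Def 1'], ['Lemma 2'], ['Thm 3']]
```

Processing the layers in this order guarantees that when a statement is expanded, every statement it depends on has already been unfolded.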
We note that LemmaBench also aims to make statements in mathematical papers self-contained by extracting the definitions and assumptions required for a given statement. They explore two approaches. The first is full-context retrieval, where all content preceding the target statement is provided to an LLM, which then identifies the relevant definitions and assumptions. The second is vector-based retrieval, where an LLM first extracts non-trivial objects appearing in the statement; regular expressions are then used to locate paragraphs containing these objects, followed by dense retrieval to select the top-$k$ candidate paragraphs. These paragraphs, together with the target objects, are then fed into an LLM to extract the necessary definitions and assumptions. Our method differs from theirs in that we do not attempt to recover all dependent statements in a single pass. Instead, we extract only first-order dependencies (corresponding to the previous layer) and expand statements incrementally in a layer-by-layer manner.
After obtaining the unfolded statements, totaling 8.07 million, we build our theorem search engine using dense retrieval. Specifically, we encode each unfolded statement into a vector representation using Qwen3-Embedding-8B and store these vectors in a vector database. Given a query, we prepend a theorem-retrieval-oriented instruction to the input, encode the resulting text into a vector, and retrieve relevant statements from the database using cosine similarity.
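The retrieval step can be sketched in a few lines. The `embed` function below is a deterministic stand-in for Qwen3-Embedding-8B, the in-memory matrix stands in for the vector database, and the instruction prefix is illustrative rather than the actual one used; with unit-norm vectors, cosine similarity reduces to a dot product.

```python
import hashlib
import numpy as np

# Minimal sketch of dense retrieval over unfolded statements. The toy
# `embed` below is a stand-in for the real embedding model.

def embed(text: str, dim: int = 8) -> np.ndarray:
    # Deterministic toy embedding; a real system would call the model.
    seed = int.from_bytes(hashlib.sha256(text.encode()).digest()[:4], "big")
    rng = np.random.default_rng(seed)
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)  # unit-normalize for cosine similarity

statements = ["Every finite group of prime order is cyclic.",
              "Compact subsets of Hausdorff spaces are closed."]
index = np.stack([embed(s) for s in statements])  # rows are unit vectors

# Prepend a retrieval-oriented instruction to the query (illustrative text).
query = "groups of prime order"
q = embed("Instruct: retrieve relevant theorems\nQuery: " + query)

scores = index @ q            # dot product == cosine similarity here
top = int(np.argmax(scores))
print(statements[top])
```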
Conclusion
In this paper, we develop Matlas, a semantic search engine for mathematical statements drawn from relatively reliable sources. The underlying corpus is constructed by collecting PDFs of papers published in a curated set of 180 journals, selected using an ICM citation-based criterion to filter out relatively low-quality venues, together with 1.9K textbooks. These documents are processed through two phases: statement and dependency extraction, and statement unfolding. In the statement and dependency extraction phase, we first identify the positions of individual statements, and then assemble batches of statements along with their surrounding context as input to an LLM for dependency extraction. In the statement unfolding phase, we construct a directed graph for each document based on the extracted dependencies, and unfold statements in topological order to obtain more self-contained representations.
We hope that Matlas can improve the efficiency of theorem retrieval for mathematicians and provide stronger grounding for AI systems tackling research-level mathematical problems using reliable sources. By organizing mathematical knowledge at the level of individual statements and their dependencies, Matlas may also facilitate the transfer and reuse of tools and results across different areas of mathematics. Overall, we view Matlas as a step toward building infrastructure for mathematical knowledge retrieval that supports both human and AI-driven mathematical discovery.