Research Blog · Neural-Symbolic AutoformalizationAustin Shen

Current Limitations of Autoformalization for Mathematics

A living document tracking open problems, recent papers, and field progress. Lean 4 / Mathlib.

AbstractRecent advances in LLMs demonstrate high proficiency in informal mathematical reasoning and code generation. However, end-to-end autoformalization — translating natural language mathematics into formal verifiers like Lean 4 — remains brittle. This blog tracks structural and semantic bottlenecks, updated as new papers emerge.
9
Total
6
Open
3
Partial
0
Resolved
⚙️Type SystemErrors from type inference, coercions, and missing type declarations1 error
01.The Translation Paradoxmajoropen
Description

Informal mathematics relies on high-context ambiguity; formal mathematics requires zero-ambiguity type precision. Type inference fails because natural language omits declarations.

The Issue

Human writers omit type declarations — "Let n be a number" implies n ∈ ℕ or n ∈ ℝ depending on context.

Failure Mode

Model infers types stochastically, guessing Float for reals or Int for naturals, causing cascading type mismatch errors.

Example
Input"Let x be a number…"
Output(x : Float) … → Compiler Error: expected Set ℝ, got Float.
References
Tarrach et al. 2024Wu et al. 2025
SemanticsFailures in preserving the mathematical meaning of a statement3 errors
01.Dimensional Misalignmentmajoropen
Description

Models correctly identify algebraic structures but fail to capture dimensionality, producing valid code that formalizes a different problem.

The Issue

Notation ℚ[x,y] vs ℚ[x] provides minimal visual distinction but a critical semantic difference.

Failure Mode

Model defaults to univariate Polynomial instead of MvPolynomial (Fin 2), treating Y as a constant coefficient.

Example
Input"Prove that x²+y²−1 is irreducible in ℚ[x,y]."
OutputIrreducible (X²+Y²−1 : Polynomial ℚ) := sorry
References
Herald 2025Lu et al. 2025
Addressed / Partially Addressed By
02.Semantic Drift — Tautology Trapmajoropen
Description

Models optimize for token likelihood, generating valid code that is mathematically trivial compared to the intended statement.

The Issue

Complex geometric predicates are statistically rare; simple tautologies are statistically common in training data.

Failure Mode

Model rewrites "Show four points form a convex quadrilateral" as "The convex hull of four points is convex" — a tautology.

Example
Input"Show that four points form a convex quadrilateral."
Output"Show that the convex hull of four points is convex." (tautology)
References
PutnamBench 2024Liu et al. 2024
03.Implicit Knowledge Gapstructuralopen
Description

Human textbooks omit steps "obvious" to readers that formal compilers strictly require, causing missing type class instance errors.

The Issue

Proofs say "it is trivial to see that X is a group" — the implicit prerequisite is absent from the source text.

Failure Mode

Lean 4 rejects code because a type class instance is missing from the dependency graph.

Example
Input"It is trivial to see that X is a group…"
Outputfailed to synthesize instance Group X
References
Tarrach et al. 2024Wu et al. 2025
{}SyntaxStructural incompatibility with Lean 4 syntax rules1 error
01.Definition Bottleneckstructuralpartial
Description

Fundamental conflict between the imperative flow of natural language and the declarative structure of dependent type theory.

The Issue

Informal math defines objects linearly. Formal systems require dependency injection or hypothesis lifting.

Failure Mode

Models insert let bindings into theorem signatures, syntactically illegal in Lean 4.

Example
Input"First let x = 5, now consider f(x)…"
OutputSyntax error: expected ';' or unexpected token 'let'.
References
Jiang et al. 2023Li et al. 2025
Addressed / Partially Addressed By
📚LibraryHallucinated or missing Mathlib definitions and identifiers1 error
01.Library Hallucinationmajorpartial
Description

Models hallucinate Mathlib predicates that don't exist, causing identifier errors despite correct mathematical reasoning.

The Issue

Lean's Mathlib is volatile and lacks formalized versions of many advanced concepts.

Failure Mode

Model invents predicates like IsIcosahedronSet, UniformDistrib, IsNormal — all absent from Mathlib.

Example
Input"Let G be a normal subgroup…"
Outputunknown identifier 'IsNormal'
References
ATLAS 2025Herald 2025Wang et al. 2025
Addressed / Partially Addressed By
EvaluationProblems in the verification and scoring pipeline itself1 error
01.Errors in Validation Pipelinestructuralpartial
Description

Both syntactic and semantic validation checks can be inaccurate, producing false positives or negatives. Real accuracy is typically lower than reported.

The Issue

Semantic evaluation pipelines often contain LLMs which can be unexpectedly unreliable as verifiers.

Failure Mode

False positive and false negative errors occur, inflating reported benchmark performance.

Example
InputAutomated semantic equivalence check
OutputIncorrect "Pass" verdict on semantically wrong formalization.
References
Liu et al. 2024 (ASSESS)Poiroux et al. 2025
Addressed / Partially Addressed By
LogicIncomplete or incorrect logical encoding of a statement1 error
01.Translation of Complex Logicminoropen
Description

Competition-level problems contain bidirectional logic that models simplify, losing necessary converse directions.

The Issue

Informal math contains iff constraints the LLM understands but encodes only partially.

Failure Mode

Model formalizes only one direction of an iff, producing a strictly weaker statement.

Example
InputFind c ∈ ℝᵏ, r > 0 s.t. |x−a|=2|x−b| iff |x−c|=r.
OutputFormalization encodes only the → direction; ← direction is absent.
References
Herald 2025Liu et al. 2024Guo et al. 2025
Addressed / Partially Addressed By
Data & BenchmarksGaps between benchmark distributions and real-world complexity1 error
01."The Wild" vs "The Lab"structuralopen
Description

Benchmarks like MiniF2F cover well-supported high-school problems. Real-world definitions are far more demanding and context-dependent.

The Issue

Wild definitions are intricate, context-dependent, and implicit — assuming extensive background theory.

Failure Mode

Models tuned on MiniF2F fail on real-world definitions requiring deeply nested formal structures.

Example
InputArbitrary Wikipedia mathematical definition
OutputCompilation failure or semantically incorrect formalization.
References
Zhang et al. 2025Li et al. 2025