Autoformalization Limitations

AbstractRecent advances in LLMs demonstrate high proficiency in informal mathematical reasoning and code generation. However, end-to-end autoformalization — translating natural language mathematics into formal verifiers like Lean 4 — remains brittle. This blog tracks structural and semantic bottlenecks, updated as new papers emerge.

Total

Open

Partial

Resolved

⚙️Type SystemErrors from type inference, coercions, and missing type declarations1 error

01.The Translation Paradoxmajoropen▾

Description

Informal mathematics relies on high-context ambiguity; formal mathematics requires zero-ambiguity type precision. Type inference fails because natural language omits declarations.

The Issue

Human writers omit type declarations — "Let n be a number" implies n ∈ ℕ or n ∈ ℝ depending on context.

Failure Mode

Model infers types stochastically, guessing Float for reals or Int for naturals, causing cascading type mismatch errors.

Example

Input"Let x be a number…"

Output(x : Float) … → Compiler Error: expected Set ℝ, got Float.

References

Tarrach et al. 2024Wu et al. 2025

∿SemanticsFailures in preserving the mathematical meaning of a statement3 errors

01.Dimensional Misalignmentmajoropen▾

Description

Models correctly identify algebraic structures but fail to capture dimensionality, producing valid code that formalizes a different problem.

The Issue

Notation ℚ[x,y] vs ℚ[x] provides minimal visual distinction but a critical semantic difference.

Failure Mode

Model defaults to univariate Polynomial instead of MvPolynomial (Fin 2), treating Y as a constant coefficient.

Example

Input"Prove that x²+y²−1 is irreducible in ℚ[x,y]."

OutputIrreducible (X²+Y²−1 : Polynomial ℚ) := sorry

References

Herald 2025Lu et al. 2025

Addressed / Partially Addressed By

Herald: A Natural Language Annotated Lean 4 Dataset (2025)

02.Semantic Drift — Tautology Trapmajoropen▾

Description

Models optimize for token likelihood, generating valid code that is mathematically trivial compared to the intended statement.

The Issue

Complex geometric predicates are statistically rare; simple tautologies are statistically common in training data.

Failure Mode

Model rewrites "Show four points form a convex quadrilateral" as "The convex hull of four points is convex" — a tautology.

Example

Input"Show that four points form a convex quadrilateral."

Output"Show that the convex hull of four points is convex." (tautology)

References

PutnamBench 2024Liu et al. 2024

03.Implicit Knowledge Gapstructuralopen▾

Description

Human textbooks omit steps "obvious" to readers that formal compilers strictly require, causing missing type class instance errors.

The Issue

Proofs say "it is trivial to see that X is a group" — the implicit prerequisite is absent from the source text.

Failure Mode

Lean 4 rejects code because a type class instance is missing from the dependency graph.

Example

Input"It is trivial to see that X is a group…"

Outputfailed to synthesize instance Group X

References

Tarrach et al. 2024Wu et al. 2025

{}SyntaxStructural incompatibility with Lean 4 syntax rules1 error

01.Definition Bottleneckstructuralpartial▾

Description

Fundamental conflict between the imperative flow of natural language and the declarative structure of dependent type theory.

The Issue

Informal math defines objects linearly. Formal systems require dependency injection or hypothesis lifting.

Failure Mode

Models insert let bindings into theorem signatures, syntactically illegal in Lean 4.

Example

Input"First let x = 5, now consider f(x)…"

OutputSyntax error: expected ';' or unexpected token 'let'.

References

Jiang et al. 2023Li et al. 2025

Addressed / Partially Addressed By

Draft, Sketch, and Prove: Guiding Formal Theorem Provers with Informal Proofs (2023)

📚LibraryHallucinated or missing Mathlib definitions and identifiers1 error

01.Library Hallucinationmajorpartial▾

Description

Models hallucinate Mathlib predicates that don't exist, causing identifier errors despite correct mathematical reasoning.

The Issue

Lean's Mathlib is volatile and lacks formalized versions of many advanced concepts.

Failure Mode

Model invents predicates like IsIcosahedronSet, UniformDistrib, IsNormal — all absent from Mathlib.

Example

Input"Let G be a normal subgroup…"

Outputunknown identifier 'IsNormal'

References

ATLAS 2025Herald 2025Wang et al. 2025

Addressed / Partially Addressed By

ATLAS: Autoformalizing Theorems through Lifting, Augmentation, and Synthesis of Data (2025)

✓EvaluationProblems in the verification and scoring pipeline itself1 error

01.Errors in Validation Pipelinestructuralpartial▾

Description

Both syntactic and semantic validation checks can be inaccurate, producing false positives or negatives. Real accuracy is typically lower than reported.

The Issue

Semantic evaluation pipelines often contain LLMs which can be unexpectedly unreliable as verifiers.

Failure Mode

False positive and false negative errors occur, inflating reported benchmark performance.

Example

InputAutomated semantic equivalence check

OutputIncorrect "Pass" verdict on semantically wrong formalization.

References

Liu et al. 2024 (ASSESS)Poiroux et al. 2025

Addressed / Partially Addressed By

ASSESS: A Semantic and Structural Evaluation Framework for Statement Similarity (2026)Reliable Evaluation and Benchmarks for Statement Autoformalization (2025)

∴LogicIncomplete or incorrect logical encoding of a statement1 error

01.Translation of Complex Logicminoropen▾

Description

Competition-level problems contain bidirectional logic that models simplify, losing necessary converse directions.

The Issue

Informal math contains iff constraints the LLM understands but encodes only partially.

Failure Mode

Model formalizes only one direction of an iff, producing a strictly weaker statement.

Example

InputFind c ∈ ℝᵏ, r > 0 s.t. |x−a|=2|x−b| iff |x−c|=r.

OutputFormalization encodes only the → direction; ← direction is absent.

References

Herald 2025Liu et al. 2024Guo et al. 2025

Addressed / Partially Addressed By

Herald: A Natural Language Annotated Lean 4 Dataset (2025)

⊞Data & BenchmarksGaps between benchmark distributions and real-world complexity1 error

01."The Wild" vs "The Lab"structuralopen▾

Description

Benchmarks like MiniF2F cover well-supported high-school problems. Real-world definitions are far more demanding and context-dependent.

The Issue

Wild definitions are intricate, context-dependent, and implicit — assuming extensive background theory.

Failure Mode

Models tuned on MiniF2F fail on real-world definitions requiring deeply nested formal structures.

Example

InputArbitrary Wikipedia mathematical definition

OutputCompilation failure or semantically incorrect formalization.

References

Zhang et al. 2025Li et al. 2025