Autoformalization Limitations

Changelog

A chronological record of how understanding in this field has evolved.

February 21, 2026Report

Initial limitation report published

Nine critical failure modes identified for neural-symbolic autoformalization on complex mathematics (Putnam-level). Categories established: Type System, Semantics, Syntax, Library, Evaluation, Logic, Data & Benchmarks.

Limitations:The Translation Paradox Errors in Validation Pipeline Translation of Complex Logic Dimensional Misalignment Definition Bottleneck Semantic Drift — Tautology Trap Library Hallucination "The Wild" vs "The Lab"Implicit Knowledge Gap

October 15, 2025Progress

Improved evaluation reliability via ASSESS and Poiroux et al.

Two concurrent works tackle the unreliable validation pipeline problem. ASSESS introduces a semantic and structural similarity framework; Poiroux et al. propose robust benchmarking methodology. Both reduce false positive rates substantially.

Papers:ASSESS: A Semantic and Structural Evaluation Frame…Reliable Evaluation and Benchmarks for Statement A…

Limitations:Errors in Validation Pipeline

October 1, 2025Progress

ATLAS partially addresses library hallucination

Liu et al. propose data lifting — reverse-engineering natural language from valid Mathlib code — to reduce hallucinated identifier errors. Coverage of advanced Mathlib concepts improves but does not eliminate the gap for concepts not yet formalized.

Papers:ATLAS: Autoformalizing Theorems through Lifting, A…

Limitations:Library Hallucination

November 1, 2024New Finding

PutnamBench exposes semantic drift at competition level

Tsoukalas et al. demonstrate that state-of-the-art models systematically replace complex geometric predicates with tautological equivalents when formalizing Putnam problems, introducing the concept of the "Tautology Trap" as a distinct failure mode.

Papers:PutnamBench: Evaluating Neural Theorem-Provers on …

Limitations:Semantic Drift — Tautology Trap