Autoformalization Limitations

Papers

6 papers tracked · sorted by recency

Feb 2026·Preprint

ASSESS: A Semantic and Structural Evaluation Framework for Statement Similarity

Xiaoyang Liu, Tao Zhu, Zineng Dong, et al.

Proposes a more reliable evaluation framework for autoformalization, addressing the problem of LLM-based semantic checkers producing false positives and negatives.

evaluationsemantic-similaritybenchmarking

Addresses

partialErrors in Validation Pipeline

Oct 2025·Preprint

ATLAS: Autoformalizing Theorems through Lifting, Augmentation, and Synthesis of Data

Xiaoyang Liu, Kangjie Bao, Jiashuo Zhang, et al.

Addresses library hallucination by reverse-engineering natural language from valid Mathlib code (data lifting). Proves that previous approaches suffered critically from hallucinated library definitions. Proposes augmentation and synthesis to fill Mathlib coverage gaps.

data-augmentationmathlibhallucination

Addresses

partialLibrary Hallucination

Oct 2025·Preprint

Reliable Evaluation and Benchmarks for Statement Autoformalization

Auguste Poiroux, Gail Weiss, Viktor Kunčak, Antoine Bosselut

Systematically analyzes the reliability of existing evaluation pipelines for autoformalization, showing both false positive and false negative rates are significant. Proposes more robust benchmarking methodology.

evaluationreliabilitybenchmarking

Addresses

partialErrors in Validation Pipeline

Feb 2025·Preprint

Herald: A Natural Language Annotated Lean 4 Dataset

Guoxiong Gao, Yutong Wang, Jiedong Jiang, et al.

Introduces a large Lean 4 dataset with natural language annotations. Categorizes formalization errors, identifying dimensional misalignment as a major error class and complex logic translation as a minor error class.

datasetlean4error-taxonomy

Addresses

partialTranslation of Complex Logic partialDimensional Misalignment

Nov 2024·NeurIPS 2024

PutnamBench: Evaluating Neural Theorem-Provers on the Putnam Mathematical Competition

George Tsoukalas, Jasper Lee, John Jennings, et al.

Introduces PutnamBench, a benchmark of competition-level problems. Finds that models frequently simplify complex geometric predicates into tautological statements to maximize generation probability, exposing the semantic drift problem.

benchmarkcompetition-mathevaluation

Introduced / Exposed

newSemantic Drift — Tautology Trap

Feb 2023·ICLR 2023

Draft, Sketch, and Prove: Guiding Formal Theorem Provers with Informal Proofs

Albert Q. Jiang, Sean Welleck, Jin Peng Zhou, et al.

Proposes separating high-level informal reasoning from low-level formal syntax via a three-stage pipeline: draft an informal proof, sketch a formal skeleton, then prove each step. Acknowledges that direct translation produces invalid let bindings in theorem signatures.

pipelineinformal-to-formallean4

Addresses

partialDefinition Bottleneck