Changelog
A chronological record of how understanding in this field has evolved.
Initial limitation report published
Nine critical failure modes identified for neural-symbolic autoformalization on complex mathematics (Putnam-level). Categories established: Type System, Semantics, Syntax, Library, Evaluation, Logic, Data & Benchmarks.
Improved evaluation reliability via ASSESS and Poiroux et al.
Two concurrent works tackle the unreliable validation pipeline problem. ASSESS introduces a semantic and structural similarity framework; Poiroux et al. propose robust benchmarking methodology. Both reduce false positive rates substantially.
ATLAS partially addresses library hallucination
Liu et al. propose data lifting — reverse-engineering natural language from valid Mathlib code — to reduce hallucinated identifier errors. Coverage of advanced Mathlib concepts improves but does not eliminate the gap for concepts not yet formalized.
PutnamBench exposes semantic drift at competition level
Tsoukalas et al. demonstrate that state-of-the-art models systematically replace complex geometric predicates with tautological equivalents when formalizing Putnam problems, introducing the concept of the "Tautology Trap" as a distinct failure mode.