Papers
6 papers tracked · sorted by recency
ASSESS: A Semantic and Structural Evaluation Framework for Statement Similarity
Xiaoyang Liu, Tao Zhu, Zineng Dong, et al.
Proposes a more reliable evaluation framework for autoformalization, addressing the problem of LLM-based semantic checkers producing false positives and negatives.
ATLAS: Autoformalizing Theorems through Lifting, Augmentation, and Synthesis of Data
Xiaoyang Liu, Kangjie Bao, Jiashuo Zhang, et al.
Addresses library hallucination by reverse-engineering natural language from valid Mathlib code (data lifting). Proves that previous approaches suffered critically from hallucinated library definitions. Proposes augmentation and synthesis to fill Mathlib coverage gaps.
Reliable Evaluation and Benchmarks for Statement Autoformalization
Auguste Poiroux, Gail Weiss, Viktor Kunčak, Antoine Bosselut
Systematically analyzes the reliability of existing evaluation pipelines for autoformalization, showing both false positive and false negative rates are significant. Proposes more robust benchmarking methodology.
Herald: A Natural Language Annotated Lean 4 Dataset
Guoxiong Gao, Yutong Wang, Jiedong Jiang, et al.
Introduces a large Lean 4 dataset with natural language annotations. Categorizes formalization errors, identifying dimensional misalignment as a major error class and complex logic translation as a minor error class.
PutnamBench: Evaluating Neural Theorem-Provers on the Putnam Mathematical Competition
George Tsoukalas, Jasper Lee, John Jennings, et al.
Introduces PutnamBench, a benchmark of competition-level problems. Finds that models frequently simplify complex geometric predicates into tautological statements to maximize generation probability, exposing the semantic drift problem.
Draft, Sketch, and Prove: Guiding Formal Theorem Provers with Informal Proofs
Albert Q. Jiang, Sean Welleck, Jin Peng Zhou, et al.
Proposes separating high-level informal reasoning from low-level formal syntax via a three-stage pipeline: draft an informal proof, sketch a formal skeleton, then prove each step. Acknowledges that direct translation produces invalid let bindings in theorem signatures.