Evaluating LLM Robustness in Mathematical Proof Autoformalization

The Autoformalization of Mathematical Proofs and the Role of LLMs

Mathematical proof autoformalization is a promising frontier at the intersection of artificial intelligence and mathematics. Its primary goal is to translate an informal mathematical proof, written in natural language, into a formal proof that a proof assistant such as Lean 4 can check. In recent years, several research teams have built models based on Large Language Models (LLMs) to tackle this task, aiming to bridge the gap between the flexibility of human language and the rigor of formal systems.

However, existing evaluations of these models have focused primarily on well-structured informal proofs, often drawn from curated, idealized datasets. This approach is useful for demonstrating basic capabilities, but it does not show how models behave in more realistic, less polished scenarios. The question of robustness, meaning a system's ability to maintain its performance when inputs vary or contain imperfections, has remained largely unexplored.

A New Benchmark for Robustness: Global and Local Perturbations

To address this gap, a recent study presents the first in-depth analysis of the robustness of proof autoformalization models. The researchers defined two categories of perturbation to evaluate the stability and faithfulness of these systems. The first, a 'global perturbation,' paraphrases the informal proof in a different style while preserving its meaning. A robust autoformalizer should then produce a formalization that stays consistent with the original mathematical intent, regardless of the stylistic change.
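As a small illustration of a global perturbation, consider two wordings of the same elementary proof. The example and the theorem name are hypothetical and are not taken from the benchmark; the sketch assumes Lean 4 with Mathlib available.

```lean
import Mathlib

-- Informal proof, version A (textbook style):
--   "Let a and b be even. Then a = 2m and b = 2n for some m and n,
--    so a + b = 2(m + n), which is even."
--
-- Informal proof, version B (conversational paraphrase):
--   "Adding two even numbers just adds two multiples of 2,
--    and the result is again a multiple of 2."
--
-- Both versions carry the same mathematical content, so a robust
-- autoformalizer would be expected to produce an equivalent formal
-- result for each of them, for example:
theorem sum_of_evens_is_even (a b : ℕ) (ha : Even a) (hb : Even b) :
    Even (a + b) :=
  Even.add ha hb
```

If the output for version B failed to type-check, or stated something different from the output for version A, that would count as sensitivity to a global perturbation in the sense described above.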

The second category, 'local perturbation,' entails altering a specific value, symbol, or proof step, possibly in a counterfactual way. Here, robustness is demonstrated by the model's ability to faithfully reflect the modification in the formal output, rather than ignoring it, reverting to the original version, or autonomously inferring a different interpretation. To conduct this evaluation, a new benchmark was created by applying both types of perturbations to the miniF2F and MATH-500 datasets, automatically measuring the stability of correctness under global perturbations and the faithfulness of the output under local ones.
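The two measurements can be pictured with a minimal Python sketch. The metric definitions, class names, and sample verdicts below are assumptions made for illustration; the article does not spell out the study's exact formulas, so this should not be read as the benchmark's implementation. For a local perturbation, imagine changing "2x + 3 = 11, so x = 4" to "2x + 3 = 13, so x = 5": a faithful output would state the hypothesis with 13, not 11.

```python
from dataclasses import dataclass


@dataclass
class GlobalCase:
    """Verdicts for one problem: the original proof and its paraphrases."""
    original_correct: bool
    paraphrase_correct: list[bool]


@dataclass
class LocalCase:
    """Verdict for one locally perturbed proof (hypothetical judge output)."""
    reflects_edit: bool


def stability(cases: list[GlobalCase]) -> float:
    """Share of originally correct problems that stay correct on every paraphrase.

    Assumed definition, chosen for illustration only.
    """
    solved = [c for c in cases if c.original_correct]
    if not solved:
        return 0.0
    stable = sum(all(c.paraphrase_correct) for c in solved)
    return stable / len(solved)


def faithfulness(cases: list[LocalCase]) -> float:
    """Share of local perturbations whose formal output reflects the edit.

    Assumed definition, chosen for illustration only.
    """
    if not cases:
        return 0.0
    return sum(c.reflects_edit for c in cases) / len(cases)


# Made-up verdicts, not results from the study.
global_cases = [
    GlobalCase(original_correct=True, paraphrase_correct=[True, True]),
    GlobalCase(original_correct=True, paraphrase_correct=[True, False]),
    GlobalCase(original_correct=False, paraphrase_correct=[False, True]),
]
local_cases = [
    LocalCase(reflects_edit=True),
    LocalCase(reflects_edit=False),
    LocalCase(reflects_edit=False),
    LocalCase(reflects_edit=False),
]

print(f"stability:    {stability(global_cases):.2f}")    # 0.50
print(f"faithfulness: {faithfulness(local_cases):.2f}")  # 0.25
```

In this sketch, stability is computed only over problems the model solved in their original form, so that a model is not penalized twice for the same failure. That is a design choice of the illustration, and other conventions are equally plausible.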

Implications for Enterprise LLM Deployments

The study evaluated seven recent models, and its findings are significant: every model tested was sensitive to global perturbations and, in most cases, failed to remain faithful under local perturbations. These results raise important questions for organizations considering LLMs for critical tasks, especially in on-premise or air-gapped environments where data sovereignty and control are paramount.

For CTOs, DevOps leads, and infrastructure architects, the robustness of an LLM is not an academic detail but a decisive factor for operational reliability and security. A non-robust model can generate unpredictable or erroneous outputs, increasing operational risk and the Total Cost of Ownership (TCO) due to the need for intensive human oversight and additional validation processes. In contexts where precision is non-negotiable, such as formalizing smart contracts or verifying complex algorithms, a lack of robustness can compromise system integrity and trust in AI-driven tools. For those evaluating on-premise deployments, AI-RADAR offers analytical frameworks on /llm-onpremise to assess these trade-offs, emphasizing the importance of models that are not only performant but also reliable.

Future Prospects and Development of More Resilient Models

The results of this research highlight the need to develop more robust LLMs for proof autoformalization and, by extension, for other critical applications. The sensitivity to stylistic variations and the inability to faithfully reflect minor input changes suggest that current models may not yet be ready for widespread adoption in high-stakes scenarios without significant improvements. The availability of the benchmark code and data on GitHub (https://github.com/ucr-rai/robust-proof-autoformalization) provides the research community with a valuable tool to replicate experiments, explore new model architectures, and refine training techniques.

The path toward truly robust LLMs will likely require greater attention to training data diversity, the incorporation of more explicit reasoning mechanisms, and the development of fine-tuning techniques that prioritize consistency and faithfulness. Only through continuous commitment to research and development will it be possible to realize the full potential of LLM-based autoformalization, making it a reliable resource for the mathematical community and the most demanding enterprise applications.