Beyond Compilation: Evaluating Faithful Natural-Language-to-Lean Statement Formalization

출처:arXiv cs.AI

✨ AI 인사이트

🧑‍💻 개발자

1.자연어-Lean 형식화에서 컴파일 통과가 아닌 '충실한 명제 생성'을 평가하는 프로토콜 제안
2.400개 대학원 수준 벤치마크에서 도구 증강 에이전트가 컴파일 89.5% vs 충실도 60.5%로 29점 격차 노출
3.합의 긍정 출력의 96.0%가 인간 확인 충실, 컴파일 통과지만 합의 부정인 출력의 82.4%는 의미 오류
4.2^3 요인 설계로 엘라보레이션 피드백이 최대 유효성 개입임을 분해 분석

💡

왜 중요한가?

정리증명 벤치마크가 고정된 형식 명제에 의존하던 것과 달리, 형식화 자체의 의미 충실도를 컴파일률과 분리해 측정해야 함을 드러낸다. 컴파일은 통과해도 가설 누락·영역 변경·공허한 명제가 가능하다는 점에서, 형식화 파이프라인 평가 기준을 재정립한다.

🏷️ 언급 프로젝트

Lean Mathlib

본문 미리보기

arXiv:2606.31002v1 Announce Type: new Abstract: Theorem-proving benchmarks evaluate proof search against fixed formal statements, but natural-language-to-Lean formalization must generate the formal statement itself. In this setting, compilation is only a validity check: a Lean declaration may type-check while omitting hypotheses, changing domains, or expressing a vacuous claim. We study faithful statement formalization as both an evaluation problem and a bottleneck-attribution problem. On a 400

전체 내용이 궁금하다면?

원문을 직접 읽어보세요

원문 보기

#자연어 처리#형식화#정리 증명#AI 평가#언어 모델

Beyond Compilation: Evaluating Faithful Natural-Language-to-Lean Statement Formalization

본문 미리보기

관련 글

When Regulation Has Memory: Hysteresis and Control Burden in Artificial Agency

DDIAgents: Mechanism-Conditioned Context Flow for Drug-Drug Interaction Prediction

A Three-Phase Foundation Model for Tax-Aware Personalized Portfolio Management

AgentBound: Verifiable Behavioral Governance for Autonomous AI Agents