MA-ProofBench: A Two-Tiered Evaluation of LLMs for Theorem Proving in Mathematical Analysis | AIChainDay

🇰🇷 한국어 요약by Claude · 2026. 6. 15.

MA-ProofBench는 수리해석학(Mathematical Analysis) 전용으로는 최초로 알려진 형식 정리증명 벤치마크로, 측도·적분론, 복소해석, 함수해석 등 6개 핵심 주제와 27개 하위범주를 아우르는 형식화된 정리 200개를 담는다. 문제는 학부 수준(레벨 I, 100개)과 박사 자격시험 수준(레벨 II, 100개)으로 나뉘며, 각 문제는 사람 주도·LLM 보조 형식화 파이프라인과 독립 전문가 검토를 거쳐 원래 수학에 충실하도록 구성됐다. 최신 추론 모델과 형식 증명기를 평가한 결과 대부분 저조했는데, 최고 성능 GPT-5.5조차 레벨 I에서 Pass@8 16%, 레벨 II에서 5%에 그쳤고 대부분 모델은 레벨 II에서 0%에 가까웠다. 주된 실패 원인은 Mathlib 환각과 미완성 증명이었으며, 고난도 영역의 형식 수학 추론 진전을 추적할 신뢰 기준으로 제시된다.

•수리해석학 전용 최초의 형식 정리증명 벤치마크, 6개 주제·27개 하위범주의 형식화 정리 200개
•학부(레벨 I)와 박사 자격시험(레벨 II) 두 난이도로 구성, 사람 주도·전문가 검토 파이프라인
•최고 성능 GPT-5.5도 레벨 I Pass@8 16%, 레벨 II 5%에 그쳐 대부분 모델이 고난이도에서 사실상 실패
•주요 실패 원인은 Mathlib 환각과 미완성 증명
•자연어 버전 평가에서 비형식 추론과 형식 추론 사이의 명확한 격차 드러남

AI2026년 6월 15일AI 점수: 91%

MA-ProofBench: A Two-Tiered Evaluation of LLMs for Theorem Proving in Mathematical Analysis

출처:arXiv cs.AI

본문 미리보기

arXiv:2606.13782v1 Announce Type: new Abstract: Large Language Models (LLMs) have made notable progress in automated theorem proving, yet existing formal benchmarks remain limited in both mathematical coverage and difficulty. Most are concentrated in areas that are easier to formalize, such as algebra and elementary number theory, and provide limited coverage of subfields that require deeper reasoning, including mathematical analysis. To address this gap, we introduce MA-ProofBench, to the best

전체 내용이 궁금하다면?

원문을 직접 읽어보세요

원문 보기

#LLM#정리 증명#벤치마크#수학 해석

3시간 전

When Sample Selection Bias Precipitates Model Collapse

arXiv:2606. 13732v1 Announce Type: new Abstract: The proliferation of recursive training on synthetic data can alleviate data scarcity but risks model collapse, where repeated training erodes distributional tails and homogenizes outputs. Data selection is widely viewed as a remedy, yet its reliabili

#모델 붕괴#합성 데이터#데이터 선택

📰미디어arXiv cs.AI

원문

MA-ProofBench: A Two-Tiered Evaluation of LLMs for Theorem Proving in Mathematical Analysis

본문 미리보기

관련 글

When Sample Selection Bias Precipitates Model Collapse

Hyperdimensional computing for structured querying on tabular data embeddings

AI Receptivity or AI Adoption Breadth? A Tool-Specific Reanalysis of the Lower-Literacy/Higher-Usage Link

Minim: Privacy-Aware Minimal View for Agents via Trusted Local Sanitization