Distilling LLM Feedback for Lean Theorem Proving | AIChainDay

Summary by Claude · June 1, 2026

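The paper proposes Feedback Distillation, a training method that addresses GRPO's sparse rewards, limited exploration, and mode collapse in the post-training of reasoning models. The model is trained to match, at the token level, its own distribution conditioned on LLM-generated privileged feedback, which also makes it possible to inject external knowledge. In Lean4 theorem-proving evaluations, the method showed higher diversity of generated trajectories, higher policy entropy, and better pass@k scaling than GRPO. Initializing GRPO from a Feedback Distillation checkpoint outperforms either method used alone, which suggests a practical way to combine the two when post-training complex reasoning models.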
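To make the sparse-reward issue concrete, here is a minimal sketch of the group-relative advantage that GRPO is commonly described as using: rewards for several samples of the same prompt are normalized by the group mean and standard deviation. The function name, the `eps` term, and the use of the population standard deviation are assumptions for illustration; the summary above does not spell out these details.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of samples drawn for the same prompt.

    Hypothetical helper: a common description of GRPO's advantage, not code
    from the paper.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption
    return [(r - mean) / (std + eps) for r in rewards]


# Every sampled proof fails verification: all advantages are zero,
# so this group contributes no learning signal.
all_failed = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

# One proof verifies: only now does the group produce a non-zero signal.
one_success = group_relative_advantages([0.0, 0.0, 0.0, 1.0])
```

With a binary verifier reward, hard theorems often produce groups like `all_failed`, which is one way to read the "sparse rewards" problem mentioned above.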

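• Feedback Distillation trains the model to match, at the token level, its own distribution conditioned on LLM-generated feedback, which addresses GRPO's sparse-reward problem.
• On Lean4 theorem proving, it was shown empirically to produce more diverse generated trajectories and better pass@k scaling than GRPO.
• Initializing GRPO from a Feedback Distillation checkpoint yields further gains over either method alone, suggesting a practical combination strategy for post-training complex reasoning models.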
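The token-level matching described above can be sketched as a per-position divergence between two next-token distributions from the same model: one computed with the feedback in context (the target) and one without it (the distribution being trained). This is only an illustration under stated assumptions: the KL direction, the averaging over positions, and all names below are hypothetical, and the toy logits stand in for real model outputs.

```python
import math


def softmax(logits):
    """Convert raw logits to a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def token_level_distillation_loss(student_logits, teacher_logits):
    """Average KL(teacher || student) over token positions.

    teacher_logits: model outputs WITH privileged feedback in context.
    student_logits: the same model's outputs WITHOUT the feedback.
    The KL direction is an assumption made for this sketch.
    """
    per_token = [
        kl_divergence(softmax(t), softmax(s))
        for s, t in zip(student_logits, teacher_logits)
    ]
    return sum(per_token) / len(per_token)


# Toy example: 2 token positions, vocabulary of 3 tokens.
student_logits = [[2.0, 0.5, 0.1], [0.3, 0.3, 0.3]]
teacher_logits = [[0.2, 2.5, 0.1], [0.3, 0.3, 3.0]]

loss = token_level_distillation_loss(student_logits, teacher_logits)
self_loss = token_level_distillation_loss(student_logits, student_logits)
```

Unlike a single pass/fail reward per proof, a loss of this shape gives a signal at every token position, which is the contrast the summary draws with GRPO.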

AI · June 1, 2026 · AI score: 92%

Distilling LLM Feedback for Lean Theorem Proving

Source: arXiv cs.AI

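✨ AI Insights

🧑‍💻 Developers

1. Proposes Feedback Distillation, a training technique that overcomes GRPO's sparse-reward and mode-collapse problems.
2. Token-level supervision preserves generation diversity and achieves higher pass@k than GRPO on Lean4 theorem proving.
3. Initializing GRPO from a Feedback Distillation checkpoint gives the best performance, ahead of either method alone.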
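For readers unfamiliar with pass@k, the sketch below uses a widely used unbiased estimator: given `n` sampled proofs of which `c` are verified, it computes the probability that at least one of `k` samples drawn without replacement is correct. Whether the paper computes pass@k exactly this way is an assumption; the function name is illustrative.

```python
import math


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k).

    n: total samples generated for a problem
    c: number of those samples that are correct
    k: sampling budget being evaluated (k <= n)
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # math.comb returns 0 when k > n - c, which gives pass@k = 1.0.
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


# 10 sampled proofs, 3 of them verified by Lean.
p1 = pass_at_k(10, 3, 1)
p5 = pass_at_k(10, 3, 5)
```

A model whose samples are more diverse tends to gain more as `k` grows, which is the sense in which "pass@k scaling" is used above.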

💡

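Why does it matter?

The approach uses self-distillation to ease sparse rewards and limited exploration, the core bottlenecks in reinforcement-learning-based training of math and code reasoning models. It could extend beyond theorem proving as a post-training strategy for complex reasoning in general.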

Article preview

arXiv:2605.30861v1 Announce Type: new Abstract: Post-training for reasoning models typically combines supervised fine-tuning with reinforcement learning from verifiable rewards, most commonly with GRPO. However, this algorithm suffers from sparse rewards, limited exploration, and mode collapse. Building upon recent works on self-distillation, we propose Feedback Distillation, a training method where the model is trained to match, at the token level, its own distribution conditioned on privilege


#LLMFineTuning #TheoremProving #ReinforcementLearning #KnowledgeDistillation #ReasoningModels
