Beyond Trajectory Imitation: Strategy-Guided Policy Optimization for LLM Reasoning | AIChainDay

Summary by Claude (translated from Korean) · June 24, 2026

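SGPO (Strategy-Guided Policy Optimization) is a framework that, when distilling reasoning ability from a strong model to a weak one, distills reusable "strategies" instead of imitating specific solution trajectories. Conventional trajectory-level imitation leads the model to memorize instance-specific steps, which limits generalization to new problems. SGPO extracts structured strategy descriptions from the strong model's responses and, for each problem, generates both autonomous and strategy-guided trajectories, so that behavior with and without the strategy can be compared directly. For "how to distill," a token-level forward-KL objective selectively transfers only the distribution shift induced by strategy conditioning. For "when to distill," an adaptive weight strengthens guidance when autonomous exploration falls short and reduces it as the model's capability grows. Across two model families and four math benchmarks, SGPO consistently outperformed SFT, on-policy RL, and hybrid baselines, improving on the strongest baseline by an average of 2.2 points on Qwen2.5-7B-Instruct.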

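• Improves generalization by distilling reusable strategies instead of imitating specific trajectories
• Isolates the effect of a strategy by comparing autonomous and strategy-guided trajectories
• Selectively transfers only the distribution shift induced by strategy conditioning, via token-level forward KL
• Adjusts guidance strength to the model's capability with adaptive instance weights
• Improves on the strongest baseline by an average of 2.2 points across four math benchmarks (Qwen2.5-7B-Instruct)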
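
To make the two ingredients in the summary concrete, here is a minimal Python sketch of a token-level forward-KL term and an adaptive instance weight. It is an illustration under stated assumptions, not the paper's implementation: the KL direction (strategy-guided distribution as the target, unguided distribution as the student), the position-by-position alignment of the two logit sequences, and the linear `adaptive_weight` schedule are all hypothetical choices, since the preview does not give the exact formulas.

```python
import math


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def forward_kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def token_level_forward_kl(guided_logits, free_logits):
    """Mean per-token KL(guided || free) over aligned positions.

    Assumption: both sequences of logits are scored on the same tokens,
    once with the strategy in context (guided) and once without (free).
    """
    kls = [
        forward_kl(softmax(g), softmax(f))
        for g, f in zip(guided_logits, free_logits)
    ]
    return sum(kls) / len(kls)


def adaptive_weight(autonomous_success_rate, max_weight=1.0):
    """Hypothetical schedule: guide more when unguided rollouts rarely succeed."""
    if not 0.0 <= autonomous_success_rate <= 1.0:
        raise ValueError("success rate must be in [0, 1]")
    return max_weight * (1.0 - autonomous_success_rate)


# Toy example: two token positions over a three-word vocabulary.
guided = [[2.0, 0.5, 0.1], [0.3, 1.7, 0.2]]
free = [[1.0, 1.0, 0.1], [0.3, 1.0, 0.9]]
loss = adaptive_weight(0.25) * token_level_forward_kl(guided, free)
```

In this sketch the weight shrinks toward zero as the unguided success rate approaches one, which mirrors the summary's description of reducing guidance as the model becomes more capable.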

AI · June 24, 2026

Source: arXiv cs.AI

Article preview

arXiv:2606.24064v1 Announce Type: new Abstract: Distilling reasoning capabilities from strong to weak language models typically involves imitating specific solution trajectories, effectively transferring what to answer rather than how to reason. This trajectory-level imitation encourages memorization of instance-specific steps rather than acquisition of transferable problem-solving skills, limiting generalization to novel problems. We propose Strategy-Guided Policy Optimization (SGPO), which re

Related posts

Critique of Agent Model

Can Language Model Agents be Helpful Circuit Explainers in Mechanistic Interpretability?

Exploring Academic Influence of Algorithms by Co-occurrence Network Based on Full-text of Academic Papers

An Introduction to Causal Reinforcement Learning