MindGames Arena Generalization Track: In2AI Solution with Delayed Per-Step Reward Attribution | AIChainDay

🇰🇷 Korean summary by Claude · June 2, 2026

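The core difficulty in training language models for multi-agent strategy games is that the quality of an action depends on future events, rule violations, and other players' decisions. This work introduces "delayed per-step reward attribution with eligibility gating", a method that propagates rewards backward once an episode ends and excludes from training any step that lacks valid dependent information. At the NeurIPS 2025 MindGames Arena, an 8B-parameter open-source model outperformed much larger proprietary models, including GPT-5, and took first place in both the open track and the efficiency (≤8B) track.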

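• Delayed per-step reward attribution, which propagates rewards backward at episode end and excludes steps without valid dependent information, addresses the RL credit assignment problem in multi-agent environments.
• Asynchronous rollout generation with vLLM continuous batching, curriculum-based opponent sampling, and multi-stage stratified batch construction yield stable, sample-efficient RL training.
• An 8B-parameter open-source model outperformed large proprietary models, including GPT-5, showing that methodological innovation can substitute for model size.
• The solution ranked first in both the open track and the efficiency (≤8B) track of the NeurIPS 2025 MindGames Arena.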
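The idea of assigning rewards only after the episode ends and gating out ineligible steps can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Step` class, the `eligible` flag, the `discount` parameter, and the choice to spread a single terminal reward backward are all assumptions made for illustration, since the preview does not give the actual attribution rule.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Step:
    action: str
    # Hypothetical flag: True when valid dependent information exists
    # for this step (for example, the move was legal and its outcome
    # was actually observed later in the episode).
    eligible: bool


def attribute_rewards(
    steps: List[Step], final_reward: float, discount: float = 1.0
) -> List[Optional[float]]:
    """Assign per-step rewards only after the episode has ended.

    Ineligible steps receive None so that they can be dropped from
    training. The discounting scheme is an assumption of this sketch.
    """
    n = len(steps)
    rewards: List[Optional[float]] = []
    for t, step in enumerate(steps):
        if not step.eligible:
            rewards.append(None)  # gated out: no training signal
        else:
            # Steps closer to the end of the episode get more credit.
            rewards.append(final_reward * discount ** (n - 1 - t))
    return rewards


def training_samples(
    steps: List[Step], rewards: List[Optional[float]]
) -> List[Tuple[str, float]]:
    """Keep only the steps that passed the eligibility gate."""
    return [(s.action, r) for s, r in zip(steps, rewards) if r is not None]


episode = [Step("bid", True), Step("illegal_move", False), Step("accept", True)]
rewards = attribute_rewards(episode, final_reward=1.0, discount=0.5)
samples = training_samples(episode, rewards)
```

In this toy episode the gated step contributes nothing to the training batch, while the two eligible steps receive a share of the terminal reward. A real system would likely define eligibility and credit per game rather than with a single flag and discount.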

AI · June 2, 2026


Source: arXiv cs.AI

Article preview

arXiv:2606.00017v1 Announce Type: new Abstract: Training language model agents for multi-agent strategic interaction presents a core difficulty: the quality of any action may depend on future events that never materialize, on moves that violate game rules, or on decisions made by other players. Standard reinforcement learning assumes that rewards can be assigned at each step, but this assumption fails in settings where outcomes are entangled across time and agents. We introduce delayed per-step


