Cross-Entropy Games and Frost Training | AIChainDay

Summary by Claude · May 28, 2026

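Frost Training is a training method that improves Monte Carlo-based policy optimization on Cross-Entropy Games, a family of LLM-as-a-judge tasks, by exploiting the gradient of the reward function in embedding space. This signal was previously used in the Greedy Coordinate Gradient (GCG) jailbreaking technique; the paper repurposes it for the first time to boost model training. In GRPO training on maximum-likelihood infilling, it raised the best-of-k top score and sped up convergence. The approach is creative in that it turns a signal from an attack technique toward improving model performance.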

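• The first use of the embedding-space reward gradient in model training, a signal derived from the GCG jailbreaking technique.
• In GRPO-based maximum-likelihood infilling training, it improved both the best-of-k top score and convergence speed.
• Introduces Cross-Entropy Games, a family of LLM-as-a-judge tasks, and validates Frost Training on them.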

AI · May 28, 2026 · AI score: 92%

Cross-Entropy Games and Frost Training

Source: arXiv cs.AI

✨ AI Insights

🧑‍💻 Developers

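1. Frost Training is a method for improving Monte Carlo policy optimization on LLM-as-a-judge tasks (Cross-Entropy Games).
2. It is the first to use the embedding-space gradient of the reward function in training, an idea taken from the GCG jailbreaking technique.
3. Combined with GRPO training, it reaches a higher maximum score in the best-of-k setting and also trains faster.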
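
For readers unfamiliar with the two terms in the last point, here is a minimal sketch of GRPO's group-relative advantage and of a best-of-k score. The function names and the judge scores are hypothetical illustrations; nothing here is taken from the paper's code.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of completions sampled for the same prompt."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def best_of_k(rewards):
    """Best-of-k score: the highest judge reward among k sampled completions."""
    return float(np.max(rewards))

# Hypothetical judge scores for k = 4 completions of a single prompt.
scores = [0.2, 0.5, 0.9, 0.4]
adv = group_relative_advantages(scores)  # positive for above-average samples
top = best_of_k(scores)                  # score of the single best sample
```

GRPO weights each sampled completion by its advantage relative to the group, so no separate value network is needed; best-of-k only looks at the top sample of the group.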

💡

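Why does it matter?

Reusing the gradient signal of a jailbreak attack technique (GCG) for policy optimization is a novel idea, and it is a practical approach that can be applied directly to make LLM-as-a-judge training more efficient.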

🏷️ Mentioned projects

Frost Training, GRPO

Article preview

arXiv:2605.27701v1 Announce Type: new Abstract: We present Frost Training, a method for improving Monte Carlo-based policy optimization for a large family of LLM-as-a-judge tasks called Cross-Entropy Games. The key idea is to exploit the gradient of the reward function in embedding space. This signal is used in the Greedy Coordinate Gradient (GCG) jailbreaking technique; we demonstrate for the first time that it can also be used to boost model training. We validate our method using GRPO trainin
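
The GCG-style signal mentioned in the abstract can be illustrated with a toy example. The sketch below is not the Frost Training procedure (the preview does not describe it); it only shows how a reward gradient in embedding space can rank token substitutions to first order, in the spirit of GCG. The reward function, embedding table, and all names are assumptions made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim, seq_len = 50, 8, 5
embedding = rng.normal(size=(vocab_size, dim))  # toy embedding table (assumption)
target = rng.normal(size=dim)                   # toy stand-in for a judge's preference

def reward(tokens):
    """Toy differentiable reward: negative squared distance of the mean embedding to target."""
    mean = embedding[tokens].mean(axis=0)
    return -0.5 * float(np.sum((mean - target) ** 2))

def reward_grad_wrt_embeddings(tokens):
    """Analytic gradient of the toy reward w.r.t. each position's embedding, shape (seq_len, dim)."""
    mean = embedding[tokens].mean(axis=0)
    grad_one = -(mean - target) / len(tokens)
    return np.tile(grad_one, (len(tokens), 1))

def gcg_style_step(tokens, position, top_k=8):
    """Rank swaps at one position by first-order reward gain, then check candidates exactly."""
    grad = reward_grad_wrt_embeddings(tokens)[position]
    # First-order estimate of the reward change when swapping in each vocabulary token.
    gain = (embedding - embedding[tokens[position]]) @ grad
    candidates = np.argsort(-gain)[:top_k]
    best_tokens, best_reward = tokens.copy(), reward(tokens)
    for cand in candidates:
        trial = tokens.copy()
        trial[position] = cand
        r = reward(trial)
        if r > best_reward:  # keep a swap only if the true reward improves
            best_tokens, best_reward = trial, r
    return best_tokens, best_reward

tokens = rng.integers(0, vocab_size, size=seq_len)
new_tokens, new_reward = gcg_style_step(tokens, position=2)
```

The gradient is only a local, linear guide over a discrete vocabulary, which is why the candidates it proposes are re-scored with the true reward before one is accepted.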


#LLM training optimization #cross-entropy #policy optimization #GRPO training #model fine-tuning


