Procedural Memory Distillation: Online Reflection for Self-Improving Language Models

Preview

arXiv:2607.01480v1 Announce Type: new

Abstract: Reinforcement learning with verifiable rewards (RLVR), along with recent self-distillation variants such as SDPO, evaluates each rollout against a verifier and updates the policy from that episode-level signal. However, the richer procedural information in the rollout is rarely retained or reused. Across episodes and epochs, the model repeatedly encounters related problems under a changing policy, producing cross-episode signals that episode-local
