Detecting and Controlling Sycophancy with Cascading Linear Features

Preview

arXiv:2606.26155v1 (announce type: new)

Abstract: Interpreting and controlling model behaviors through activation steering methods requires many pairs of contrastive samples that clearly exhibit desired or undesired behavior. These data pairs determine the degree to which interpretability frameworks can reliably detect model features responsible for a behavior, and therefore the ability to steer models toward or away from such behavior. In this work, we present an iterative data generation pipeli
