Refusal Lives Downstream of Persona in Chat Models

Article preview

arXiv:2606.26161v1

Abstract: Linear directions in activation space have been identified for both refusal and persona traits in instruction-tuned chat models, but the two have been studied as separate mechanisms. We show they interact: a compliant persona gates refusal. In Qwen2.5-7B-Instruct and Llama-3.1-8B-Instruct, we extract a compliant model-persona direction and a refusal direction and intervene on both. Compliant persona steering suppresses refusal -- in Llama, the ref
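The preview does not say how the two directions are extracted or how the interventions are applied. As a rough illustration only, the sketch below assumes a difference-of-means direction and two common interventions on a single activation vector: additive steering and projection ablation. Every name here (`extract_direction`, `steer`, `ablate`, `toy_acts`) is hypothetical, and the activations are synthetic toy vectors rather than hidden states from Qwen2.5-7B-Instruct or Llama-3.1-8B-Instruct. It shows only the mechanics of extracting and intervening on a direction, not the persona-refusal interaction the abstract reports.

```python
import random


def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def unit(v):
    """Scale v to unit Euclidean length."""
    norm = dot(v, v) ** 0.5
    return [x / norm for x in v]


def extract_direction(pos_acts, neg_acts):
    """Difference-of-means direction (an assumed method), unit-normalised."""
    mp, mn = mean_vector(pos_acts), mean_vector(neg_acts)
    return unit([p - n for p, n in zip(mp, mn)])


def steer(h, direction, alpha):
    """Additive steering: move h by alpha along the direction."""
    return [x + alpha * d for x, d in zip(h, direction)]


def ablate(h, direction):
    """Projection ablation: remove h's component along the unit direction."""
    proj = dot(h, direction)
    return [x - proj * d for x, d in zip(h, direction)]


# Synthetic stand-ins for residual-stream activations (assumption: dim 8).
rng = random.Random(0)
DIM = 8


def toy_acts(shift_axis=None, n=50):
    """Gaussian noise vectors, optionally shifted by 1.0 along one axis."""
    acts = []
    for _ in range(n):
        v = [rng.gauss(0.0, 0.1) for _ in range(DIM)]
        if shift_axis is not None:
            v[shift_axis] += 1.0
        acts.append(v)
    return acts


neutral = toy_acts()
refusal_dir = extract_direction(toy_acts(shift_axis=0), neutral)
persona_dir = extract_direction(toy_acts(shift_axis=1), neutral)

h = toy_acts(shift_axis=0, n=1)[0]          # one "refusing" activation
h_ablated = ablate(h, refusal_dir)          # refusal component removed
h_steered = steer(h, persona_dir, alpha=2.0)  # pushed toward the persona
```

In a real model these functions would typically run inside a forward hook on a chosen layer; the layer, the token positions, and the steering strength `alpha` are all choices this sketch leaves open.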
