HARC: Coupling Harmfulness and Refusal Directions for Robust Safety Alignment

Preview

arXiv:2607.00572v1 Announce Type: new

Abstract: Understanding how aligned LLMs internally represent safety is critical for diagnosing alignment vulnerabilities, as it explains why jailbreaks succeed and informs the design of robust alignment strategies. Prior work shows that aligned LLMs encode harmfulness and refusal as separable directions in the residual stream at prompt-side token positions. We show that jailbreaks succeed at prompt encoding by suppressing either the refusal or harmfulness

Related posts

Constructive Alignment: Governing Preference Dynamics in Human-AI Interaction

Bounded Morality: Defining the Space of Moral Computation

The MMM Data Model -- A Normative Specification for Knowledge Interoperability in a Decentralisable Knowledge Commons

A Contextual-Bandit Oversight Game with Two-Sided Informational Asymmetry