Refusal Beyond a Single Direction: A Preliminary Comparison of Diff-in-Means and INLP | AIChainDay

🇰🇷 Korean summary by Claude · June 15, 2026

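This study starts from the finding of Arditi et al. (2024) that refusal in safety fine-tuned chat models is mediated by a single linear direction in the residual stream. It compares difference-in-means (DiM) interventions with Iterative Nullspace Projection (INLP) interventions on five open-weight models. INLP's counterfactual flipping is competitive with DiM directional ablation at suppressing refusal, whereas nullspace projection is consistently weaker. Restricting INLP to the leading directions of the extracted subspace preserves most of the suppression effect while keeping perplexity near baseline, which makes it a tunable technique. Geometrically, nullspace projection collapses activations to a region between the harmful and harmless clusters, while counterfactual flipping moves them into the opposite cluster, suggesting that the model encodes the 'absence' of a concept differently from its 'opposite'.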

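• Compares DiM-based and INLP-based interventions for steering refusal on five open-weight models
• INLP counterfactual flipping matches DiM directional ablation at suppressing refusal; nullspace projection is consistently weaker
• Restricting INLP to the leading directions preserves most of the suppression effect at near-baseline perplexity, giving a tunable technique
• Nullspace projection moves activations between the harmful and harmless clusters; counterfactual flipping moves them into the opposite cluster
• Suggests that the model encodes the 'absence' and the 'opposite' of a concept differently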
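The geometric contrast in the last two points can be illustrated with a toy NumPy sketch. Everything here is an assumption made for illustration, not the paper's implementation: the synthetic "harmful" and "harmless" clusters, the least-squares probe standing in for the INLP classifier, the choice of the midpoint between class means as the center, and the reading of counterfactual flipping as a mirror image across the removed subspace. The helper names (`inlp_directions`, `nullspace_project`, `counterfactual_flip`) are hypothetical.

```python
import numpy as np

def inlp_directions(X, y, n_iters=1):
    """Toy INLP: fit a linear probe, record its direction, project it out, repeat."""
    Xc = X - X.mean(axis=0)
    t = np.where(y == 1, 1.0, -1.0)          # +1 = harmful, -1 = harmless
    dirs = []
    for _ in range(n_iters):
        # Least-squares probe (an assumption; any linear classifier could be used).
        w, *_ = np.linalg.lstsq(Xc, t, rcond=None)
        for d in dirs:                        # keep the directions mutually orthogonal
            w = w - (w @ d) * d
        w = w / np.linalg.norm(w)
        dirs.append(w)
        Xc = Xc - np.outer(Xc @ w, w)         # remove the direction the probe used
    return np.stack(dirs)                     # shape (n_iters, dim), orthonormal rows

def nullspace_project(X, dirs, center):
    """Remove the component along the extracted directions (concept 'absent')."""
    Z = X - center
    return X - (Z @ dirs.T) @ dirs

def counterfactual_flip(X, dirs, center):
    """Mirror the component along the extracted directions (concept 'opposite')."""
    Z = X - center
    return X - 2.0 * (Z @ dirs.T) @ dirs

# Synthetic stand-in for residual-stream activations (hypothetical data).
rng = np.random.default_rng(0)
dim, n = 8, 500
axis = np.zeros(dim)
axis[0] = 1.0
harmful = 3.0 * axis + 0.3 * rng.standard_normal((n, dim))
harmless = -3.0 * axis + 0.3 * rng.standard_normal((n, dim))
X = np.vstack([harmful, harmless])
y = np.array([1] * n + [0] * n)

center = 0.5 * (harmful.mean(axis=0) + harmless.mean(axis=0))
dirs = inlp_directions(X, y, n_iters=1)      # keep only the leading direction

# Unit vector from the harmless mean to the harmful mean, used only for measurement.
u = harmful.mean(axis=0) - harmless.mean(axis=0)
u = u / np.linalg.norm(u)

def coord(A):
    """Mean signed position of A along the harmless-to-harmful axis."""
    return float(((A - center) @ u).mean())

print("harmful, original:  ", coord(harmful))
print("harmful, projected: ", coord(nullspace_project(harmful, dirs, center)))
print("harmful, flipped:   ", coord(counterfactual_flip(harmful, dirs, center)))
```

In this toy setup the projected activations should land close to the midpoint between the two clusters, while the flipped activations should land on the harmless side. The `n_iters` argument plays the role of restricting INLP to its leading directions.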

AI · June 15, 2026 · AI score: 90%

Refusal Beyond a Single Direction: A Preliminary Comparison of Diff-in-Means and INLP

Source: arXiv cs.AI

Article preview

arXiv:2606.13720v1 Announce Type: new Abstract: Arditi et al. (2024) have shown that refusal in safety fine-tuned chat models is mediated by a single linear direction in the residual stream, recoverable by a difference-in-means (DiM) of harmful and harmless activations. We compare DiM-based interventions (activation addition and directional ablation) with two interventions derived from Iterative Nullspace Projection (INLP) -- nullspace projection and counterfactual flipping -- on five open-weigh
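The two DiM-based interventions named in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration on random stand-in vectors, not the paper's code: the array names, shapes, and the `scale` parameter are assumptions, and a real experiment would apply these operations to residual-stream activations inside a model.

```python
import numpy as np

def dim_direction(harmful_acts, harmless_acts):
    """Difference-in-means direction: mean harmful minus mean harmless activation."""
    return harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)

def activation_addition(acts, r, scale=1.0):
    """Shift activations along r (scale > 0 pushes toward the harmful mean)."""
    return acts + scale * r

def directional_ablation(acts, r):
    """Zero out each activation's component along the unit direction r_hat."""
    r_hat = r / np.linalg.norm(r)
    return acts - np.outer(acts @ r_hat, r_hat)

# Random stand-ins for cached activations (hypothetical shapes and values).
rng = np.random.default_rng(1)
harmful_acts = rng.standard_normal((64, 16)) + 2.0
harmless_acts = rng.standard_normal((64, 16))

r = dim_direction(harmful_acts, harmless_acts)
r_hat = r / np.linalg.norm(r)

added = activation_addition(harmless_acts, r)      # harmless mean moves onto harmful mean
ablated = directional_ablation(harmful_acts, r)    # no component left along r_hat

print("max |component along r_hat| after ablation:", np.abs(ablated @ r_hat).max())
```

Ablation removes whatever lies along the direction regardless of its size, while addition shifts every activation by the same vector, so the two are not inverses of each other.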


#AI safety #refusal mechanism #interpretability #activation intervention

3 hours ago

When Sample Selection Bias Precipitates Model Collapse

arXiv:2606.13732v1 Announce Type: new Abstract: The proliferation of recursive training on synthetic data can alleviate data scarcity but risks model collapse, where repeated training erodes distributional tails and homogenizes outputs. Data selection is widely viewed as a remedy, yet its reliabili

#model collapse #synthetic data #data selection


Related posts

When Sample Selection Bias Precipitates Model Collapse

Hyperdimensional computing for structured querying on tabular data embeddings

AI Receptivity or AI Adoption Breadth? A Tool-Specific Reanalysis of the Lower-Literacy/Higher-Usage Link

Minim: Privacy-Aware Minimal View for Agents via Trusted Local Sanitization