Attack Selection in Agentic AI Control Evaluations Meaningfully Decreases Safety

Article preview

arXiv:2606.06529v1 (announce type: new)

Abstract: An attacker that strategically chooses when to attack is much harder to catch than one that attacks indiscriminately. AI control is a safety framework for deploying capable but untrusted AI agents under the oversight of a weaker, trusted monitor and a limited human audit budget. Control evaluations stress-test these protocols by pitting a red-team attack policy against the blue-team monitor, but current evaluations typically assume attackers that

Related posts

Detecting and Mitigating Bias by Treating Fairness as a Symmetry Operation

DiBS: Diffusion-Informed Branch Selection

SafeGene: Reusable Adapters for Transferable Safety Alignment

Lean4Agent: Formal Modeling and Verification for Agent Workflow and Trajectory