Attack Selection in Agentic AI Control Evaluations Meaningfully Decreases Safety
arXiv:2606.06529v1 (announce type: new)

Abstract: An attacker that strategically chooses when to attack is much harder to catch than one that attacks indiscriminately. AI control is a safety framework for deploying capable but untrusted AI agents under the oversight of a weaker, trusted monitor and a limited human audit budget. Control evaluations stress-test these protocols by pitting a red-team attack policy against the blue-team monitor, but current evaluations typically assume attackers that ...
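The opening claim of the abstract can be illustrated with a toy simulation. The sketch below is not the paper's method or setup. It assumes Gaussian monitor suspicion scores, an audit threshold set so that about 2% of benign actions are flagged, and an attacker with a noisy estimate of how suspicious each attack would look. All names and parameter values (`audit_threshold`, `simulate`, `select_fraction`, `estimate_noise`) are hypothetical and chosen only for illustration.

```python
import random


def audit_threshold(benign_scores, audit_budget):
    """Score at or above which an action is audited.

    Chosen so that roughly `audit_budget` of benign actions are flagged.
    """
    ranked = sorted(benign_scores, reverse=True)
    k = max(1, int(len(ranked) * audit_budget))
    return ranked[k - 1]


def caught_fraction(attack_scores, threshold):
    """Fraction of attacks whose suspicion score triggers an audit."""
    return sum(s >= threshold for s in attack_scores) / len(attack_scores)


def simulate(seed=0, n_tasks=10_000, audit_budget=0.02,
             select_fraction=0.1, estimate_noise=0.5):
    rng = random.Random(seed)

    # Assumption: benign actions and attacks both get Gaussian suspicion
    # scores, with attacks looking more suspicious on average.
    benign = [rng.gauss(0.0, 1.0) for _ in range(n_tasks)]
    attacks = [rng.gauss(2.0, 1.0) for _ in range(n_tasks)]
    threshold = audit_threshold(benign, audit_budget)

    # Indiscriminate attacker: attacks at every opportunity.
    indiscriminate = caught_fraction(attacks, threshold)

    # Selective attacker: has a noisy estimate of each attack's suspicion
    # score and attacks only on the opportunities it rates least suspicious.
    estimates = [(s + rng.gauss(0.0, estimate_noise), s) for s in attacks]
    k = max(1, int(n_tasks * select_fraction))
    selected = [true_score for _, true_score in sorted(estimates)[:k]]
    selective = caught_fraction(selected, threshold)

    return indiscriminate, selective


if __name__ == "__main__":
    indiscriminate, selective = simulate()
    print(f"caught when attacking indiscriminately: {indiscriminate:.3f}")
    print(f"caught when selecting opportunities:    {selective:.3f}")
```

Under these assumptions the selective attacker is audited far less often than the indiscriminate one, since it only acts when its attack is expected to fall below the audit threshold. The fraction of attacks caught is used here as a crude stand-in for safety; a real control evaluation would measure safety differently and would not hand the attacker a direct estimate of the monitor's score.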