DEMM-Bench: A Cross-Regime Benchmark for Agent-Runtime Governance-Evidence Sufficiency | AIChainDay

Summary by Claude (translated from Korean) · June 23, 2026

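Agent-runtime systems emit many kinds of records, including traces, ledgers, provenance graphs, and policy logs, but these records do not by themselves answer governance questions about a specific decision. DEMM-Bench is a cross-regime benchmark grounded in the Decision Evidence Maturity Model (DEMM). It measures whether the records from eight evidence regimes are sufficient to reconstruct decision-level attributes, not merely whether such records exist. The benchmark poses attribute questions about actor, authority, action, policy, decision rationale, resource contact, lifecycle context, and verification strength, and applies eight deterministic degradation conditions. Across 64 cases, the trace-based and schema-based baselines overclaimed in 75% of cases and the ledger-based baseline in 50% of cases, whereas the rewritten attribute-level scorer produced zero overclaims and a mean attribute-sufficiency accuracy of 56.25%.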

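• Proposes DEMM-Bench, which measures whether records from eight evidence regimes are sufficient to reconstruct decision-level attributes.
• Applies eight attribute questions (actor, authority, action, policy, and others) and eight deterministic degradation conditions.
• Across 64 cases, the trace-based and schema-based baselines overclaimed in 75% of cases and the ledger-based baseline in 50%.
• The attribute-level scorer achieved zero overclaims and a mean attribute-sufficiency accuracy of 56.25%.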
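The summary does not define how overclaiming or attribute-sufficiency accuracy is computed, so the following is only a minimal sketch under stated assumptions. It assumes that a scorer "overclaims" on a case when it marks at least one attribute as sufficiently evidenced although the records cannot reconstruct it, and that accuracy is the fraction of the eight attributes labeled correctly. The attribute names come from the summary; `CaseResult`, `overclaims`, `attribute_accuracy`, and `summarize` are hypothetical names, and the toy data is invented purely for illustration.

```python
from dataclasses import dataclass

# Attribute names follow the summary; the scoring rules below are assumptions.
ATTRIBUTES = (
    "actor", "authority", "action", "policy",
    "decision_rationale", "resource_contact",
    "lifecycle_context", "verification_strength",
)


@dataclass(frozen=True)
class CaseResult:
    """Hypothetical container for one benchmark case."""
    truth: dict      # attribute -> True if the records really suffice
    predicted: dict  # attribute -> True if the scorer claims sufficiency


def overclaims(case: CaseResult) -> bool:
    # Assumed definition: any attribute claimed sufficient that is not.
    return any(case.predicted[a] and not case.truth[a] for a in ATTRIBUTES)


def attribute_accuracy(case: CaseResult) -> float:
    # Assumed definition: share of attributes whose label matches the truth.
    matches = sum(case.predicted[a] == case.truth[a] for a in ATTRIBUTES)
    return matches / len(ATTRIBUTES)


def summarize(cases: list) -> dict:
    n = len(cases)
    return {
        "overclaim_rate": sum(overclaims(c) for c in cases) / n,
        "mean_attribute_accuracy": sum(attribute_accuracy(c) for c in cases) / n,
    }


# Toy case: only "actor" and "action" are truly reconstructable.
truth = {a: a in ("actor", "action") for a in ATTRIBUTES}

# A container-level scorer treats the presence of a record as sufficient.
container_level = CaseResult(truth, {a: True for a in ATTRIBUTES})

# A maximally cautious scorer claims nothing is sufficient.
cautious = CaseResult(truth, {a: False for a in ATTRIBUTES})

report = summarize([container_level, cautious])
print(report)
```

Under these assumed definitions the toy example also shows why the two numbers are reported separately: the cautious scorer never overclaims, yet it is still wrong about the two attributes that the records do support. For scale, if the reported rates are taken over all 64 cases, 75% corresponds to 48 cases and 50% to 32 cases.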

AI · June 23, 2026

DEMM-Bench: A Cross-Regime Benchmark for Agent-Runtime Governance-Evidence Sufficiency

Source: arXiv cs.AI

Preview

arXiv:2606.20634v1 Announce Type: new Abstract: Agent-runtime systems emit traces, ledgers, provenance graphs, policy logs, delegation tokens, cache events, and tool-firewall records, but those containers do not necessarily answer governance questions about a specific decision. DEMM-Bench is a cross-regime benchmark for agent-runtime governance-evidence sufficiency, grounded in the Decision Evidence Maturity Model (DEMM): it measures whether records across eight evidence regimes are sufficient



Related posts

AlphaMemo: Structured Search-Process Memory for Self-Evolving Alpha Mining Agents

Path-dependent program induction under resource constraints explains human sequence learning

Hypothesis-Disciplined Multi-Agent Automated Formalization of Asymptotic Statistical Theory

SPARC: A Multi-Agent System for Electrical Circuit Question Answering