Anchor: Mitigating Artifact Drift in Agent Benchmark Generation | AIChainDay

Summary by Claude · May 27, 2026

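Anchor is a task-generation pipeline that addresses artifact drift by formalizing expert specifications of business workflows as constrained optimization programs. From a single parametric specification it generates natural-language instructions, environment configurations, solver-certified answers, and state-based verifiers together. ERP-Bench, built with Anchor, is a benchmark of 300 long-horizon tasks covering procurement and manufacturing workflows in a production-grade ERP system. Frontier models satisfied the explicit constraints in only 26.1% of cases and reached a fully optimal solution in only 17.4%, which shows how difficult enterprise agent work remains.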

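• Artifact drift: when instructions, environments, oracles, and verifiers are produced by loosely coupled processes, they disagree with one another, leaving benchmarks unsolvable, reward-hackable, or inconsistent.
• Anchor: blocks drift by generating natural-language instructions, environment configurations, solver-certified answers, and state-based verifiers together from a single parametric specification.
• ERP-Bench: 300 long-horizon procurement and manufacturing workflow tasks in a production-grade ERP system. The generation parameters predict realized difficulty.
• Frontier models satisfy the explicit constraints 26.1% of the time and reach a fully optimal solution 17.4% of the time, confirming the high difficulty of enterprise agent work.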
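The single-specification idea can be illustrated with a minimal Python sketch. It is not the paper's implementation: the `ProcurementSpec` class, its fields, the toy supplier data, and the brute-force `solve` function (standing in for a real constraint solver) are all assumptions made for illustration. What it shows is that when the instruction, the environment state, the oracle, and the verifier are all derived from one object, they cannot disagree about what the task requires.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class ProcurementSpec:
    """Hypothetical parametric spec; every field name is an illustrative assumption."""
    demand: int                 # total units that must be purchased
    unit_price: dict[str, int]  # supplier -> price per unit
    capacity: dict[str, int]    # supplier -> maximum units available


def render_instruction(spec):
    # Natural-language task text derived from the spec, never written by hand.
    suppliers = ", ".join(
        f"{n} (price {spec.unit_price[n]}, capacity {spec.capacity[n]})"
        for n in sorted(spec.unit_price)
    )
    return f"Purchase exactly {spec.demand} units at minimum total cost from: {suppliers}."


def render_environment(spec):
    # Initial state for an imaginary ERP sandbox, derived from the same spec.
    return {
        "suppliers": [
            {"name": n, "unit_price": spec.unit_price[n], "capacity": spec.capacity[n]}
            for n in sorted(spec.unit_price)
        ],
        "purchase_orders": {},
    }


def is_feasible(spec, orders):
    # Explicit constraints: known suppliers, capacity limits, exact demand.
    known = all(n in spec.capacity for n in orders)
    within = known and all(0 <= q <= spec.capacity[n] for n, q in orders.items())
    return within and sum(orders.values()) == spec.demand


def total_cost(spec, orders):
    return sum(spec.unit_price[n] * q for n, q in orders.items())


def solve(spec):
    """Exhaustive search; a stand-in for a real constraint solver."""
    names = sorted(spec.unit_price)
    best = None
    for qtys in product(*(range(spec.capacity[n] + 1) for n in names)):
        orders = dict(zip(names, qtys))
        if is_feasible(spec, orders):
            cost = total_cost(spec, orders)
            if best is None or cost < best[0]:
                best = (cost, orders)
    if best is None:
        raise ValueError("spec admits no feasible solution")
    return best


def make_verifier(spec, optimal_cost):
    # State-based check: inspects the final environment state, not the agent's transcript.
    def verify(final_state):
        orders = final_state.get("purchase_orders", {})
        feasible = is_feasible(spec, orders)
        optimal = feasible and total_cost(spec, orders) == optimal_cost
        return {"feasible": feasible, "optimal": optimal}
    return verify


def generate_task(spec):
    optimal_cost, optimal_orders = solve(spec)
    return {
        "instruction": render_instruction(spec),
        "environment": render_environment(spec),
        "oracle": {"cost": optimal_cost, "orders": optimal_orders},
        "verifier": make_verifier(spec, optimal_cost),
    }


spec = ProcurementSpec(demand=10, unit_price={"A": 5, "B": 7}, capacity={"A": 6, "B": 8})
task = generate_task(spec)
```

The verifier reports feasibility and optimality separately, mirroring the two rates quoted above: a final state can satisfy every explicit constraint and still miss the cheapest plan.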

AI · May 27, 2026 · AI score: 92%

Anchor: Mitigating Artifact Drift in Agent Benchmark Generation

Source: arXiv cs.AI

✨ AI Insights


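1. Anchor: a task-generation pipeline that prevents artifact drift by formalizing business workflows as constrained optimization programs
2. ERP-Bench: a released benchmark of 300 long-horizon tasks based on procurement and manufacturing workflows
3. Frontier models reach only a 26.1% explicit-constraint satisfaction rate and a 17.4% fully optimal solution rate
4. Instructions, environment, verifier, and optimal solution are generated together from a single specification, yielding an auditable evaluation environment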

💡 Why it matters

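By using constrained optimization to resolve artifact drift in agent benchmarks structurally, the work sets a new standard that makes the evaluation of enterprise task-automation agents more reliable.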

🏷️ Mentioned projects

Anchor, ERP-Bench

Article preview

arXiv:2605.26321v1 Announce Type: new Abstract: AI agents are beginning to complete valuable, long-horizon business operations tasks, but training and evaluation environments for enterprise work still struggle to balance realism, verifiability, and scale. Environment and task creation frequently suffers from a failure mode we call artifact drift: when instructions, environments, oracles, and verifiers are created by loosely coupled processes, they frequently disagree on what a task requires, pr


#AIAgents #BenchmarkGeneration #QualityAssurance #ArtifactConsistency
