WorkBench Revisited: Workplace Agents Two Years On | AIChainDay

Summary by Claude · June 15, 2026

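The researchers re-evaluated WorkBench, a benchmark for workplace task automation, as of June 2026. GPT-4, the best performer in March 2024, completed 43% of tasks and took an unintended harmful action, such as emailing the wrong person, on 26% of them. The current best model, Claude Opus 4.8, completes 89% and takes a harmful action on 2.5%. Three points stand out. First, on WorkBench capability and safety do not trade off but improve together: the model that completed the most tasks also caused the least harm. Second, although several error types have disappeared, frontier models still cause irreversible harm through basic mistakes such as sending email to the wrong recipient. Third, the rise of open-weight models has sharply lowered the cost of reaching the performance of earlier proprietary models.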

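• Claude Opus 4.8 completes 89% of WorkBench tasks and takes an unintended harmful action on 2.5%, a large advance over GPT-4 in 2024 (43% / 26%)
• Capability and safety improve together rather than trading off: the model that completes the most tasks also causes the least harm
• Several error types have disappeared, but irreversible basic mistakes such as emailing the wrong recipient remain
• The spread of open-weight models has sharply lowered the cost of a given performance level, while frontier costs have stayed stable
• An updated benchmark with improved data and code quality is released, along with scores and analysis for new models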

AI · June 15, 2026 · AI score: 90%

WorkBench Revisited: Workplace Agents Two Years On

Source: arXiv cs.AI

Article preview

arXiv:2606.13715v1 Announce Type: new Abstract: The best agent on WorkBench in March 2024, GPT-4, completed 43% of tasks and took an unintended harmful action, such as emailing the wrong person, on 26% of them. We re-visit the benchmark in June 2026 and find that the best agent to date, Claude Opus 4.8, completes 89% and takes an unintended harmful action on 2.5%. Aside from this considerable progress in frontier agent performance, three things stand out. First, capability and safety go togethe

#AI agents #benchmark #workplace automation #agent safety

3 hours ago

When Sample Selection Bias Precipitates Model Collapse

arXiv:2606.13732v1 Announce Type: new Abstract: The proliferation of recursive training on synthetic data can alleviate data scarcity but risks model collapse, where repeated training erodes distributional tails and homogenizes outputs. Data selection is widely viewed as a remedy, yet its reliabili

#model collapse #synthetic data #data selection

Related articles

When Sample Selection Bias Precipitates Model Collapse

Hyperdimensional computing for structured querying on tabular data embeddings

AI Receptivity or AI Adoption Breadth? A Tool-Specific Reanalysis of the Lower-Literacy/Higher-Usage Link

Minim: Privacy-Aware Minimal View for Agents via Trusted Local Sanitization