ScientistOne: Towards Human-Level Autonomous Research via Chain-of-Evidence | AIChainDay

🇰🇷 Korean summary by Claude · May 27, 2026

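ScientistOne is a fully autonomous research agent built on the Chain-of-Evidence (CoE) framework, which requires every claim to be traceable to its evidence source. Existing autonomous research agents recorded hallucinated-citation rates of up to 21% and score-verification pass rates as low as 42%. ScientistOne, by contrast, achieved zero hallucinated citations (0/337), 100% score verification (12/12), and the highest method-code alignment (14/15). It matched or exceeded human expert performance on 5 frontier research tasks, generalized to 6 additional tasks including medical imaging and 3D perception, and won a gold medal on MLE-Bench. It is an important step toward solving the verifiability problem of AI research agents.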

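• Chain-of-Evidence (CoE): a verifiability framework that requires every claim to be linked to its evidence source across the whole pipeline of literature review, solution discovery, and paper writing.
• ScientistOne: outperforms all existing baselines with zero hallucinated citations (0/337), 100% score verification (12/12), and the highest method-code alignment (14/15).
• An analysis of 75 papers across 5 systems and 5 tasks found systematic failure modes in every existing baseline (citation error rates of up to 21%, score verification as low as 42%).
• Generalizes to 6 additional tasks, including medical imaging, fine-particle recognition, 3D perception, and language modeling, and earns a gold medal on MLE-Bench.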
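The traceability requirement described in the bullets above can be illustrated with a minimal sketch. It is not ScientistOne's implementation: the `Evidence` and `Claim` types, the `untraceable_claims` helper, and the example locators are all hypothetical names introduced here, and the check assumes that "traceable" simply means every claim cites at least one evidence item whose locator resolves to a known artifact.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Evidence:
    """Hypothetical evidence record: what kind of source, and where to find it."""
    kind: str      # e.g. "paper", "experiment_log", "code"
    locator: str   # e.g. a DOI, a log path, or a file path


@dataclass
class Claim:
    """Hypothetical claim record with the evidence items it relies on."""
    text: str
    evidence: list[Evidence] = field(default_factory=list)


def untraceable_claims(claims, known_locators):
    """Return claims with no evidence, or with evidence that does not resolve."""
    flagged = []
    for claim in claims:
        has_evidence = bool(claim.evidence)
        all_resolve = all(e.locator in known_locators for e in claim.evidence)
        if not (has_evidence and all_resolve):
            flagged.append(claim)
    return flagged


# Illustrative data only; none of these claims or paths come from the paper.
claims = [
    Claim("The method improves validation accuracy.",
          [Evidence("experiment_log", "runs/exp1.json")]),
    Claim("Prior work reports a similar effect.",
          [Evidence("paper", "doi:not-in-index")]),
    Claim("The approach is widely adopted."),
]
known_locators = {"runs/exp1.json"}

flagged = untraceable_claims(claims, known_locators)
for claim in flagged:
    print("untraceable:", claim.text)
```

In this sketch the second claim is flagged because its citation does not resolve, and the third because it cites nothing. These loosely correspond to the fabricated-citation and unsupported-claim failures the summary describes.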

AI · May 27, 2026 · AI score: 94%

ScientistOne: Towards Human-Level Autonomous Research via Chain-of-Evidence

Source: arXiv cs.AI

✨ AI Insights


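1. ScientistOne: releases the CoE verification framework, which preserves links to original evidence, together with an end-to-end autonomous research system.
2. Evaluation of 75 papers from 5 systems: existing systems show hallucinated-citation rates of up to 21% and score-verification pass rates around 42%.
3. ScientistOne: achieves zero hallucinated citations (0/337), 100% score verification, and the highest code alignment.
4. Gold-medal-level result on an MLE-Bench task, and SOTA on 6 additional tasks including medical imaging.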
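The counts quoted in the summary convert to percentages as follows. This is only a worked-arithmetic sketch: the numerators and denominators are the ones reported above, while the `as_percent` helper and the dictionary labels are names chosen here for illustration.

```python
# Counts as reported in the summary: (numerator, denominator).
reported = {
    "hallucinated citations": (0, 337),
    "score verification": (12, 12),
    "method-code alignment": (14, 15),
}


def as_percent(numerator, denominator):
    """Convert a count ratio to a percentage rounded to one decimal place."""
    return round(100 * numerator / denominator, 1)


for label, (num, den) in reported.items():
    print(f"{label}: {num}/{den} = {as_percent(num, den)}%")
```

The only ratio that is not a round figure is method-code alignment, where 14/15 comes to roughly 93.3%.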

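💡 Why does it matter?

ScientistOne systematically characterizes verifiability failures, the core weakness of AI science agents, and resolves them structurally. In doing so it sets a new standard for the reliability of AI-driven research automation.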

🏷️ Mentioned projects

ScientistOne, MLE-Bench

Article preview

arXiv:2605.26340v1 Announce Type: new Abstract: Autonomous research agents produce competitive solutions and professional-looking manuscripts, yet their outputs contain verifiability failures undetectable by surface-level evaluation: fabricated citations, unreproducible scores, and method descriptions that diverge from the implementation. We address this through three contributions. First, Chain-of-Evidence (CoE), a verifiability framework requiring every claim to be traceable to its evidence s


#autonomous research agents #verifiability #AI science automation #research reproducibility
