OmniToM: Benchmarking Theory of Mind in LLMs via Explicit Belief Modeling | AIChainDay

🇰🇷 Korean summary by Claude · May 27, 2026

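OmniToM is a benchmark that evaluates the Theory of Mind (ToM) abilities of LLMs by requiring them to explicitly model the belief structure of every relevant agent in a narrative. It addresses a limitation of conventional end-point question answering, which cannot show whether a model actually constructs mental-state representations. The benchmark builds 22,343 labeled belief propositions from 895 ToMBench stories, each labeled under a seven-dimension schema: recursive order, truth status, knowledge access, explicitness, content type, mental source, and context. Zero-shot evaluation reveals an agent-specific belief-tracking bottleneck: current LLMs are especially weak at knowledge-access judgments, the step that converts narrative facts into agent beliefs.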

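• OmniToM requires models to explicitly extract and label the belief structure of every relevant agent in a narrative, overcoming the limits of end-point QA.
• 22,343 labeled belief propositions built from 895 stories, labeled under a seven-dimension schema (recursive order, truth status, knowledge access, and others).
• Two-stage evaluation: stage 1 extracts beliefs and stage 2 labels them under the seven-dimension schema, enabling fine-grained analysis of Theory of Mind.
• Zero-shot evaluation reveals an agent-specific belief-tracking bottleneck: current LLMs are especially weak at knowledge-access judgments, which convert narrative facts into agent beliefs.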
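The seven-dimension labeling described above can be pictured as a small data structure plus a per-dimension score. The following is only a minimal sketch: the dimension names come from the summary, but the class `BeliefProposition`, the function `per_dimension_accuracy`, the label values such as `"not_witnessed"`, and the accuracy metric itself are hypothetical illustrations, not the benchmark's actual format or scoring.

```python
from dataclasses import dataclass

# The seven label dimensions named in the summary. The label values used
# further down ("first", "false", ...) are hypothetical placeholders.
DIMENSIONS = (
    "recursive_order",
    "truth_status",
    "knowledge_access",
    "explicitness",
    "content_type",
    "mental_source",
    "context",
)


@dataclass
class BeliefProposition:
    agent: str        # who holds the belief
    proposition: str  # what the agent believes
    labels: dict      # dimension name -> label value


def per_dimension_accuracy(gold, predicted):
    """Share of matching labels per dimension over aligned gold/predicted pairs."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must be aligned one-to-one")
    if not gold:
        return {}
    correct = dict.fromkeys(DIMENSIONS, 0)
    for g, p in zip(gold, predicted):
        for dim in DIMENSIONS:
            # A missing predicted label counts as a miss.
            if g.labels[dim] == p.labels.get(dim):
                correct[dim] += 1
    return {dim: correct[dim] / len(gold) for dim in DIMENSIONS}


# Toy false-belief story: Sally did not see the marble being moved.
sally_labels = {
    "recursive_order": "first",
    "truth_status": "false",
    "knowledge_access": "not_witnessed",
    "explicitness": "implicit",
    "content_type": "location",
    "mental_source": "perception",
    "context": "physical",
}
anne_labels = {**sally_labels, "truth_status": "true", "knowledge_access": "witnessed"}

gold = [
    BeliefProposition("Sally", "the marble is in the basket", sally_labels),
    BeliefProposition("Anne", "the marble is in the box", anne_labels),
]
# The hypothetical model wrongly assumes Sally witnessed the move.
predicted = [
    BeliefProposition(
        "Sally",
        "the marble is in the basket",
        {**sally_labels, "knowledge_access": "witnessed"},
    ),
    BeliefProposition("Anne", "the marble is in the box", anne_labels),
]

scores = per_dimension_accuracy(gold, predicted)
print(scores["knowledge_access"])  # 0.5
print(scores["truth_status"])      # 1.0
```

Scoring each dimension separately is what lets a weakness such as knowledge-access judgment show up on its own, instead of being averaged into a single final-answer accuracy.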

AI · May 27, 2026 · AI score: 94%

OmniToM: Benchmarking Theory of Mind in LLMs via Explicit Belief Modeling

Source: arXiv cs.AI

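✨ AI Insights

🧑‍💻 Developers

1. OmniToM: introduces a benchmark that evaluates ToM ability by explicitly extracting and labeling belief structures
2. Builds a human-calibrated dataset of 895 stories containing 22,343 labeled belief propositions
3. Uses a seven-dimension belief-label schema to identify knowledge-access and belief-tracking bottlenecks in current LLMs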

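💡

Why does it matter?

The work points out that QA-style ToM evaluation has not measured the actual ability of LLMs to represent mental states, and it raises the precision of social AI research through fine-grained evaluation at the level of individual belief propositions.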

🏷️ Mentioned projects

OmniToM, ToMBench

Article preview

arXiv:2605.26322v1 Announce Type: new Abstract: Theory of Mind (ToM), the ability to infer others' knowledge, intentions, and emotions, is commonly evaluated in large language models (LLMs) using end-point question answering, where performance is judged solely by the final answer to a social reasoning query. This paradigm obscures whether the model actually constructs the underlying mental-state representations required for robust reasoning, particularly in scenarios involving divergent, evolvi


#Theory of Mind #LLM benchmark #Social reasoning #Belief modeling
