Can Language Model Agents be Helpful Circuit Explainers in Mechanistic Interpretability? | AIChainDay

Summary by Claude · June 24, 2026

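This study tests whether LLM agents can automatically explain circuits that have already been identified. The researchers built AgenticInterpBench, a benchmark of 84 semi-synthetic transformer circuits with 163 component-level annotations, and proposed HyVE (Hypothesize, Validate, Explain), an agentic explainer that analyzes each component through an iterative loop of observation, hypothesis generation, and causal validation. All four LM backbones recovered useful component-level and task-level explanations, but no backbone was consistently the best. The analysis shows that strong backbones form hypotheses that are well grounded in observations, whereas failures arose mainly late in the validation stage: incomplete validation plans, code execution errors, and unresolved hypotheses. A case study on an arithmetic circuit in Llama-3-8B also showed that the approach can extend to real trained models. Together, the results suggest that LM agents are promising circuit explainers, with reliable validation as the key remaining challenge.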

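• Builds AgenticInterpBench, a benchmark of 84 semi-synthetic circuits with 163 component annotations
• Proposes HyVE, an agentic explainer built on an iterative loop of observation, hypothesis generation, and causal validation
• All four LM backbones recover useful explanations, but none is consistently the best
• Failures arise mainly late in the validation loop (incomplete validation, code errors) rather than in hypothesis generation
• A Llama-3-8B arithmetic circuit case study demonstrates that the approach can extend to real trained models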
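
The loop of observation, hypothesis generation, and causal validation described above can be illustrated with a toy sketch. This is not the HyVE implementation: the "component" is a plain Python function, the fixed candidate list stands in for hypotheses an LM backbone would propose, and every name (`Hypothesis`, `observe`, `propose`, `validate`, `explain`) is hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Hypothesis:
    """A candidate explanation (hypothetical structure, not from the paper)."""
    description: str
    predict: Callable[[int], int]  # what the component should output for input x


def observe(component: Callable[[int], int], inputs: List[int]) -> List[Tuple[int, int]]:
    """Record the component's behaviour on a set of inputs."""
    return [(x, component(x)) for x in inputs]


def propose(observations: List[Tuple[int, int]], candidates: List[Hypothesis]) -> List[Hypothesis]:
    """Keep only candidates consistent with every observation.
    In an agentic explainer, an LM would generate these; here they are fixed."""
    return [h for h in candidates if all(h.predict(x) == y for x, y in observations)]


def validate(component: Callable[[int], int], hypothesis: Hypothesis, probes: List[int]) -> bool:
    """Toy stand-in for causal validation: test the hypothesis on new probe inputs."""
    return all(hypothesis.predict(x) == component(x) for x in probes)


def explain(component, candidates, rounds) -> Optional[str]:
    """Iterate observe -> hypothesize -> validate; return None if nothing survives."""
    for observed_inputs, probe_inputs in rounds:
        observations = observe(component, observed_inputs)
        for hypothesis in propose(observations, candidates):
            if validate(component, hypothesis, probe_inputs):
                return hypothesis.description
    return None  # an unresolved hypothesis, one of the failure modes noted above


CANDIDATES = [
    Hypothesis("copies its input", lambda x: x),
    Hypothesis("computes x mod 3", lambda x: x % 3),
    Hypothesis("computes x mod 2", lambda x: x % 2),
]
ROUNDS = [([0, 1, 2], [3, 4, 5])]

print(explain(lambda x: x % 3, CANDIDATES, ROUNDS))
print(explain(lambda x: x * x, CANDIDATES, ROUNDS))
```

In this toy run, "copies its input" fits the first observations but is rejected by the probe inputs, which shows why the validation step matters: a hypothesis that agrees with the observations can still be wrong.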

AI · June 24, 2026

Can Language Model Agents be Helpful Circuit Explainers in Mechanistic Interpretability?

Source: arXiv cs.AI

Article preview

arXiv:2606.24026v1 Announce Type: new Abstract: Mechanistic interpretability has made substantial progress in automatically localizing circuits, but explaining what localized components do remains labor-intensive and difficult to standardize. In this work, we study whether language model (LM) agents can assist with this explanation problem once a circuit has already been identified. We introduce AgenticInterpBench, a benchmark for circuit explanation built from 84 semi-synthetic transformer cir



Critique of Agent Model

arXiv:2606.23991v1 Announce Type: new Abstract: What is an agent? What constitutes agency? With the rise of Large Language Model (LLM) systems marketed as "coding agents", "AI co-scientists", and other "agentic" tools that promise to drive up productivity, and at the same time, "existential"

Media: arXiv cs.AI


Related articles

Critique of Agent Model

Beyond Trajectory Imitation: Strategy-Guided Policy Optimization for LLM Reasoning

Exploring Academic Influence of Algorithms by Co-occurrence Network Based on Full-text of Academic Papers

An Introduction to Causal Reinforcement Learning