Metric Match: A Subset Selection Approach to Evaluating LLM Judge Reliability | AIChainDay

🇰🇷 Korean summary by Claude · June 16, 2026

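Metric Match is a method for estimating correlation-based reliability metrics of LLM judges from limited annotations. It selects the subset of samples for humans to annotate so that the subset's reliability metric, computed against synthetic labels, matches the population-level metric, which reduces costly human annotation. Across 15 datasets and 4 correlation metrics, it achieved a win rate of 0.838 against random subset selection, lowered mean estimation error by 18.7%, and reduced the number of annotations required by 32.5%. In a medical case study it saved $1,041.67 in expert annotation costs relative to random selection, and it outperformed random selection not only at reliability estimation but also on the classification task of deciding whether a judge exceeds a deployment threshold.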

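• Proposes Metric Match, which estimates correlation-based reliability metrics of LLM judges from limited human annotations
• Selects the subset of samples to annotate so that its metric against synthetic labels matches the population-level metric
• Across 15 datasets and 4 correlation metrics: win rate of 0.838 against random selection, estimation error reduced by 18.7%
• Annotation requirement reduced by 32.5%; $1,041.67 saved in expert annotation costs in a medical case study
• Outperforms random selection at classifying whether a judge exceeds a deployment threshold, not only at reliability estimation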
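The selection idea summarized above can be illustrated with a small sketch. The summary does not say how Metric Match searches for its subset, so the code below substitutes a plain random search: it draws candidate subsets and keeps the one whose judge-vs-synthetic-label Pearson correlation is closest to the population value. The function names (`pearson`, `select_matching_subset`), the parameters `k` and `n_candidates`, and the simulated data are all hypothetical and chosen only for illustration; this is not the authors' algorithm.

```python
import numpy as np


def pearson(x, y):
    """Pearson correlation between two 1-D arrays."""
    return float(np.corrcoef(x, y)[0, 1])


def select_matching_subset(judge, synthetic, k, n_candidates=500, seed=0):
    """Pick k indices whose judge-vs-synthetic correlation best matches
    the correlation over the full population.

    ASSUMPTION: random search is a stand-in; the source does not
    describe the actual selection procedure.
    """
    rng = np.random.default_rng(seed)
    judge = np.asarray(judge, dtype=float)
    synthetic = np.asarray(synthetic, dtype=float)

    target = pearson(judge, synthetic)  # population-level metric
    best_idx, best_gap = None, float("inf")

    for _ in range(n_candidates):
        idx = rng.choice(len(judge), size=k, replace=False)
        gap = abs(pearson(judge[idx], synthetic[idx]) - target)
        # A NaN gap (constant subset) never compares as smaller, so it is skipped.
        if gap < best_gap:
            best_idx, best_gap = idx, gap

    if best_idx is None:
        raise ValueError("no candidate subset had a defined correlation")
    return np.sort(best_idx), target, best_gap


# Simulated population: synthetic labels and noisy judge scores (made-up data).
data_rng = np.random.default_rng(42)
synthetic = data_rng.normal(size=1000)
judge = 0.6 * synthetic + 0.8 * data_rng.normal(size=1000)

idx, target, gap = select_matching_subset(judge, synthetic, k=50)
print(f"population correlation: {target:.3f}")
print(f"gap on selected subset: {gap:.5f}")
```

In this sketch, only the samples in `idx` would be sent to human annotators, and the judge-vs-human correlation on that subset would serve as the reliability estimate. Whether matching on synthetic labels carries over to human labels is the empirical claim the paper evaluates; the sketch does not test it.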

AI · June 16, 2026

Metric Match: A Subset Selection Approach to Evaluating LLM Judge Reliability

Source: arXiv cs.AI

Article preview

arXiv:2606.15029v1 Announce Type: new Abstract: LLM judges are used to reduce the need for costly human labor in evaluating open-ended text generation. However, the reliability of these judges depends critically on their alignment with human raters -- a property that itself depends on costly human annotations. In this work, we develop a method (Metric Match) for estimating correlation-based reliability metrics of LLM judges from limited annotations. Metric Match selects a subset of samples for




Related articles

Dr-DCI: Scaling Direct Corpus Interaction via Dynamic Workspace Expansion

Relational Structural Causal Models

Cognitive Debt: AI as Intellectual Leverage and the Dynamics of Systemic Fragility

Towards Verifiable Agentic Data Science: Solving Irregular TSQA Via Tool-Grounded Reasoning