BayesBench: Evaluating LLM Belief Trajectories Under Multi-Turn Evidence Accumulation

Preview

arXiv:2606.30850v1 Announce Type: new

Abstract: Large language models (LLMs) are typically deployed in multi-turn conversations, where each turn provides new evidence that should reduce epistemic uncertainty about their environment. Acting rationally then requires inferring the unobserved quantities that govern it and updating beliefs about them as evidence accumulates. Yet most evaluations only score the model's final-turn answer in a single-turn format, leaving this process unexamined. We ask

Related posts

What Drives Interactive Improvement from Feedback?

Contrastive Reflection for Iterative Prompt Optimization

How Can AI Find My Model? A Model-Finding Experimental Study Considering Data Formats, Embeddings, and Retrieval Strategies

When Does Learning to Stop Help? A Cost-Aware Study of Early Exits in Reasoning Models