PHREEQC-MCQ-200: A Diagnostic Benchmark for Tool-Augmented Scientific Simulator Agents

Preview

arXiv:2607.00436v1 (announce type: new)

Abstract: Large language model agents are increasingly connected to scientific software, yet it remains unclear when tool access makes scientific computation more reliable rather than merely more complex. We introduce PHREEQC-MCQ-200, a benchmark for evaluating tool-augmented agents on deterministic aqueous-geochemistry simulations. The benchmark contains 200 multiple-choice questions derived from 21 validated PHREEQC scenarios, requiring agents to construc
