arXiv:2606.05384v1 Announce Type: new Abstract: LLM-as-judge evaluation is widely used in benchmarking pipelines, where model outputs are compared and ranked using automated evaluators. These pipelines typically assume that judgments are stable properties of fixed inputs. We show that this assumpti

#LLM evaluation #robustness #manipulability

📰 Media: arXiv cs.AI


AI 🧑‍💻 Developers 👥 General

1 day ago

LeanMarathon: Toward Reliable AI Co-Mathematicians through Long-Horizon Lean Autoformalization

Improving the reliability of AI mathematicians

How Far Did They Go? The Persuasive Tactics of Covert LLM Agents in a Discontinued Field Experiment

arXiv:2606.05256v1 Announce Type: new Abstract: This study analyzes a publicly released dataset from a discontinued field experiment on Reddit's r/ChangeMyView. The intervention, conducted by unknown, external researchers and halted following ethical backlash, involved undisclosed AI-generated acco

📰 Media: arXiv cs.AI


How to Properly Use an AI Code Assistant


Related posts

Thousand Token Wood: shipping a multi-agent economy on a 3B model

Stability vs. Manipulability: Evaluating Robustness Under Post-Decision Interaction in LLM Judges

LeanMarathon: Toward Reliable AI Co-Mathematicians through Long-Horizon Lean Autoformalization

How Far Did They Go? The Persuasive Tactics of Covert LLM Agents in a Discontinued Field Experiment