Scaling Trends for Lie Detector Oversight in Preference Learning

Preview

arXiv:2607.01567v1

Abstract: Deceptive behavior in LLMs is costly to monitor and prevent, motivating approaches such as Scalable Oversight via Lie Detectors (SOLiD) (Cundy & Gleave, 2025), which uses lie detectors to identify responses for review by high-cost labelers. In this paper, we scale SOLiD to larger models and evaluate it in more diverse and realistic preference-learning settings. We find favorable scaling: undetected deception drops from 34% for 1B-parameter models
