Quantization Inflates Reasoning: Token Inflation as a Hidden Cost of Low-Bit Reasoning Models

Preview

arXiv:2606.25519v1 (announce type: new)

Abstract: Quantization is widely used to reduce the inference cost of large language models, but its effect on reasoning models is not fully captured by final-answer accuracy or per-token latency. We show that low-bit post-training quantization can introduce a hidden test-time compute cost: quantized reasoning models often generate longer chains of thought even when they still answer correctly. Across mathematical reasoning, code generation, scientific ques
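The preview does not say how the paper measures token inflation, so the following is only an illustrative sketch of the idea that per-token speedups can be eroded by longer outputs. It assumes a hypothetical record format (`tokens`, `correct`) and two hypothetical helpers, `token_inflation` and `effective_speedup`. Inflation is taken here as the mean ratio of quantized to baseline chain-of-thought length over problems both models answer correctly, which is one plausible definition and not necessarily the authors'.

```python
from statistics import mean


def token_inflation(baseline, quantized):
    """Mean ratio of quantized to baseline output length.

    Both arguments are lists of dicts with keys "tokens" (int) and
    "correct" (bool), aligned by problem. Only problems that BOTH models
    answer correctly are counted, so the ratio reflects longer reasoning
    rather than failed attempts. (Assumed definition, for illustration.)
    """
    ratios = [
        q["tokens"] / b["tokens"]
        for b, q in zip(baseline, quantized)
        if b["correct"] and q["correct"]
    ]
    if not ratios:
        raise ValueError("no problems answered correctly by both models")
    return mean(ratios)


def effective_speedup(per_token_speedup, inflation):
    """End-to-end speedup once the extra generated tokens are paid for."""
    return per_token_speedup / inflation


# Made-up numbers, purely to exercise the functions.
baseline = [
    {"tokens": 1000, "correct": True},
    {"tokens": 2000, "correct": True},
    {"tokens": 500, "correct": True},
]
quantized = [
    {"tokens": 1500, "correct": True},
    {"tokens": 2000, "correct": True},
    {"tokens": 900, "correct": False},  # excluded: quantized model is wrong
]

inflation = token_inflation(baseline, quantized)
speedup = effective_speedup(2.0, inflation)
```

With these made-up inputs, a nominal 2x per-token speedup shrinks to 1.6x end to end, because the quantized model emits 25% more tokens on the problems it still solves.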

Related posts

The Hitchhiker's Guide to Agentic AI: From Foundations to Systems

Project Auto-World: Towards Automated Benchmarking of Neural Relational Reasoners

Diagnosing and Mitigating Compounding Failures in Agentic Persuasion via Taxonomic Strategy Retrieval

Do vision-language models search like humans? Reasoning tokens as a reaction-time analog in classic visual-search paradigms