Quantization Inflates Reasoning: Token Inflation as a Hidden Cost of Low-Bit Reasoning Models
arXiv:2606.25519v1 (announce type: new)

Abstract: Quantization is widely used to reduce the inference cost of large language models, but its effect on reasoning models is not fully captured by final-answer accuracy or per-token latency. We show that low-bit post-training quantization can introduce a hidden test-time compute cost: quantized reasoning models often generate longer chains of thought even when they still answer correctly. Across mathematical reasoning, code generation, scientific ques
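To make the idea of "longer chains of thought even when they still answer correctly" concrete, here is a minimal sketch of one way such an inflation could be measured. The metric below (ratio of mean generated-token counts over items that both models answer correctly) is an assumption chosen for illustration, not necessarily the definition used in the paper. The function name `token_inflation` and the example records are hypothetical, and the numbers are made up rather than taken from any reported result.

```python
from statistics import mean


def token_inflation(baseline, quantized):
    """Ratio of mean generated-token counts (quantized / baseline).

    Each argument maps an item id to (num_generated_tokens, is_correct).
    Only items answered correctly by BOTH models are compared, so that
    differences in accuracy do not leak into the length comparison.
    This metric is an illustrative assumption, not the paper's definition.
    """
    shared = [
        item
        for item in baseline
        if item in quantized and baseline[item][1] and quantized[item][1]
    ]
    if not shared:
        raise ValueError("no items answered correctly by both models")
    base_len = mean(baseline[item][0] for item in shared)
    quant_len = mean(quantized[item][0] for item in shared)
    return quant_len / base_len


# Made-up example records: (generated tokens, answered correctly?)
baseline_runs = {"q1": (100, True), "q2": (200, True), "q3": (150, False)}
quantized_runs = {"q1": (150, True), "q2": (300, True), "q3": (400, True)}

# q3 is excluded because the baseline answered it incorrectly.
ratio = token_inflation(baseline_runs, quantized_runs)
print(f"token inflation on shared-correct items: {ratio:.2f}x")
```

Under this assumed metric, a ratio above 1 would mean the quantized model spends more generated tokens per correct answer, which can offset part of the per-token savings that quantization provides.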