Reasoning, Code, or Both? How Large Language Models Handle Variations in Math Questions | AIChainDay

Summary by Claude · May 27, 2026

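This work systematically verifies that LLMs remain vulnerable to simple variations even when they score highly on math reasoning benchmarks. Three approaches were evaluated with Claude Haiku 4.5 on 1,000 problems from the GSM-Symbolic dataset: pure reasoning (CoT), single code execution (PAL), and iterative code execution (SBSC). CoT was the most robust, with an accuracy drop of 1.3 percentage points, and PAL the most fragile at 1.7 percentage points; the difference was not statistically significant (p=.096). The conclusion that code execution does not improve reasoning robustness to variations of elementary-level problems exposes a real limit of current LLM math ability.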

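• Original-versus-variant pair comparison on 1,000 GSM-Symbolic problems. CoT dropped 1.3 percentage points (3.1% of problems collapsed) and PAL 1.7 percentage points (3.1% collapsed), so code execution brought no robustness gain.
• Across the three methods the ordering was consistently CoT > SBSC > PAL, but it fell short of statistical significance (p=.096).
• That performance drops even when only names or numbers change suggests LLMs may not genuinely understand the underlying mathematics.
• The methodology uses Claude Haiku 4.5 to compare the three methods fairly under identical conditions.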
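The two figures quoted in the bullets above (accuracy drop in percentage points and the share of problems that "collapsed") can be computed from paired per-problem results. The following is a minimal sketch, not the paper's code: the `PairResult` type and `robustness_metrics` function are hypothetical names, and "collapsed" is assumed here to mean "solved on the original but failed on the variant", which the summary does not define.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PairResult:
    """Outcome for one problem (hypothetical container, not from the paper)."""
    original_correct: bool
    variant_correct: bool


def robustness_metrics(pairs: list[PairResult]) -> dict[str, float]:
    """Summarize paired original/variant results for one method."""
    if not pairs:
        raise ValueError("need at least one original/variant pair")
    n = len(pairs)
    original_hits = sum(p.original_correct for p in pairs)
    variant_hits = sum(p.variant_correct for p in pairs)
    # Assumption: a problem "collapses" when the original is solved
    # but its variant is not.
    collapsed = sum(p.original_correct and not p.variant_correct for p in pairs)
    return {
        "original_accuracy": original_hits / n,
        "variant_accuracy": variant_hits / n,
        # Difference of two accuracies, expressed in percentage points.
        "drop_pp": 100 * (original_hits - variant_hits) / n,
        "collapse_rate": collapsed / n,
    }


if __name__ == "__main__":
    # Toy data only; these are not results from the paper.
    toy = (
        [PairResult(True, True)] * 7
        + [PairResult(True, False)] * 2
        + [PairResult(False, True)] * 1
    )
    print(robustness_metrics(toy))
```

Note that the drop and the collapse rate can differ: variants that are newly solved offset collapsed ones in the net accuracy drop, but not in the collapse rate.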

AI · May 27, 2026 · AI score: 94%

Reasoning, Code, or Both? How Large Language Models Handle Variations in Math Questions

Source: arXiv cs.AI

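✨ AI Insights

🧑‍💻 Developer

1. Evaluates three methods (CoT, PAL, and SBSC) on original/variant pairs from 1,000 GSM-Symbolic problems, using Claude Haiku 4.5
2. CoT is the most robust with a 1.3 percentage point accuracy drop, PAL the most fragile at 1.7 percentage points, and SBSC falls in between
3. Shows empirically that code execution (PAL, SBSC) does not improve reasoning robustness to variations of elementary-level math problems
4. The differences fall short of statistical significance (p=0.096), but the direction is consistent across all metrics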
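The summary does not say which statistical test produced p=0.096. As an illustration only, one common choice for paired binary outcomes (for example, whether each of two methods fails on the same problem) is the exact McNemar test, which looks only at the discordant pairs. The sketch below assumes that setup; `mcnemar_exact` is a hypothetical helper name, not something from the paper.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value.

    b: pairs where only the first condition succeeded
    c: pairs where only the second condition succeeded
    Under the null hypothesis, each discordant pair is a fair coin flip.
    """
    if b < 0 or c < 0:
        raise ValueError("discordant counts must be non-negative")
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    # Probability of a split at least as lopsided as (k, n - k) in one tail.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    # Toy counts only; these are not figures from the paper.
    print(mcnemar_exact(10, 0))
    print(mcnemar_exact(5, 5))
```

With few discordant pairs the test has little power, which is one way a consistent direction across metrics can still fail to reach significance.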

💡

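Why does it matter?

The finding that code execution may raise math reasoning accuracy yet does not improve robustness to problem variations bears directly on which method to choose for LLM-based math tutors and exam grading systems.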

Article preview

arXiv:2605.26414v1 Announce Type: new Abstract: Large Language Models (LLMs) achieve impressive accuracy on mathematical reasoning benchmarks, yet their performance drops when problems are modified with simple changes like different names or numbers. Code execution methods, which let models generate and run Python code instead of reasoning in natural language, have been proposed as a solution, but their effect on reasoning robustness (the ability to maintain accuracy across problem variations)


#LLM math reasoning #code execution #robustness evaluation #problem variation
