Why LLMs Struggle with Arithmetic Puzzles

This article covers the appendix of a paper on a challenging symbolic arithmetic puzzle benchmark. It lists the supervised fine-tuning setup (LoRA, AdamW, and a cosine learning rate scheduler) and shows that the base model (open-llama-3B), as well as Llama-2-7B, Deepseek-Coder-33B, and GPT4, fail to solve the puzzle with few-shot or Chain-of-Thought prompting alone. The authors attribute this to the symbolic form of the prompt, which a model may struggle to comprehend without fine-tuning on the synthetic data.


This content originally appeared on HackerNoon and was authored by Extrapolate

:::info Authors:

(1) Haolong Li, Tongji University and work done during internship at ByteDance (furlongli322@gmail.com);

(2) Yu Ma, Seed Foundation, ByteDance (mayu.1231@bytedance.com);

(3) Yinqi Zhang, East China Normal University and work done during internship at ByteDance (zhang.inch@gmail.com);

(4) Chen Ye (Corresponding Author), ESSC Lab, Tongji University (yechen@tongji.edu.cn);

(5) Jie Chen, Seed Foundation, ByteDance and a Project Leader (chenjiexjtu@gmail.com).

:::

Abstract and 1 Introduction

2 Problem Definition

2.1 Arithmetical Puzzle Problem

2.2 Data Synthesizing

2.3 Dataset

3 Model

4 Experiments

4.1 Evaluation

4.2 Results

4.3 Case Studies

5 Conclusion and Acknowledgements

6 Limitations

7 Ethics Statement and References

A Appendix

A.1 Hyperparameter Settings

A.2 Evaluation of the Base Model

A.3 Case Study

A.4 Visualization of the Proposed Puzzle

A.1 Hyperparameter Settings

In the SFT stage, we follow common fine-tuning hyperparameter settings for our model. We set the learning rate to 1e−4 and adopt the cosine learning rate scheduler. We use low-rank adaptation (LoRA) tuning with a rank of 5, α of 32, and dropout of 0.05. We employ the AdamW optimizer with β1 = 0.9, β2 = 0.95, and ϵ = 1e−9. Eight NVIDIA A100-SXM4-80GB GPUs are used to train the model with a batch size of 50 and a maximum of 5 epochs. Detailed settings are listed in Table 3.

Table 3: Hyperparameter Settings.
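The settings above can be collected in a small sketch. This is not the authors' training code: the class and field names are hypothetical, the number of optimizer steps is made up for illustration, and the schedule assumes a plain cosine decay to zero with no warmup, which the text does not specify.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class PuzzleSFTSettings:
    """Values quoted in A.1; the field names are illustrative, not the authors'."""
    learning_rate: float = 1e-4
    lora_rank: int = 5
    lora_alpha: int = 32
    lora_dropout: float = 0.05
    adam_beta1: float = 0.9
    adam_beta2: float = 0.95
    adam_eps: float = 1e-9
    batch_size: int = 50
    max_epochs: int = 5
    num_gpus: int = 8


def cosine_lr(step: int, total_steps: int, base_lr: float, min_lr: float = 0.0) -> float:
    """Cosine decay from base_lr at step 0 to min_lr at total_steps.

    Assumption: no warmup phase and a floor of min_lr (0 by default).
    """
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


settings = PuzzleSFTSettings()
total_steps = 1000  # hypothetical; the real step count depends on the dataset size
for step in (0, total_steps // 2, total_steps):
    print(step, cosine_lr(step, total_steps, settings.learning_rate))
```

Under these assumptions the learning rate starts at the configured 1e−4, passes through roughly half of that at the midpoint, and reaches zero at the final step.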


A.2 Evaluation of the Base Model

We evaluate the base model (open-llama-3B) on the proposed arithmetical puzzle problem. As shown in Table 4 and Table 5, with either few-shot prompting (2-Shot, 8-Shot) or Chain-of-Thought (CoT) prompting, the base model performs poorly on the puzzle. We hypothesize that this is due to the symbolic form of our prompt: the model needs to understand the underlying pattern in order to solve the arithmetical puzzle. Without fine-tuning on the synthetic data, the model may struggle to comprehend this type of prompt.

Table 4: Evaluation of the base model with few-shot and Chain-of-Thought prompting. As expected, the base model performs poorly across all the prompting techniques.

Table 5: An example of Chain-of-Thought prompting and the generated response of the base model.

We further test several open-source (Llama-2-7B (Touvron et al., 2023a), Deepseek-Coder-33B (Guo et al., 2024)) and closed-source models (GPT4 (Achiam et al., 2023)) with few-shot prompting. As shown in Table 6, these models also perform poorly on our benchmarks. In Table 7, we provide an example of the CoT prompting and the generated responses from these models.

Table 6: Evaluation results of Llama-2-7B, Deepseek-Coder-33B, and GPT4 on our proposed benchmarks.

Table 7: An example of few-shot prompting and the generated responses of GPT4, Llama-2-7B, and Deepseek-Coder-33B. We provide the models with two examples before the puzzle. As shown, all of the models fail to solve the given problem. GPT4 seems to understand the requirement of the puzzle, while the other two fail.

As shown in Table 7, Llama-2-7B fails to understand the requirement of the puzzle and just outputs two meaningless equations. Deepseek-Coder-33B treats the second example in the few-shot prompt as the puzzle and repeats the same calculations three times. GPT4 appears to have understood the prompt: it uses every candidate integer exactly once and all the calculations in its response are correct, yet the final solution is wrong. This kind of problem is very challenging, as the model needs to infer the requirement of the puzzle from the provided examples and then figure out the correct solution.
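The three failure modes described above (operands that are not available, wrong arithmetic, and a correct chain that ends on the wrong value) can be separated with a small checker. The sketch below is an illustration, not the paper's evaluation code: the step format `(a, op, b, result)`, the function name, and the treatment of division as exact rational division are all assumptions. The passing example is the worked solution from Figure 5; the failing one is a made-up response.

```python
from collections import Counter
from fractions import Fraction

# Assumed operator set; division is exact (rational), which is an assumption.
OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": lambda a, b: Fraction(a, b) if b != 0 else None,
}


def check_solution(candidates, target, steps):
    """Return (ok, reason) for a list of (a, op, b, result) steps."""
    pool = Counter(candidates)  # multiset of values still available
    for a, op, b, result in steps:
        if op not in OPS:
            return False, f"unknown operator {op!r}"
        needed = Counter([a, b])
        for operand, count in needed.items():
            if pool[operand] < count:
                return False, f"operand {operand} is not available"
        value = OPS[op](a, b)
        if value is None or value != result:
            return False, f"wrong arithmetic in {a} {op} {b} = {result}"
        pool -= needed        # consume both operands
        pool[result] += 1     # the result becomes available for later steps
    remaining = list(pool.elements())
    if remaining != [target]:
        return False, f"leftover values {remaining}, expected [{target}]"
    return True, "ok"


figure5_steps = [(58, "-", 51, 7), (6, "-", 7, -1), (3, "*", -1, -3), (-3, "+", 7, 4)]
print(check_solution([3, 6, 7, 51, 58], 4, figure5_steps))

# Hypothetical response: every step is valid, but the chain ends on 10, not 4.
wrong_target = [(58, "-", 51, 7), (7, "-", 6, 1), (3, "*", 1, 3), (3, "+", 7, 10)]
print(check_solution([3, 6, 7, 51, 58], 4, wrong_target))
```

The second call mirrors the situation reported for GPT4: each integer is consumed once and each equation holds, so only the final comparison against the target fails.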


A.3 Case Study

Figure 4: Cases from the form OOD test dataset. The correct steps are highlighted in green and the incorrect steps in red. Generally speaking, the model fine-tuned with 1M training data performs the worst.


A.4 Visualization of the Proposed Puzzle

Figure 5: Visualization of the proposed arithmetical puzzle. Given the candidate integers 3, 6, 7, 51, 58 and the target integer 4, the answer is 58 − 51 = 7, 6 − 7 = −1, 3 × (−1) = −3, −3 + 7 = 4.
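To give a feel for the search space behind the Figure 5 instance, the following is a brute-force sketch that repeatedly combines two available values until one remains. It is not the authors' data synthesis or solving procedure. The function name is hypothetical, and the rules are assumptions inferred from the example: every candidate is used exactly once, negative intermediate values are allowed, and division is only taken when it is exact.

```python
from itertools import combinations


def solve(numbers, target):
    """Depth-first search over pairwise combinations.

    Returns a list of step strings, or None if no solution exists
    under the assumed rules.
    """
    if len(numbers) == 1:
        return [] if numbers[0] == target else None
    for i, j in combinations(range(len(numbers)), 2):
        a, b = numbers[i], numbers[j]
        rest = [n for k, n in enumerate(numbers) if k not in (i, j)]
        options = [
            (a, "+", b, a + b),
            (a, "-", b, a - b),
            (b, "-", a, b - a),
            (a, "*", b, a * b),
        ]
        # Assumption: only exact integer division is a legal step.
        if b != 0 and a % b == 0:
            options.append((a, "/", b, a // b))
        if a != 0 and b % a == 0:
            options.append((b, "/", a, b // a))
        for x, op, y, value in options:
            tail = solve(rest + [value], target)
            if tail is not None:
                return [f"{x} {op} {y} = {value}"] + tail
    return None


steps = solve([3, 6, 7, 51, 58], 4)
print(steps)
```

The search may return a different valid sequence from the one in the figure, since a puzzle instance can have more than one solution. With five candidates the exhaustive search stays small, but the number of branches grows quickly as candidates are added.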


:::info This paper is available on arXiv under the CC BY-NC-SA 4.0 Deed (Attribution-NonCommercial-ShareAlike 4.0 International) license.

:::
