🟢 GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models

https://arxiv.org/pdf/2410.05229

Key idea:

Do LLMs really know how to do formal reasoning, or are they just pattern-matching?
To test this, the paper builds several variations of GSM8K, a widely used math benchmark.

Variations:

GSM-Symbolic: the names and numbers in each question are changed using a template.
GSM-Symbolic at different difficulty levels:
- GSM-Symbolic-M1: one fewer clause
- GSM-Symbolic-P1: one more clause
- GSM-Symbolic-P2: two more clauses
GSM-NoOp: adds a clause that looks relevant but is actually irrelevant to the answer.
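
To make the variations concrete, here is a minimal sketch of how a templated question could be instantiated. The template wording, the name list, the value ranges, and the no-op clause are all invented for illustration and are not taken from the paper; `make_variant` is a hypothetical helper.

```python
import random

# Hypothetical template: wording and value ranges are made up for illustration.
TEMPLATE = (
    "{name} picks {x} apples on Monday and {y} apples on Tuesday. "
    "{noop}How many apples does {name} have in total?"
)
NAMES = ["Sophie", "Liam", "Mia"]
# A clause that looks relevant but does not change the answer (GSM-NoOp style).
NOOP_CLAUSE = "Five of the apples are a bit smaller than average. "


def make_variant(rng, change_names=True, change_numbers=True, add_noop=False):
    """Instantiate the template and return (question, ground-truth answer)."""
    name = rng.choice(NAMES) if change_names else "Sophie"
    x = rng.randint(5, 50) if change_numbers else 12
    y = rng.randint(5, 50) if change_numbers else 7
    noop = NOOP_CLAUSE if add_noop else ""
    question = TEMPLATE.format(name=name, x=x, y=y, noop=noop)
    # The answer is computed from the sampled values; the no-op clause is ignored.
    return question, x + y


rng = random.Random(0)
question, answer = make_variant(rng, add_noop=True)
print(question, "->", answer)
```

Switching `change_names` and `change_numbers` on and off separately gives the "names only" and "numbers only" settings that the results compare.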

Results:

GSM-Symbolic

Performance dropped for all models.
The change was small when only the names were changed. However, performance dropped significantly when the numbers were changed.

MY CONCERN:

LLMs are innately bad at arithmetic. What we should check is whether a model is making a calculation mistake or a logical mistake.
My guess is that most of the mistakes are calculation mistakes, since performance barely changed when only the names were changed.
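
One way to check this could look like the sketch below. It is not something the paper does. It assumes we can extract from the model's output both the arithmetic expression it set up and the final number it reported; `evaluate` and `classify` are hypothetical helpers. If the expression, evaluated exactly, gives the right answer but the reported number is wrong, the mistake is counted as a calculation mistake; otherwise it is counted as a logical mistake.

```python
import ast
import operator

# Supported binary operators for the exact re-evaluation.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def evaluate(expr):
    """Exactly evaluate an expression that uses only numbers, + - * / and parentheses."""
    def walk(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        raise ValueError("unsupported expression")
    return walk(ast.parse(expr, mode="eval").body)


def classify(model_expr, model_answer, gold_answer, tol=1e-9):
    """Label one model response as correct, a calculation mistake, or a logical mistake."""
    if abs(model_answer - gold_answer) <= tol:
        return "correct"
    # Setup is right but the reported number is wrong: arithmetic slipped.
    if abs(evaluate(model_expr) - gold_answer) <= tol:
        return "calculation mistake"
    # Even exact arithmetic on the model's setup misses the gold answer.
    return "logical mistake"


print(classify("12 + 7", 18, 19))
print(classify("12 * 7", 84, 19))
```

Counting these labels over the GSM-Symbolic variants would show which kind of mistake dominates when the numbers change.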