https://arxiv.org/pdf/2410.05229
Key idea:
- Do LLMs really perform formal reasoning, or are they just pattern-matching?
- To test this, the paper builds several variants of GSM8K, a widely used grade-school math benchmark.
Variations:
- GSM-Symbolic: change the names and numbers using templates.
- GSM8K at different difficulty levels:
  - GSM-Symbolic-M1: one fewer clause
  - GSM-Symbolic-P1: one more clause
  - GSM-Symbolic-P2: two more clauses
- GSM-NoOp: add a clause that looks relevant but is actually irrelevant to the answer.
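
To make the variations concrete, here is a minimal sketch of how a template could generate them. The template text, the name list, the number ranges, and the function `sample_instance` are all my own illustrative assumptions, not the paper's actual templates or code.

```python
import random

# Hypothetical template in the spirit of GSM-Symbolic (not taken from the paper).
# {noop} is a slot for an irrelevant clause, as in GSM-NoOp.
TEMPLATE = (
    "{name} has {x} apples. {name} buys {y} more apples and then gives away "
    "{z} apples. {noop}How many apples does {name} have now?"
)
NAMES = ["Sophie", "Liam", "Mia", "Noah"]  # illustrative name pool
NOOP_CLAUSE = "Three of the apples are smaller than average. "


def sample_instance(rng, change_names=True, change_numbers=True, noop=False):
    """Return (question, answer) for one sampled instance of the template."""
    name = rng.choice(NAMES) if change_names else "Sophie"
    if change_numbers:
        x = rng.randint(5, 50)
        y = rng.randint(1, 20)
        z = rng.randint(1, x + y)  # condition: the answer stays non-negative
    else:
        x, y, z = 12, 7, 5  # fixed "original" values
    question = TEMPLATE.format(
        name=name, x=x, y=y, z=z, noop=NOOP_CLAUSE if noop else ""
    )
    # The irrelevant clause never affects the ground-truth answer.
    answer = x + y - z
    return question, answer
```

Calling `sample_instance(random.Random(0), change_numbers=False)` changes only the name, while `change_names=False` changes only the numbers, which mirrors the names-only versus numbers-only comparison in the results.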

Results:
GSM-Symbolic


- Performance dropped for all models compared with the original GSM8K.
- Changing only the names made little difference. However, performance dropped significantly when the numbers were changed.
MY CONCERN:
- LLMs are innately bad at arithmetic. What we should check is whether they are making calculation mistakes or logical mistakes.
- My guess is that most of their mistakes are calculation mistakes, since performance barely changed when only the names were changed.
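
One rough way to run this check is to re-evaluate every arithmetic step in a model's written solution. The sketch below is my own assumption, not something from the paper: it only recognizes simple steps of the form `a op b = c`, and the function name `classify_error` and its labels are hypothetical.

```python
import re

# Matches simple binary steps such as "12 + 7 = 19" (assumed output format).
NUM = r"-?\d+(?:\.\d+)?"
STEP = re.compile(rf"({NUM})\s*([+\-*/])\s*({NUM})\s*=\s*({NUM})")

OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": lambda a, b: a / b,
}


def classify_error(solution_text, final_answer, gold_answer):
    """Label a solution as 'correct', 'calculation', or 'logical'."""
    if final_answer == gold_answer:
        return "correct"
    for a, op, b, c in STEP.findall(solution_text):
        a, b, c = float(a), float(b), float(c)
        if op == "/" and b == 0:
            return "calculation"  # division by zero counts as a bad step
        if abs(OPS[op](a, b) - c) > 1e-6:
            return "calculation"  # the stated result does not match
    # Every recognized step is arithmetically right, yet the answer is wrong,
    # so the wrong operations or quantities were chosen.
    return "logical"
```

For example, `classify_error("12 + 7 = 20\n20 - 5 = 15", 15, 14)` is labeled a calculation mistake, while `classify_error("12 + 7 = 19\n19 + 5 = 24", 24, 14)` is labeled a logical one. If most failures on the numbers-changed variants fall in the first bucket, that would support the guess above.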