1. Main takeaways:

Run SFT on data that includes self-verification and self-refinement (including the rationale, i.e. the “WHY”) to improve reasoning ability.
Results show that the bottleneck in reasoning is verification, i.e. judging whether the answer is wrong and which part of it is wrong, rather than refinement.

2. Model training


2.1. data collection

For each question, generate answers in a CoT manner.
Collect the solutions that reached the correct answer and those that reached an incorrect answer, and take the Cartesian product of the incorrect and correct sets.
For each pair, the correct solution works as a hint for critiquing the incorrect solution.
Generate the critique (only one iteration in this paper, btw).
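
A minimal sketch of the pairing step above, to make the Cartesian product concrete. The function name, the `(solution, is_correct)` tuple format, and the output keys are all hypothetical and only for illustration; the paper's actual data format is not recorded in these notes.

```python
from itertools import product

def build_critique_pairs(samples):
    """Pair every incorrect solution with every correct one for a single question.

    `samples` is assumed to be a list of (solution_text, is_correct) tuples.
    """
    correct = [text for text, ok in samples if ok]
    incorrect = [text for text, ok in samples if not ok]
    # Cartesian product: each incorrect solution gets each correct solution as a hint.
    return [
        {"incorrect": bad, "hint": good}
        for bad, good in product(incorrect, correct)
    ]
```

If a question has no correct samples or no incorrect samples, the product is empty, so that question contributes no critique pairs.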

2.2. data filtration

Rule-based filtering
1. The number of steps and the number of feedback items should be the same
2. Each step should be copied exactly from the initial solution
3. The feedback for the last step should provide the correct answer
Prompt-based filtering
1. Further check whether the correction led to the correct answer (didn’t fully grasp this yet…)
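
A rough sketch of the three rule-based checks only (the prompt-based check is not covered). Everything about the data layout here is an assumption: `initial_steps` as a list of step strings, `critique` as a list of `{"step", "feedback"}` dicts, and rule 3 approximated by a substring match on the gold answer.

```python
def passes_rule_filter(initial_steps, critique, gold_answer):
    """Return True if a critique passes the three rule-based checks.

    Hypothetical format: critique[i] = {"step": str, "feedback": str}.
    """
    # Rule 1: one feedback item per step of the initial solution.
    if not critique or len(critique) != len(initial_steps):
        return False
    # Rule 2: each quoted step must match the initial solution exactly.
    for original, item in zip(initial_steps, critique):
        if item["step"] != original:
            return False
    # Rule 3 (assumed as a substring match): the last feedback states the correct answer.
    return gold_answer in critique[-1]["feedback"]
```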

2.3. fine-tuning

Refiner fine-tuning

Do standard SFT (cross-entropy based) of the refiner (generator)
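
For reference, a small pure-Python sketch of the cross-entropy objective: the mean negative log-likelihood of the target tokens. The `loss_mask` that excludes prompt tokens is an assumption (a common SFT convention, not something these notes confirm for the paper), and `sft_loss` is a hypothetical name.

```python
import math

def sft_loss(token_logprobs, loss_mask):
    """Mean negative log-likelihood over the tokens selected by loss_mask.

    token_logprobs[t] is the model's log-probability of the reference token at
    position t; loss_mask[t] is 1 for target tokens and 0 for tokens to skip.
    """
    selected = [lp for lp, keep in zip(token_logprobs, loss_mask) if keep]
    if not selected:
        raise ValueError("loss_mask selects no tokens")
    return -sum(selected) / len(selected)

# Example: the first token is treated as prompt (masked out), the second as target.
example_loss = sft_loss([math.log(0.5), math.log(0.25)], [0, 1])
```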