
1. Main takeaways:

2. Model training

2.1. Data collection

  1. For each question, generate answers in a chain-of-thought (CoT) manner.

  2. Split the generated solutions into those that reached the correct answer and those that did not, then form the Cartesian product of the incorrect set and the correct set.

  3. For each pair, the correct solution serves as a hint for critiquing the incorrect solution.

  4. Generate the critique (only one iteration in this paper, btw).

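The pairing in steps 2 and 3 can be sketched roughly as below. This is only an illustration of the Cartesian-product idea, not the paper's code: `build_critique_pairs`, the `is_correct` checker, and the dictionary keys are all hypothetical names made up for these notes.

```python
from itertools import product


def build_critique_pairs(solutions, is_correct):
    """Pair every incorrect solution with every correct one for a single question.

    solutions:  list of CoT solution strings sampled for one question
    is_correct: hypothetical callable returning True if a solution reached
                the right final answer
    """
    correct = [s for s in solutions if is_correct(s)]
    incorrect = [s for s in solutions if not is_correct(s)]
    # Cartesian product: each incorrect solution gets each correct one as a hint.
    return [
        {"incorrect": bad, "hint": good}
        for bad, good in product(incorrect, correct)
    ]


# Toy usage: two correct and two incorrect samples give 2 x 2 = 4 pairs.
samples = ["2+2=4", "2+2=5", "4 because 2+2=4", "2+2=22"]
pairs = build_critique_pairs(samples, is_correct=lambda s: s.endswith("=4"))
```

If a question has no correct sample (or no incorrect one), the product is empty and the question contributes no pairs, which seems to be the natural consequence of this scheme.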

2.2. Data filtration

  1. Rule-based filtering
    1. The number of steps and the number of feedback items should be the same.
    2. Each step should be copied exactly from the initial solution.
    3. The feedback for the last step should provide the correct answer.
  2. Prompt-based filtering
    1. Further check whether the correction led to the correct answer (didn’t fully grasp this yet…)
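
A minimal sketch of the three rule-based checks, to make them concrete. The data layout (a critique as a list of `(quoted_step, feedback)` tuples) and the substring test for "provides the correct answer" are my own assumptions, not something the paper specifies; `passes_rule_filter` is a hypothetical name.

```python
def passes_rule_filter(solution_steps, critique, gold_answer):
    """Return True if a critique survives the three rule-based checks.

    solution_steps: list of step strings from the initial (incorrect) solution
    critique:       list of (quoted_step, feedback) tuples (assumed layout)
    gold_answer:    the correct final answer as a string
    """
    # Rule 1: one feedback item per step.
    if not critique or len(critique) != len(solution_steps):
        return False
    # Rule 2: each quoted step must be an exact copy of the original step.
    for original, (quoted, _feedback) in zip(solution_steps, critique):
        if quoted != original:
            return False
    # Rule 3: the last feedback must contain the correct answer
    # (plain substring match here, which is an assumed simplification).
    last_feedback = critique[-1][1]
    return gold_answer in last_feedback


steps = ["2+2=5", "so the answer is 5"]
critique = [
    ("2+2=5", "Incorrect: 2+2 equals 4."),
    ("so the answer is 5", "The correct answer is 4."),
]
keep = passes_rule_filter(steps, critique, gold_answer="4")
```

The prompt-based filter would run only on critiques that pass this function, since it needs a model call and is presumably the more expensive check.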

2.3. Fine-tuning

  1. Refiner fine-tuning