Main idea:

- Train a critique model that works at the step level (like a PRM, but giving natural-language feedback instead of a score).
- During inference, the generator obtains feedback from the critic at every step (see the sketch after this list).
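
Here is how I picture the inference loop, as a minimal Python sketch. The `generator` and `critic` callables, the prompt format, and the stopping convention are my own assumptions, not the paper's actual interfaces.

```python
from typing import Callable, List

Model = Callable[[str], str]  # maps a prompt string to a completion string


def solve(problem: str, generator: Model, critic: Model, max_steps: int = 8) -> List[str]:
    steps: List[str] = []
    feedback: List[str] = []
    for _ in range(max_steps):
        # Generator conditions on the problem, prior steps, and the prior
        # natural-language feedback from the critic.
        context = problem + "".join(
            f"\nStep: {s}\nCritique: {c}" for s, c in zip(steps, feedback)
        )
        step = generator(context)
        steps.append(step)

        # Critic returns natural-language feedback on the new step
        # (not a scalar score, unlike a standard PRM).
        critique = critic(context + f"\nStep: {step}\nCritique:")
        feedback.append(critique)

        if "final answer" in step.lower():  # assumed stopping convention
            break
    return steps
```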
Training process:

- Train the critic model. Again, note that the critic gives natural-language feedback rather than a scalar score.
- Train the generator model using the critic model's feedback (SFT); a rough sketch of the data setup follows below.
- I don't fully get this part yet: there is no reward to maximize, just token-by-token supervised learning, so what is the point? I'll come back to it if needed; for now I'm just skimming.
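
A rough sketch of how the generator's SFT data could be assembled under this reading: trajectories interleaved with the critic's natural-language feedback, trained with plain token-level cross-entropy. The `Trajectory` fields, the serialization format, and the choice to mask the loss to the step tokens are assumptions on my part, not details from the paper.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Trajectory:
    problem: str
    steps: List[str]      # generator steps kept in the final trajectory
    critiques: List[str]  # critic's natural-language feedback per step


def to_sft_example(traj: Trajectory) -> Tuple[str, List[Tuple[int, int]]]:
    """Serialize a trajectory into one training string, returning the text
    plus character spans of the step tokens, so the loss can be restricted
    to the generator's own outputs (the masking choice is an assumption)."""
    text = traj.problem
    step_spans: List[Tuple[int, int]] = []
    for step, critique in zip(traj.steps, traj.critiques):
        text += "\nStep: "
        start = len(text)
        text += step
        step_spans.append((start, len(text)))
        text += f"\nCritique: {critique}"
    return text, step_spans
```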
Error & Feedback Types
