Main idea:

- Train a critique model that works at the step level (like a PRM, but giving natural-language feedback instead of a score).
- During inference, the generator obtains feedback from the critic at every step (see the sketch after this list).
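
Here is how I picture the inference loop, as a minimal Python sketch. The `generator` and `critic` callables, the prompt format, and the stopping convention are my own assumptions, not the paper's actual interfaces.

```python
from typing import Callable, List

Model = Callable[[str], str]  # maps a prompt string to a completion string


def solve(problem: str, generator: Model, critic: Model, max_steps: int = 8) -> List[str]:
    steps: List[str] = []
    feedback: List[str] = []
    for _ in range(max_steps):
        # Generator conditions on the problem, prior steps, and the prior
        # natural-language feedback from the critic.
        context = problem + "".join(
            f"\nStep: {s}\nCritique: {c}" for s, c in zip(steps, feedback)
        )
        step = generator(context)
        steps.append(step)

        # Critic returns natural-language feedback on the new step
        # (not a scalar score, unlike a standard PRM).
        critique = critic(context + f"\nStep: {step}\nCritique:")
        feedback.append(critique)

        if "final answer" in step.lower():  # assumed stopping convention
            break
    return steps
```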
Training process:

- Train the critic model. Again, note that the critic gives natural-language feedback rather than a scalar score.
- Train the generator model using the critic model's feedback (SFT); a rough sketch of the data setup follows below.
- I don't fully get this part yet: there is no reward to maximize, just token-by-token supervised learning, so what is the point? I'll come back to it if needed; for now I'm just skimming.
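
A rough sketch of how the generator's SFT data could be assembled under this reading: trajectories interleaved with the critic's natural-language feedback, trained with plain token-level cross-entropy. The `Trajectory` fields, the serialization format, and the choice to mask the loss to the step tokens are assumptions on my part, not details from the paper.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Trajectory:
    problem: str
    steps: List[str]      # generator steps kept in the final trajectory
    critiques: List[str]  # critic's natural-language feedback per step


def to_sft_example(traj: Trajectory) -> Tuple[str, List[Tuple[int, int]]]:
    """Serialize a trajectory into one training string, returning the text
    plus character spans of the step tokens, so the loss can be restricted
    to the generator's own outputs (the masking choice is an assumption)."""
    text = traj.problem
    step_spans: List[Tuple[int, int]] = []
    for step, critique in zip(traj.steps, traj.critiques):
        text += "\nStep: "
        start = len(text)
        text += step
        step_spans.append((start, len(text)))
        text += f"\nCritique: {critique}"
    return text, step_spans
```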
Error & Feedback Types
