Main idea:

[Figure: main idea (Screenshot 2024-10-08 at 11.32.02 AM.png)]

Training process:

[Figure: training process (Screenshot 2024-10-08 at 11.32.38 AM.png)]

  1. You train the critic model. Again, note that the critic model gives natural-language feedback instead of a score.
  2. You train the generator model using the critic model (SFT); see the sketch after this list.
    1. I don’t quite get this part though: there is no reward to maximize, just token-by-token supervised learning? What is the point? I’ll come back to it later if needed; I’m just quickly skimming through for now.
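My rough reading of the SFT step (a minimal sketch, not necessarily the paper’s exact recipe): the critic’s natural-language feedback is folded into the prompt and the refined answer is the token-by-token supervision target, so the feedback signal ends up distilled into the training data rather than maximized as a scalar reward. The checkpoint name and the example below are placeholders.

```python
# Hedged sketch of one feedback-conditioned SFT step for the generator.
# Checkpoint, prompt format, and data are placeholders, not the paper's setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")          # placeholder checkpoint
generator = AutoModelForCausalLM.from_pretrained("gpt2")   # placeholder checkpoint

# One hypothetical training example: the critic's natural-language feedback is
# part of the prompt, and the "gold" revised answer is the supervision target.
prompt = (
    "Question: What is 17 * 24?\n"
    "Draft answer: 398\n"
    "Critic feedback: The multiplication is wrong; recompute 17 * 24.\n"
    "Revised answer: "
)
target = "408"

prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
target_ids = tokenizer(target + tokenizer.eos_token, return_tensors="pt").input_ids
input_ids = torch.cat([prompt_ids, target_ids], dim=1)

# Token-by-token cross-entropy, but only on the target span: prompt positions
# are masked with -100 so the loss comes from the revised answer alone.
labels = input_ids.clone()
labels[:, : prompt_ids.shape[1]] = -100

loss = generator(input_ids=input_ids, labels=labels).loss
loss.backward()  # an optimizer step would follow in a real training loop
```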

Error & Feedback Types

[Figure: error and feedback types (Screenshot 2024-10-08 at 11.33.16 AM.png)]