link

Main contributions:

  1. Introduces SORM (Stepwise ORM): essentially a PRM, but trained entirely on synthetic data, with no need for human-annotated step labels.

  2. A new inference recipe: the ORM decides when to refine, the SORM decides where the error is, and global / local refinement decides how to fix it (see the sketch under "Global & Local refinement" below).

    [figure: overview of the when / where / how refinement pipeline]

How to train a SORM:

[figure: SORM training procedure]

Global & Local refinement:

[figure: global vs. local refinement]

Takeaways:

  1. Refinement never worked well without the reward model's intervention. Why? The RISE paper worked well without one, so what was the difference?
  2. SORM worked pretty well, implying we probably won't need human-annotated data to train a PRM.