Introduces SORM: essentially a PRM, but trained on synthetic data, so no human-annotated step labels are needed.
New inference procedure: use the ORM to decide when to correct, the SORM to decide where to improve, and global / local refinement to decide how to improve.
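A minimal sketch of this when / where / how loop. The helpers (generate_draft, orm_score, sorm_step_scores, global_refine, local_refine) are hypothetical stand-ins, and letting the ORM pick among the draft and its refinements is an assumption, not the paper's exact procedure:

```python
# Sketch of the inference loop; all helper functions are hypothetical stand-ins.

def refine_answer(question, threshold=0.5):
    draft = generate_draft(question)                 # initial solution A_D

    # WHEN: the ORM estimates whether the draft is already correct.
    if orm_score(question, draft) >= threshold:
        return draft

    # WHERE: the SORM scores each step; the first low-scoring step gives the error location E.
    step_scores = sorm_step_scores(question, draft)  # one score per step S1..Sn
    first_error = next((i for i, s in enumerate(step_scores) if s < threshold), None)

    # HOW: propose a global refinement and, if an error location was found, a local one,
    # then let the ORM choose among the draft and the refinements.
    candidates = [draft, global_refine(question, draft)]
    if first_error is not None:
        candidates.append(local_refine(question, draft, first_error))
    return max(candidates, key=lambda a: orm_score(question, a))
```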
How to train a SORM:
Data Generation for SORM: The goal is to label intermediate steps in a way that approximates the optimal policy π*, i.e., whether an ideal reasoner could still reach the correct final answer from that point:
Step-by-Step Evaluation: For each step Si in a model-generated solution, they use rejection sampling:
They sample the student model K times starting from the prefix of steps Pi=(S1,...,Si). These samples are called verifying traces T1,...,TK.
Each trace is checked to see if it eventually leads to the correct final answer.
Labeling: Each step Si is labeled based on its verifying traces (see the code sketch after this list):
If at least one of these verifying traces leads to the correct final answer, the step Si is labeled as positive (correct).
If none of the traces lead to the correct final answer, the step Si is labeled as negative (incorrect).
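A minimal sketch of this labeling procedure. Here sample_completion (continue a solution from a prefix with the student model) and is_correct_final_answer (compare a trace's final answer to the ground truth) are hypothetical stand-ins, and K=8 is just an illustrative value:

```python
def label_steps(question, steps, ground_truth, K=8):
    """Label each step Si as positive if any of K sampled continuations
    from the prefix Pi = (S1, ..., Si) reaches the correct final answer."""
    labels = []
    for i in range(1, len(steps) + 1):
        prefix = steps[:i]                          # Pi
        verifying_traces = [
            sample_completion(question, prefix)     # T1, ..., TK
            for _ in range(K)
        ]
        labels.append(any(is_correct_final_answer(t, ground_truth) for t in verifying_traces))
    return labels
```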
Post-Processing of SORM Data (a code sketch follows this list):
Positive Propagation: If a step Si is labeled as positive, all preceding steps (Sj for j≤i) are also labeled as positive. This accounts for cases where the model can solve a problem starting from a certain step but not earlier ones.
Consistency Constraint: They enforce that intermediate results Ri computed at each step Si must be used later in the solution to avoid false positives. In practice, they implement this by checking that each result Ri appears in the suffix following Pi.
Balancing Labels: They balance the number of positive and negative labels for each prefix length to avoid bias. Without this step, there could be an imbalance where early steps tend to be labeled positive, and later steps tend to be labeled negative. This balancing prevents the SORM from simply predicting positive for early steps and negative for later ones.
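A sketch of the three post-processing rules. The data representation (per-step labels as a Python list, examples as dicts with prefix_len and label fields, intermediate results and suffixes as strings) is an assumption for illustration:

```python
import random
from collections import defaultdict

def propagate_positives(labels):
    """Positive propagation: if Si is positive, every earlier step is positive too."""
    last_pos = max((i for i, lab in enumerate(labels) if lab), default=-1)
    return [i <= last_pos for i in range(len(labels))]

def passes_consistency(result_i, suffix_text):
    """Consistency constraint: keep a positive Si only if its intermediate result Ri
    actually reappears in the suffix after Pi (guards against false positives)."""
    return result_i in suffix_text

def balance_by_prefix_length(examples):
    """Balancing: downsample so each prefix length has equally many positive and
    negative labels, so the SORM cannot simply key on step position."""
    buckets = defaultdict(lambda: {True: [], False: []})
    for ex in examples:
        buckets[ex["prefix_len"]][ex["label"]].append(ex)
    balanced = []
    for by_label in buckets.values():
        n = min(len(by_label[True]), len(by_label[False]))
        balanced += random.sample(by_label[True], n) + random.sample(by_label[False], n)
    return balanced
```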
Note (Section G): they checked the false positive and false negative rates of the synthetic labels; the results were acceptable.
Global & Local refinement
Global:
Input: the question Q and the initial draft solution A_D.
Output: a new, refined solution A_R that attempts to fix any errors in the draft.
Local:
Input: the question Q, the initial draft A_D, and an error indicator E marking the location of the first mistake; this tells the model which part of the solution to refine.
Output: a refined solution A_R that corrects the mistake at the indicated location and continues from there.
Global Refinement: The global model learns to take an incorrect draft and generate a corrected version. During training, it is optimized to produce a solution that resembles the correct reference answer A_correct.
Local Refinement: The local model learns to use the error location E to focus its refinement on correcting specific parts of the solution. It is optimized to transform the initial draft into a version that avoids the identified mistake and continues toward the correct final answer.
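A sketch of how training examples for the two refinement models could be formatted; the prompt tags and the encoding of E as a step index are illustrative assumptions, not the paper's exact template:

```python
def make_global_example(question, draft, correct_solution):
    # Global refinement: (Q, A_D) -> A_correct; no error location is given.
    prompt = f"[QUESTION] {question}\n[DRAFT] {draft}\n[REFINE]"
    return {"prompt": prompt, "target": correct_solution}

def make_local_example(question, draft, first_error_step, correct_solution):
    # Local refinement: (Q, A_D, E) -> A_correct; E marks the first incorrect step,
    # so the model should keep the prefix before E and rewrite from the error onward.
    prompt = (
        f"[QUESTION] {question}\n[DRAFT] {draft}\n"
        f"[FIRST ERROR AT STEP] {first_error_step}\n[REFINE]"
    )
    return {"prompt": prompt, "target": correct_solution}
```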
Takeaways:
Refinement never worked well without the reward models' intervention. Why? Why did the RISE paper's approach work well while this did not? What was the difference?
SORM worked quite well, suggesting that human-annotated data may not be necessary to train a PRM.