link
1. Introduction / Problems to solve
- Multiple papers have tried to improve reasoning by pairing the model with separate models (e.g., an ORM or PRM verifier), but the goal here is a single model that does multi-turn reasoning and self-correction by itself.
- One way to do this is SFT on correction trajectories. However, this doesn't work that well because:
    - The data is not from the learner → it won't contain the mistakes the model itself actually makes (offline imitation rather than online RL or bootstrapping).
    - The demonstration trajectories might not contain a correction process that is actually meaningful (this point feels a bit vague).
2. RISE methodology
2.1. Overview:

- Basic concept / notation (small example after this list):
    - x: the question
    - y_i: the i-th answer attempt
    - f_i: feedback after the i-th attempt (e.g., "The answer is wrong. Try again.")
    - trajectory: x → y_1 → f_1 → y_2 → f_2 → …
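A minimal sketch of what one such trajectory could look like as a chat message list; the roles, feedback string, and the arithmetic example are my own illustration, not from the paper.

```python
# One RISE-style trajectory written as a chat message list.
# Roles, feedback text, and the arithmetic are illustrative assumptions.
trajectory = [
    {"role": "user", "content": "What is 17 * 24?"},                 # question x
    {"role": "assistant", "content": "398"},                         # first attempt y_1 (wrong)
    {"role": "user", "content": "The answer is wrong. Try again."},  # feedback f_1
    {"role": "assistant", "content": "408"},                         # revised attempt y_2 (correct)
    # ... continues as x -> y_1 -> f_1 -> y_2 -> f_2 -> ...
]
```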
2.2. Data collection:
- Data is collected by continuously bootstrapping from the current model's own rollouts.
- Note that there are two different ways to obtain the improved responses (sketch after this list):
    - Distillation: query a more powerful (teacher) model for the improved answer.
    - Self-distillation: use an ensemble of the model's own samples and keep a better one.
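A hedged sketch of how an improved response could be produced in either mode; `learner`, `teacher`, `is_correct`, and the sample count are hypothetical names and parameters, not from the notes, and the exact RISE procedure may differ.

```python
import random

def improved_response(question, context, learner, teacher=None,
                      is_correct=None, n_samples=8):
    """Sketch: build an improved target response for one turn of a trajectory.

    `learner`, `teacher`, and `is_correct` are assumed callables.
    """
    if teacher is not None:
        # Distillation: ask a more powerful model for the improved answer.
        return teacher(question, context)
    # Self-distillation: sample an ensemble of the learner's own responses
    # and keep one that the answer checker marks as correct, if any.
    candidates = [learner(question, context) for _ in range(n_samples)]
    good = [c for c in candidates if is_correct(question, c)]
    return random.choice(good) if good else random.choice(candidates)
```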

2.3. Training
The training objective is reward-weighted regression (RWR); a sketch of its general form is shown below.
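A hedged sketch of a reward-weighted regression objective written in the notation above; the exact reward r, temperature τ, and conditioning used in the paper are assumptions here, not taken from these notes.

```latex
\max_{\theta}\;
\mathbb{E}_{(x,\,y_{1:T},\,f_{1:T-1}) \sim \mathcal{D}}
\left[
  \sum_{i=1}^{T}
  \exp\!\big(r(y_i)/\tau\big)\,
  \log \pi_\theta\big(y_i \mid x,\, y_{1:i-1},\, f_{1:i-1}\big)
\right]
```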

2.4. Inference

- Two types of inference (sketch after this list):
    - With oracle: use an ORM to check the validity of each attempt (note: relying on an external verifier isn't really self-improvement).
    - Without oracle: run a fixed number of turns and take a majority vote over the candidate answers.
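A minimal sketch of the oracle-free, majority-vote mode, assuming a hypothetical `model(prompt)` callable that returns an answer string; the feedback text and turn count are assumptions.

```python
from collections import Counter

def rollout(model, question, num_turns=3):
    """Roll out a multi-turn trajectory x -> y_1 -> f_1 -> ... without an oracle."""
    history = [f"user: {question}"]
    answers = []
    for _ in range(num_turns):
        answer = model("\n".join(history))      # y_i
        answers.append(answer)
        history.append(f"assistant: {answer}")
        # Generic feedback f_i (no correctness information is revealed).
        history.append("user: The answer might be wrong. Try again.")
    return answers

def majority_vote(answers):
    """Oracle-free inference: return the most common answer across turns."""
    return Counter(answers).most_common(1)[0][0]
```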