link
1. Introduction / Problems to solve
- Multiple papers have tried to improve reasoning by pairing the model with separate models (e.g., an ORM or PRM verifier), but the goal here is a single model that does multi-turn reasoning and self-correction by itself.
- One way to do this is SFT on correction trajectories. However, this doesn't work that well because:
    - The data is not from the learner → it won't contain the mistakes the model itself actually makes (offline imitation rather than online RL or bootstrapping).
    - The demonstration trajectories might not contain a correction process that is actually meaningful (this point feels a bit vague).
2. RISE methodology
2.1. Overview:

- Basic concept / notation (small example after this list):
    - x: the question
    - y_i: the i-th answer attempt
    - f_i: feedback after the i-th attempt (e.g., "The answer is wrong. Try again.")
    - trajectory: x → y_1 → f_1 → y_2 → f_2 → …
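A minimal sketch of what one such trajectory could look like as a chat message list; the roles, feedback string, and the arithmetic example are my own illustration, not from the paper.

```python
# One RISE-style trajectory written as a chat message list.
# Roles, feedback text, and the arithmetic are illustrative assumptions.
trajectory = [
    {"role": "user", "content": "What is 17 * 24?"},                 # question x
    {"role": "assistant", "content": "398"},                         # first attempt y_1 (wrong)
    {"role": "user", "content": "The answer is wrong. Try again."},  # feedback f_1
    {"role": "assistant", "content": "408"},                         # revised attempt y_2 (correct)
    # ... continues as x -> y_1 -> f_1 -> y_2 -> f_2 -> ...
]
```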
2.2. Data collection:
- Data is collected by continuously bootstrapping from the current model's own rollouts.
- Note that there are two different ways to obtain the improved responses (sketch after this list):
    - Distillation: query a more powerful (teacher) model for the improved answer.
    - Self-distillation: use an ensemble of the model's own samples and keep a better one.
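A hedged sketch of how an improved response could be produced in either mode; `learner`, `teacher`, `is_correct`, and the sample count are hypothetical names and parameters, not from the notes, and the exact RISE procedure may differ.

```python
import random

def improved_response(question, context, learner, teacher=None,
                      is_correct=None, n_samples=8):
    """Sketch: build an improved target response for one turn of a trajectory.

    `learner`, `teacher`, and `is_correct` are assumed callables.
    """
    if teacher is not None:
        # Distillation: ask a more powerful model for the improved answer.
        return teacher(question, context)
    # Self-distillation: sample an ensemble of the learner's own responses
    # and keep one that the answer checker marks as correct, if any.
    candidates = [learner(question, context) for _ in range(n_samples)]
    good = [c for c in candidates if is_correct(question, c)]
    return random.choice(good) if good else random.choice(candidates)
```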

2.3. Training
The training objective is reward-weighted regression (RWR); a sketch of its general form is shown below.
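A hedged sketch of a reward-weighted regression objective written in the notation above; the exact reward r, temperature τ, and conditioning used in the paper are assumptions here, not taken from these notes.

```latex
\max_{\theta}\;
\mathbb{E}_{(x,\,y_{1:T},\,f_{1:T-1}) \sim \mathcal{D}}
\left[
  \sum_{i=1}^{T}
  \exp\!\big(r(y_i)/\tau\big)\,
  \log \pi_\theta\big(y_i \mid x,\, y_{1:i-1},\, f_{1:i-1}\big)
\right]
```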

2.4. Inference

- Two types of inference (sketch after this list):
    - With oracle: use an ORM to check the validity of each attempt (note: relying on an external verifier isn't really self-improvement).
    - Without oracle: run a fixed number of turns and take a majority vote over the candidate answers.
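A minimal sketch of the oracle-free, majority-vote mode, assuming a hypothetical `model(prompt)` callable that returns an answer string; the feedback text and turn count are assumptions.

```python
from collections import Counter

def rollout(model, question, num_turns=3):
    """Roll out a multi-turn trajectory x -> y_1 -> f_1 -> ... without an oracle."""
    history = [f"user: {question}"]
    answers = []
    for _ in range(num_turns):
        answer = model("\n".join(history))      # y_i
        answers.append(answer)
        history.append(f"assistant: {answer}")
        # Generic feedback f_i (no correctness information is revealed).
        history.append("user: The answer might be wrong. Try again.")
    return answers

def majority_vote(answers):
    """Oracle-free inference: return the most common answer across turns."""
    return Counter(answers).most_common(1)[0][0]
```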