1. Key Idea:
Propose ReST-MCTS*:
- MCTS*: performs a tree search with a sufficient rollout budget under the guidance of the PRM.
- PRM: evaluates the quality of any partial solution and guides MCTS*.
- Policy Model: the LLM that generates multiple candidate intermediate reasoning steps for each question.
- LLM Self-Training: uses MCTS* to collect reasoning traces, trains the policy model on positive samples (SFT), and trains the process reward model on all generated traces.
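
The outer loop can be pictured roughly as below. This is a minimal Python sketch, not the paper's implementation; `Trace`, `run_mcts_star`, `sft_update`, and `prm_update` are placeholder names standing in for the search and training routines.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Trace:
    steps: List[str]     # the reasoning steps of one solution path
    is_correct: bool     # whether the final answer was verified correct

def self_train(policy, prm,
               questions: List[str],
               run_mcts_star: Callable,   # (policy, prm, question) -> List[Trace]
               sft_update: Callable,      # fine-tunes the policy on positive traces
               prm_update: Callable,      # trains the PRM on all generated traces
               num_iterations: int = 3):
    """Illustrative self-training loop: search, filter, retrain."""
    for _ in range(num_iterations):
        positive, everything = [], []
        for q in questions:
            # MCTS* expands reasoning steps with the policy and scores
            # partial solutions with the PRM during search.
            traces = run_mcts_star(policy, prm, q)
            everything.extend(traces)
            # Keep only traces whose final answer is verified correct.
            positive.extend(t for t in traces if t.is_correct)
        policy = sft_update(policy, positive)    # SFT on positive samples
        prm = prm_update(prm, everything)        # PRM trained on all traces
    return policy, prm
```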
2. Different rewards and values used in the paper
- m_k: the reasoning distance for step k.
    - Estimated by sampling multiple traces and taking the minimum number of steps actually used to reach the correct answer.
- v_k: the quality value associated with the partial solution at step k.
    - A CUMULATIVE measure of reasoning quality.
- r_sk: the process reward produced by the PRM.
    - Assesses the correctness of an individual step.
- w_sk: the weighted reward for the k-th step.
    - Integrates the quality value from the previous steps with the predicted correctness of the current step.
    - Designed to capture the step's contribution to the overall reasoning process while incorporating the reasoning distance.

- I haven't fully understood this yet, but w_sk might be better thought of as something like an exploration term; one illustrative reading is sketched below.
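
For intuition only, here is one plausible way v_k, r_sk, and m_k could combine; this functional form is my assumption, not the paper's exact equation. It just encodes the properties listed above: the value is cumulative, each step's reward is scaled by the remaining headroom in the value, and steps taken close to the correct answer (small m_k) get more credit.

```python
def weighted_reward(v_prev: float, r_sk: float, m_k: int) -> float:
    # Illustrative form (assumed, not from the paper): scale the PRM step
    # reward by the headroom left in the quality value and by the inverse
    # reasoning distance, so near-answer steps contribute more.
    return (1.0 - v_prev) * r_sk / max(m_k, 1)

def update_quality_value(v_prev: float, r_sk: float, m_k: int) -> float:
    # The quality value accumulates weighted rewards and stays within [0, 1]
    # because each increment is at most (1 - v_prev).
    return v_prev + weighted_reward(v_prev, r_sk, m_k)

# Example: three steps with PRM rewards 0.9, 0.8, 1.0 and distances 3, 2, 1.
v = 0.0
for r, m in [(0.9, 3), (0.8, 2), (1.0, 1)]:
    v = update_quality_value(v, r, m)
```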

3. Self-training Algorithm

Step 1: Initialize policy/LLM and PRM.
Step 2: MCTS* execution with the policy and value (PRM) models.
- Use the policy model to generate candidate reasoning paths.
- MCTS* uses the policy model to expand nodes, each representing a reasoning step along a possible solution path; a rough sketch of one expansion round is given below.
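
A minimal sketch of one PRM-guided expansion round, assuming a greedy descend-to-best-leaf selection rule for simplicity (the paper's actual MCTS* selection rule may differ); `Node`, `propose_steps`, and `score_partial` are illustrative placeholders.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Node:
    steps: List[str]                           # partial reasoning trace so far
    value: float = 0.0                         # quality value of this partial solution
    children: List["Node"] = field(default_factory=list)

def expand_best_leaf(root: Node,
                     propose_steps: Callable[[List[str]], List[str]],  # policy model
                     score_partial: Callable[[List[str]], float]):     # PRM-derived value
    # Greedily descend to the highest-value leaf (illustrative selection rule).
    node = root
    while node.children:
        node = max(node.children, key=lambda c: c.value)
    # The policy proposes several candidate next reasoning steps; each extended
    # partial solution is scored by the value model to guide later selection.
    for step in propose_steps(node.steps):
        child_steps = node.steps + [step]
        node.children.append(Node(child_steps, value=score_partial(child_steps)))
    return node
```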