1. Key Idea:
Propose ReST-MCTS*:
- MCTS*: performs a tree search with a sufficient rollout budget under the guidance of the PRM.
- PRM: evaluates the quality of any partial solution and guides MCTS*.
- Policy Model: the LLM that generates multiple candidate intermediate reasoning steps for each question.
- LLM Self-Training: uses MCTS* to collect reasoning traces, trains the policy model on positive samples (SFT), and trains the process reward model on all generated traces.
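
The outer loop can be pictured roughly as below. This is a minimal Python sketch, not the paper's implementation; `Trace`, `run_mcts_star`, `sft_update`, and `prm_update` are placeholder names standing in for the search and training routines.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Trace:
    steps: List[str]     # the reasoning steps of one solution path
    is_correct: bool     # whether the final answer was verified correct

def self_train(policy, prm,
               questions: List[str],
               run_mcts_star: Callable,   # (policy, prm, question) -> List[Trace]
               sft_update: Callable,      # fine-tunes the policy on positive traces
               prm_update: Callable,      # trains the PRM on all generated traces
               num_iterations: int = 3):
    """Illustrative self-training loop: search, filter, retrain."""
    for _ in range(num_iterations):
        positive, everything = [], []
        for q in questions:
            # MCTS* expands reasoning steps with the policy and scores
            # partial solutions with the PRM during search.
            traces = run_mcts_star(policy, prm, q)
            everything.extend(traces)
            # Keep only traces whose final answer is verified correct.
            positive.extend(t for t in traces if t.is_correct)
        policy = sft_update(policy, positive)    # SFT on positive samples
        prm = prm_update(prm, everything)        # PRM trained on all traces
    return policy, prm
```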
2. Different rewards and values used in the paper
- m_k: the reasoning distance for step k.
    - Estimated by sampling multiple traces and taking the minimum number of steps actually used to reach the correct answer.
- v_k: the quality value associated with the partial solution at step k.
    - A CUMULATIVE measure of reasoning quality.
- r_sk: the process reward produced by the PRM.
    - Assesses the correctness of an individual step.
- w_sk: the weighted reward for the k-th step.
    - Integrates the quality value from the previous steps with the predicted correctness of the current step.
    - Designed to capture the step's contribution to the overall reasoning process while incorporating the reasoning distance.

- I haven't fully understood this yet, but w_sk might be better thought of as something like an exploration term; one illustrative reading is sketched below.
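
For intuition only, here is one plausible way v_k, r_sk, and m_k could combine; this functional form is my assumption, not the paper's exact equation. It just encodes the properties listed above: the value is cumulative, each step's reward is scaled by the remaining headroom in the value, and steps taken close to the correct answer (small m_k) get more credit.

```python
def weighted_reward(v_prev: float, r_sk: float, m_k: int) -> float:
    # Illustrative form (assumed, not from the paper): scale the PRM step
    # reward by the headroom left in the quality value and by the inverse
    # reasoning distance, so near-answer steps contribute more.
    return (1.0 - v_prev) * r_sk / max(m_k, 1)

def update_quality_value(v_prev: float, r_sk: float, m_k: int) -> float:
    # The quality value accumulates weighted rewards and stays within [0, 1]
    # because each increment is at most (1 - v_prev).
    return v_prev + weighted_reward(v_prev, r_sk, m_k)

# Example: three steps with PRM rewards 0.9, 0.8, 1.0 and distances 3, 2, 1.
v = 0.0
for r, m in [(0.9, 3), (0.8, 2), (1.0, 1)]:
    v = update_quality_value(v, r, m)
```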

3. Self-training Algorithm

Step 1: Initialize policy/LLM and PRM.
Step 2: MCTS* execution with the policy and value (PRM) models.
- Use the policy model to generate candidate reasoning paths.
- MCTS* uses the policy model to expand nodes, each representing a reasoning step along a possible solution path; a rough sketch of one expansion round is given below.
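
A minimal sketch of one PRM-guided expansion round, assuming a greedy descend-to-best-leaf selection rule for simplicity (the paper's actual MCTS* selection rule may differ); `Node`, `propose_steps`, and `score_partial` are illustrative placeholders.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Node:
    steps: List[str]                           # partial reasoning trace so far
    value: float = 0.0                         # quality value of this partial solution
    children: List["Node"] = field(default_factory=list)

def expand_best_leaf(root: Node,
                     propose_steps: Callable[[List[str]], List[str]],  # policy model
                     score_partial: Callable[[List[str]], float]):     # PRM-derived value
    # Greedily descend to the highest-value leaf (illustrative selection rule).
    node = root
    while node.children:
        node = max(node.children, key=lambda c: c.value)
    # The policy proposes several candidate next reasoning steps; each extended
    # partial solution is scored by the value model to guide later selection.
    for step in propose_steps(node.steps):
        child_steps = node.steps + [step]
        node.children.append(Node(child_steps, value=score_partial(child_steps)))
    return node
```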