1. Key Idea:

Propose ReST-MCTS*:

2. Different rewards and values used in the paper

3. Self-training Algorithm

Step 1: Initialize the policy (the LLM) and the process reward model (PRM).

Step 2: MCTS Execution with policy and value model.

  1. Use the policy model to generate reasoning paths