arxiv link

How do current approaches instill self-correction?

This paper needs only one model and corrects errors intrinsically, without any external feedback.
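
To make that concrete, here is a minimal sketch (my own, not the paper's code) of what intrinsic self-correction looks like at inference time: the same model produces a first attempt and is then prompted to revise it, with no verifier, tool, or human in the loop. The `generate` wrapper and the prompts are placeholders.

```python
def generate(prompt: str) -> str:
    """Placeholder for a single call to the underlying LLM."""
    raise NotImplementedError


def self_correct(problem: str) -> tuple[str, str]:
    # Turn 1: the model produces its first attempt.
    first = generate(f"Problem: {problem}\nSolve it step by step.")
    # Turn 2: the *same* model is asked to revise its own answer,
    # with no verifier, tool, or human feedback involved.
    second = generate(
        f"Problem: {problem}\n"
        f"Your previous attempt:\n{first}\n"
        "There may be an error in the attempt above. "
        "Review it and give a corrected solution."
    )
    return first, second
```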

Problems with Supervised Fine-Tuning (SFT)

SFT approaches, which fine-tune on data collected from the base model itself, scale well on single-turn reasoning problems.

These methods improve self-correction over the base model, but they fail to achieve a substantially positive self-correction gain (second-attempt accuracy minus first-attempt accuracy).
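
As a rough reconstruction (my own sketch, not the paper's code), the SFT recipe in question and the gain it is judged by look roughly like this; `generate` and `is_correct` are hypothetical helpers for one LLM call and one answer checker.

```python
def collect_correction_traces(problems, generate, is_correct):
    """Collect (problem, wrong attempt, corrected attempt) triples from the base model."""
    traces = []
    for problem in problems:
        first = generate(problem)                   # attempt 1 from the base model
        second = generate(problem, previous=first)  # attempt 2, prompted to revise
        # Keep only traces where the revision actually fixes a wrong first attempt;
        # these become the SFT targets.
        if not is_correct(problem, first) and is_correct(problem, second):
            traces.append((problem, first, second))
    return traces


def self_correction_gain(problems, generate, is_correct):
    """Second-attempt accuracy minus first-attempt accuracy."""
    acc1 = acc2 = 0
    for problem in problems:
        first = generate(problem)
        second = generate(problem, previous=first)
        acc1 += is_correct(problem, first)
        acc2 += is_correct(problem, second)
    return (acc2 - acc1) / len(problems)
```

The failure described above is that, after fine-tuning on such traces, this gain stays close to zero rather than becoming substantially positive.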

Thought: But their goal isn't self-correction, is it? What if the fine-tuned base model already gives good enough accuracy?

Our within-turn approach might even work better: we train on good data, but when the LLM eventually heads in a wrong direction, we might still be able to steer it toward the right answer.

Current attempts to fine-tune an LLM for self-correction fail in these three aspects:

[image: the three failure modes]