arxiv link

How do current approaches instill self-correction?

This paper needs only one model and corrects errors intrinsically, without any external feedback.
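
To make that concrete, here is a minimal sketch (my own, not the paper's code) of what intrinsic self-correction looks like at inference time: the same model produces a first attempt and is then prompted to revise it, with no verifier, tool, or human in the loop. The `generate` wrapper and the prompts are placeholders.

```python
def generate(prompt: str) -> str:
    """Placeholder for a single call to the underlying LLM."""
    raise NotImplementedError


def self_correct(problem: str) -> tuple[str, str]:
    # Turn 1: the model produces its first attempt.
    first = generate(f"Problem: {problem}\nSolve it step by step.")
    # Turn 2: the *same* model is asked to revise its own answer,
    # with no verifier, tool, or human feedback involved.
    second = generate(
        f"Problem: {problem}\n"
        f"Your previous attempt:\n{first}\n"
        "There may be an error in the attempt above. "
        "Review it and give a corrected solution."
    )
    return first, second
```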

Problems with Supervised Fine-Tuning (SFT)

SFT approaches, which fine-tune on data collected from the base model itself, scale well on single-turn reasoning problems.

These methods improve self-correction over the base model, but they fail to achieve a substantially positive self-correction gain (second-attempt accuracy minus first-attempt accuracy).
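
As a rough reconstruction (my own sketch, not the paper's code), the SFT recipe in question and the gain it is judged by look roughly like this; `generate` and `is_correct` are hypothetical helpers for one LLM call and one answer checker.

```python
def collect_correction_traces(problems, generate, is_correct):
    """Collect (problem, wrong attempt, corrected attempt) triples from the base model."""
    traces = []
    for problem in problems:
        first = generate(problem)                   # attempt 1 from the base model
        second = generate(problem, previous=first)  # attempt 2, prompted to revise
        # Keep only traces where the revision actually fixes a wrong first attempt;
        # these become the SFT targets.
        if not is_correct(problem, first) and is_correct(problem, second):
            traces.append((problem, first, second))
    return traces


def self_correction_gain(problems, generate, is_correct):
    """Second-attempt accuracy minus first-attempt accuracy."""
    acc1 = acc2 = 0
    for problem in problems:
        first = generate(problem)
        second = generate(problem, previous=first)
        acc1 += is_correct(problem, first)
        acc2 += is_correct(problem, second)
    return (acc2 - acc1) / len(problems)
```

The failure described above is that, after fine-tuning on such traces, this gain stays close to zero rather than becoming substantially positive.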

Thought: But their goal isn't self-correction, is it? What if the fine-tuned base model already gives good enough accuracy?

Our within-turn approach might even work better: we train on good data, but when the LLM eventually heads in a wrong direction, we might still be able to steer it toward the right answer.

Current attempts to fine-tune an LLM for self-correction fail in these three aspects:

[image: the three failure modes]