This paper needs only one model and corrects errors intrinsically, without external feedback.
SFT approaches that fine-tune on data collected from a base model scale well on single-turn reasoning problems.
These methods improve self-correction over the base model, but they fail to achieve a substantially positive self-correction gain (i.e., the accuracy improvement from the first attempt to the revised attempt stays near zero).
Thought: But their goal isn’t self-correction, is it? What if the fine-tuned base model already gives good enough accuracy?
Our within-turn approach might even work better. We train on good data, but when the LLM eventually heads in a wrong direction, we may still be able to steer it toward the right answer.
Current attempts to fine-tune an LLM for self-correction fail in these 3 respects:
Collapse / minor edits: The model sticks with its initial response and makes only minor changes.
Mainly visible in the + datasets, which include correct-to-correct trajectories.
The expectation was that these would reduce the cases where a correct answer is wrongly changed to an incorrect one.
No concrete explanation is provided; it seems that SFT induces a bias toward not changing the response at all.

Looking at edit distances between the two responses, SFT models tend to keep the revision close to the initial response rather than making substantial changes.
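A quick way to quantify this collapse is to compare each revision against its first attempt. A minimal sketch, where the character-level difflib similarity and the 0.9 threshold are my own illustrative choices, not the paper's metric:

```python
import difflib

def edit_similarity(first_attempt: str, revision: str) -> float:
    """Normalized similarity in [0, 1]; 1.0 means the revision is identical to the first attempt."""
    return difflib.SequenceMatcher(None, first_attempt, revision).ratio()

def collapse_rate(pairs, threshold: float = 0.9) -> float:
    """Fraction of (first attempt, revision) pairs that are near-identical,
    i.e. cases where the model effectively refused to edit its answer."""
    flags = [edit_similarity(a, b) >= threshold for a, b in pairs]
    return sum(flags) / max(len(flags), 1)

# Toy usage: one near-identical "revision" and one genuine rewrite.
pairs = [
    ("The answer is 42 because 6*7=42.", "The answer is 42 because 6 * 7 = 42."),
    ("The answer is 40.", "Recomputing: 6*7 = 42, so the answer is 42."),
]
print(collapse_rate(pairs))  # 0.5: only the first pair counts as a minor edit
```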

Lack of independence: Some methods need external knowledge sources, such as a verifier or outcome/process reward models.
Distributional shift: The training data is not generated by the LLM itself, so the model is fine-tuned on a distribution of mistakes different from its own.
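To make the shift concrete, the sketch below contrasts building correction pairs from the model's own first attempts (on-policy, little shift) versus from some other source (off-policy, the shift described above). All function names and the toy samplers are hypothetical:

```python
from typing import Callable, List, Tuple

Sampler = Callable[[str], str]  # maps a problem to a first-attempt response

def build_correction_pairs(
    problems: List[str],
    first_turn_sampler: Sampler,
    make_revision: Callable[[str, str], str],
) -> List[Tuple[str, str, str]]:
    """(problem, first attempt, target revision) triples for correction-style SFT.

    If first_turn_sampler is the model being fine-tuned, the data is on-policy:
    the mistakes it learns to fix are its own. If the first attempts come from
    another model or a static dataset, the fine-tuned model must revise errors
    it would never produce itself (the distributional shift noted above).
    """
    triples = []
    for p in problems:
        attempt = first_turn_sampler(p)
        triples.append((p, attempt, make_revision(p, attempt)))
    return triples

# Toy stand-ins (hypothetical, for illustration only).
problems = ["What is 6 * 7?"]
own_model = lambda p: "6 * 7 = 36"           # the LLM's own style of mistake
other_model = lambda p: "I think it is 48."  # someone else's style of mistake
fix = lambda p, a: "Rechecking: 6 * 7 = 42."

on_policy = build_correction_pairs(problems, own_model, fix)
off_policy = build_correction_pairs(problems, other_model, fix)
print(on_policy[0], off_policy[0], sep="\n")
```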
