link
Core Idea:
- Problem: Recent SOTA models are pre-trained on massive corpora and then fine-tuned with large amounts of human-feedback data (RLHF). This requires extensive human annotation as well as substantial compute.
- Solution: Instead of scaling quantity, scale quality - fine-tune on only 1,000 rigorously filtered examples (including examples crafted by the authors themselves), with plain supervised fine-tuning and no RLHF (sketched below).
- Intuition: Superficial Alignment Hypothesis:
- A model’s knowledge and capabilities are learnt almost entirely during pretraining, while alignment teaches it which subdistribution of formats should be used when interacting with users. If this hypothesis is correct, and alignment is largely about learning style, then a corollary of the Superficial Alignment Hypothesis is that one could sufficiently tune a pretrained language model with a rather small set of examples.
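The recipe itself is just standard supervised fine-tuning of a pretrained model on the small curated set - no reward model, no RLHF stage. Below is a minimal sketch using Hugging Face transformers/datasets; the file name lima_1k.jsonl, its prompt/response fields, the base model (a small stand-in; the paper fine-tunes LLaMA-65B), the prompt template, and the hyperparameters are all illustrative assumptions, not the paper's exact setup.

```python
# Minimal SFT sketch: fine-tune a pretrained LM on ~1,000 curated
# prompt/response pairs (no RLHF). All names below are placeholders.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "gpt2"  # small runnable stand-in; the paper uses LLaMA-65B
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Assumed format: one JSON object per line with "prompt" and "response".
data = load_dataset("json", data_files="lima_1k.jsonl", split="train")

def preprocess(example):
    # Concatenate prompt and response into one training sequence;
    # the separator/template here is a stand-in, not the authors' format.
    text = example["prompt"] + "\n\n" + example["response"] + tokenizer.eos_token
    return tokenizer(text, truncation=True, max_length=1024)

tokenized = data.map(preprocess, remove_columns=data.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="lima-sft",
        num_train_epochs=10,           # small data -> many epochs (illustrative)
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        learning_rate=1e-5,
    ),
    train_dataset=tokenized,
    # mlm=False gives standard causal-LM labels copied from input_ids.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```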
Results:
- Strong results across all three evaluations: human preference, GPT preference, and absolute grading of response quality.
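The GPT-preference evaluation is a pairwise comparison with an LLM judge: show the judge the prompt plus two anonymized responses and ask which is better. A hedged sketch with the OpenAI client follows; the judge prompt, the judge model name, and the order-swapping trick are my assumptions, not the paper's exact protocol.

```python
# Sketch of LLM-as-judge pairwise preference evaluation (illustrative).
import random
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_TEMPLATE = """You are comparing two responses to the same prompt.

Prompt:
{prompt}

Response A:
{a}

Response B:
{b}

Which response is better? Answer with a single letter: A or B."""

def judge(prompt: str, resp_model: str, resp_baseline: str) -> str:
    # Randomize A/B order to reduce position bias, then map the answer back.
    swapped = random.random() < 0.5
    a, b = (resp_baseline, resp_model) if swapped else (resp_model, resp_baseline)
    reply = client.chat.completions.create(
        model="gpt-4o",  # assumed judge model, not the paper's
        messages=[{"role": "user",
                   "content": JUDGE_TEMPLATE.format(prompt=prompt, a=a, b=b)}],
        temperature=0,
    ).choices[0].message.content.strip()
    picked_a = reply.startswith("A")
    return "model" if (picked_a != swapped) else "baseline"
```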


- Why is less more? Doubling the training set does not improve response quality. This result, alongside the paper's other ablations, suggests that the scaling laws of alignment are not driven by quantity alone, but are a function of prompt diversity combined with consistently high-quality responses (see the curation sketch below).

- Multi-turn dialogue: improved significantly after fine-tuning.
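One way to make "diversity over quantity" concrete is to cluster a candidate prompt pool and keep only a few high-quality examples per cluster, instead of adding more near-duplicates. This is an illustrative recipe only (the paper's curation was largely manual), and it assumes you already have prompts with precomputed quality scores.

```python
# Illustrative diversity-aware subset selection (not the paper's method).
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def diverse_subset(prompts, quality_scores, n_clusters=50, per_cluster=2):
    # Embed prompts (TF-IDF for simplicity; any sentence embedding works).
    X = TfidfVectorizer(max_features=4096).fit_transform(prompts)
    n_clusters = min(n_clusters, len(prompts))
    labels = KMeans(n_clusters=n_clusters, n_init="auto").fit_predict(X)

    selected = []
    for c in range(n_clusters):
        members = [i for i, lab in enumerate(labels) if lab == c]
        # Within each cluster, keep only the highest-quality examples.
        members.sort(key=lambda i: quality_scores[i], reverse=True)
        selected.extend(members[:per_cluster])
    return selected  # indices of the kept prompts
```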

Takeaway:

In this paper, the reason the model couldn't do self-reasoning might simply be that the reasoning quality of its data was not good enough. What if we collected reasoning data ourselves, very thoroughly (like that o1 demo example)?
The main problem is that this requires well-crafted human data, which is expensive in cost and challenging in quality (how do we put the reasoning process into words?). However, if the Superficial Alignment Hypothesis holds, all we would need is something like 100 examples, right?
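If the goal is ~100 carefully written reasoning examples, the hard part is deciding how a reasoning trace gets serialized as text. Here is a minimal sketch of one possible record format; the field names and the file name reasoning_100.jsonl are my own choices, purely illustrative.

```python
# Hypothetical schema for a small hand-crafted reasoning dataset:
# each record separates the prompt, the written-out reasoning steps,
# and the final answer. Field names are illustrative, not from the paper.
import json

example = {
    "prompt": "A train leaves at 3:40 pm and the trip takes 95 minutes. "
              "When does it arrive?",
    "reasoning": [
        "95 minutes is 1 hour and 35 minutes.",
        "3:40 pm plus 1 hour is 4:40 pm.",
        "4:40 pm plus 35 minutes is 5:15 pm.",
    ],
    "final_answer": "5:15 pm",
}

# Append one record per line (JSONL), ready to feed into the SFT sketch above.
with open("reasoning_100.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(example, ensure_ascii=False) + "\n")
```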