link

Core Idea: Alignment fine-tuning may need only a small set of carefully curated prompt-response pairs; what matters is prompt diversity and response quality rather than sheer data quantity.

Results:

  1. Good results: human preference, GPT preference, and absolute grading

Screenshot 2024-10-11 at 3.17.02 PM.png

Screenshot 2024-10-11 at 3.17.29 PM.png

  2. Why is less more? Doubling the training set does not improve response quality. Together with the other findings in this section, this suggests that the scaling laws of alignment are not a function of data quantity alone, but rather of prompt diversity combined with high-quality responses (see the pairwise-comparison sketch after this list).

Screenshot 2024-10-11 at 3.18.57 PM.png

  3. Multi-turn dialogue: Improved significantly after fine-tuning.

    Screenshot 2024-10-11 at 3.22.46 PM.png
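
A minimal sketch of how such a pairwise GPT-preference comparison could be run. This is my own illustration, not the paper's setup: the `responses.jsonl` file, the judge prompt, and the `gpt-4o` judge model are assumptions.

```python
# Minimal sketch (assumptions, not the paper's setup): GPT-as-judge pairwise
# comparison between a model trained on the base prompt set ("base") and one
# trained on the doubled set ("doubled").
import json
import random

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are comparing two responses to the same prompt.
Prompt: {prompt}
Response A: {a}
Response B: {b}
Reply with a single letter: A if A is better, B if B is better, T for a tie."""


def judge(prompt: str, resp_base: str, resp_doubled: str) -> str:
    # Randomize which response appears as "A" to avoid position bias.
    flip = random.random() < 0.5
    a, b = (resp_doubled, resp_base) if flip else (resp_base, resp_doubled)
    verdict = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(prompt=prompt, a=a, b=b)}],
        temperature=0,
    ).choices[0].message.content.strip().upper()
    if verdict.startswith("T"):
        return "tie"
    first_wins = verdict.startswith("A")
    return "doubled" if first_wins == flip else "base"


# responses.jsonl (hypothetical): {"prompt": ..., "base": ..., "doubled": ...} per line
wins = {"base": 0, "doubled": 0, "tie": 0}
with open("responses.jsonl") as f:
    for line in f:
        row = json.loads(line)
        wins[judge(row["prompt"], row["base"], row["doubled"])] += 1
print(wins)  # similar win rates would indicate that doubling the data did not help
```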

Takeaway:

Screenshot 2024-10-07 at 1.47.46 PM.png

In this paper, the reason they could not get self-reasoning to work may simply be that their data quality was not good enough. What if we collected the reasoning data ourselves, very thoroughly (like that o1 demo example)?

The main problem is that this requires well-crafted human data, which is expensive to collect and hard to get right (how do you put a reasoning process into words?). However, all we would need is something like 100 examples, right?
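
If it really is only ~100 hand-written examples, the fine-tuning side is cheap. A minimal sketch of what that could look like, assuming a `reasoning_examples.jsonl` file where each line holds a prompt, the written-out reasoning, and the final answer; the file name, prompt template, base model, and hyperparameters are illustrative assumptions, not anything from the paper.

```python
# Minimal sketch (all names and hyperparameters are assumptions): supervised
# fine-tuning of a causal LM on ~100 hand-curated reasoning examples.
import json

from datasets import Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

MODEL = "meta-llama/Llama-2-7b-hf"  # placeholder base model


def format_example(ex):
    # The written-out reasoning goes into the training target, so the model is
    # explicitly taught to produce the reasoning process, not just the answer.
    return (
        f"### Question:\n{ex['prompt']}\n\n"
        f"### Reasoning:\n{ex['reasoning']}\n\n"
        f"### Answer:\n{ex['answer']}"
    )


# reasoning_examples.jsonl (hypothetical): {"prompt": ..., "reasoning": ..., "answer": ...}
with open("reasoning_examples.jsonl") as f:
    rows = [json.loads(line) for line in f]
dataset = Dataset.from_list(rows)

tokenizer = AutoTokenizer.from_pretrained(MODEL)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(MODEL)

tokenized = dataset.map(
    lambda ex: tokenizer(format_example(ex), truncation=True, max_length=1024)
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="reasoning-sft",
        num_train_epochs=10,          # tiny dataset, so multiple epochs
        per_device_train_batch_size=1,
        learning_rate=1e-5,
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```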