https://arxiv.org/abs/2211.00053 (Generating Sequences by Learning to Self-Correct)
Main idea (By Minwu):
- Separate the generator and the corrector. This is beneficial because:
- For the generator, you can use a stronger model. This lets you exploit its strong generation ability without fine-tuning it for any specific task.
- For the corrector, you can use a smaller model. This lets you nimbly fine-tune the small language model (SLM) for specific tasks at lower cost.
- In other words, when the initial response is good, the output distribution is already shifted toward the proper answer, so correcting it costs relatively little (see the sketch below).
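A minimal sketch of this division of labor, assuming Hugging Face-style models (the checkpoint names are placeholders, not from the paper): the large generator stays frozen, and only the small corrector receives gradient updates.

```python
import torch
from transformers import AutoModelForCausalLM

# Hypothetical checkpoints: any strong LM as generator, any small LM as corrector.
generator = AutoModelForCausalLM.from_pretrained("strong-pretrained-lm")
corrector = AutoModelForCausalLM.from_pretrained("small-pretrained-lm")

# The generator is used as-is: frozen, no task-specific fine-tuning.
generator.requires_grad_(False)

# Only the corrector's (far fewer) parameters get gradients, so
# task-specific fine-tuning stays cheap.
optimizer = torch.optim.AdamW(corrector.parameters(), lr=1e-5)
```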
Inference:
- The generator produces an initial hypothesis from the input; the corrector then repeatedly takes (input, current hypothesis) and outputs an improved hypothesis, stopping after a fixed number of steps or once the output stops changing.
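A minimal sketch of that loop, assuming `generate` and `correct` wrap the two models' decoding calls (both helper names are hypothetical):

```python
def self_correct(x, generate, correct, max_steps=3):
    """Generate-then-correct loop: one generator call, then repeated
    corrector calls until the hypothesis stops changing."""
    y = generate(x)                 # initial hypothesis from the strong generator
    for _ in range(max_steps):
        y_next = correct(x, y)      # corrector proposes an improved hypothesis
        if y_next == y:             # simple stopping criterion (an assumption)
            break
        y = y_next
    return y
```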
Training (the corrector model):
Overview
- Self-corrective learning: sample hypotheses from the generator, sample candidate corrections from the corrector, keep pairs where the correction improves a value (reward) function, and fine-tune the corrector on those pairs. Repeating this online keeps the training data matched to the mistakes the generator and corrector actually make.
Algorithm:
- Initialize a data pool with hypotheses sampled from the generator for each training input.
- Sample corrections from the current corrector for hypotheses in the pool and add them back to the pool.
- Form value-improving pairs (y, y') where v(y') > v(y), preferring corrections that stay similar to the original hypothesis.
- Fine-tune the corrector on those pairs, then repeat. A condensed sketch follows below.
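A condensed sketch of one online round (the `value`, `similarity`, and `fine_tune` helpers are hypothetical stand-ins for the paper's task-specific reward, text similarity, and standard fine-tuning):

```python
def self_corrective_round(inputs, generate, correct, value, similarity, fine_tune,
                          n_hyps=4, n_corrs=2):
    """One online round: grow the hypothesis pool, keep value-improving
    (hypothesis, correction) pairs, fine-tune the corrector on them."""
    pairs = []
    for x in inputs:
        for y in [generate(x) for _ in range(n_hyps)]:        # seed pool from generator
            for _ in range(n_corrs):
                y2 = correct(x, y)                            # corrector's own proposal
                if value(x, y2) > value(x, y):                # keep only improving pairs
                    pairs.append((x, y, y2, similarity(y, y2)))
    # The paper favors pairs with high value gain AND high similarity when
    # selecting training examples; sorting by similarity here is a simplification.
    pairs.sort(key=lambda p: p[3], reverse=True)
    fine_tune(pairs)  # cross-entropy on p(correction | input, hypothesis)
    return pairs
```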
Loss function:
- Standard cross-entropy on the selected (hypothesis, correction) pairs; the value gain and the similarity enter through how pairs are sampled rather than through an extra loss term (see the reconstruction below).
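My reconstruction in LaTeX (hedged: the pair-sampling weight is one plausible form, with α, β as weighting hyperparameters; check the paper for the exact expression):

```latex
% Cross-entropy over sampled value-improving pairs (reconstruction):
\mathcal{L}(\theta)
  = - \mathbb{E}_{(x,\,y,\,y') \sim \mathcal{D}}
      \big[ \log p_\theta(y' \mid x,\, y) \big]
% Assumed pair-sampling weight, increasing in both value gain and similarity:
P\big((y, y') \mid x\big) \propto
  \exp\!\big( \alpha \, [\, v(y') - v(y) \,] + \beta \, s(y, y') \big)
```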
Brief Insights (By Safal)
- Train a corrector using an online training procedure that uses scalar or natural-language feedback on intermediate, imperfect generations. This assumes access to a reward function or ground truth, which is not always available (see the feedback sketch after this list).
- First you generate an answer with the generator; then the corrector takes over and iteratively tries to improve it. So is the output ultimately limited by the corrector's abilities, even though they mention the corrector can be a smaller model? Performance seems pretty good according to their results, though.
- They say the trained corrector can even be applied to a larger generator, with performance similar to training a new corrector. Does this relate in any way to the distribution-shift problem mentioned in the Google paper?
- (By Minwu) My understanding is that the distribution-shift problem is essentially about whether the training data contains the kinds of mistakes the generator is likely to make. DeepMind's paper mentioned that they tried to fine-tune Gemini on OpenAI's dataset, PRM800K, but it didn't work well because that dataset contained few of the kinds of mistakes Gemini would make. In this paper, however, the corrector iteratively corrects mistakes made by the generator itself: the generator and corrector are always coupled, so the corrector is fine-tuned on exactly the mistakes the generator produces. Therefore, I think this paper doesn't run into that issue.
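On the scalar vs. natural-language feedback point above: the paper also conditions the corrector on a feedback signal f, i.e., it models p(y' | x, y, f). A tiny sketch of the two interfaces (the helper names and prompt format are assumptions, not from the paper):

```python
def scalar_feedback(value, x, y):
    """Scalar feedback: just a reward/value score for the current hypothesis."""
    return value(x, y)

def correct_with_feedback(corrector_lm, x, y, f):
    """Natural-language feedback: fold the feedback string into the corrector's
    prompt so it effectively models p(y' | x, y, f)."""
    prompt = f"Input: {x}\nHypothesis: {y}\nFeedback: {f}\nCorrection:"
    return corrector_lm(prompt)
```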