Macrobius on Nostr:
On a more serious note, what actually might be going on that is newsworthy:
https://buttondown.com/ainews/archive/ainews-bespoke-stratos-sky-t1-the-vicunaalpaca/
In the ChatGPT heyday of 2022-23, Alpaca and Vicuna were born out of LMSYS and Stanford as ultra-cheap ($300) finetunes of LLaMA 1 that distilled from ChatGPT/Bard samples to achieve 90% of the quality of ChatGPT/GPT-3.5.
In the last 48 hours, it seems the Berkeley/USC folks have done it again, this time with the reasoning models.
It's hard to believe this sequence of events happened just in the last 2 weeks:
....
While Bespoke's distillation does not quite match DeepSeek's distillation in performance, they used 17k samples vs DeepSeek's 800k. It is pretty evident that they could keep going here if they wished.
The more shocking takeaway is that "SFT is all you need" - no major architecture changes are required for reasoning to emerge; just feed in more (validated, rephrased) reasoning traces, backtracking and pivoting and all, and it seems to generalize well. In all likelihood, this explains the relative efficiency of o1-mini and o3-mini vs their full-size counterparts.
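To make the "SFT is all you need" point concrete, here is a minimal sketch of what finetuning on reasoning traces looks like, using Hugging Face TRL's SFTTrainer. The base model name, hyperparameters, and toy trace below are illustrative assumptions, not the actual Bespoke Stratos or Sky-T1 recipe.

```python
# Minimal sketch: plain supervised finetuning on chat-formatted reasoning
# traces with Hugging Face TRL's SFTTrainer. Base model, hyperparameters,
# and the toy example are illustrative assumptions only.
from datasets import Dataset
from trl import SFTConfig, SFTTrainer

# Each row is a conversation whose assistant turn contains the full
# (validated, rephrased) reasoning trace - backtracking included - followed
# by the final answer. In practice this would be the ~17k distilled traces.
train_rows = [
    {
        "messages": [
            {"role": "user", "content": "What is 17 * 24?"},
            {"role": "assistant", "content": (
                "Let me compute 17 * 24. 17 * 20 = 340 and 17 * 4 = 68. "
                "Wait, double-checking 17 * 4: 10*4 = 40, 7*4 = 28, so 68. "
                "340 + 68 = 408. The answer is 408."
            )},
        ]
    },
    # ... many more traces in a real run
]

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-7B-Instruct",   # illustrative base model
    train_dataset=Dataset.from_list(train_rows),
    args=SFTConfig(
        output_dir="reasoning-sft",
        num_train_epochs=3,
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        learning_rate=1e-5,
        bf16=True,
    ),
)
trainer.train()
```

The point of the sketch is how ordinary it is: no new architecture, no RL loop, just standard next-token SFT over long traces formatted with the model's chat template.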
Rest at the link
- 30 -