lhl on Nostr: #ai Preface: I think Gary Marcus contracted a bad case of engagement-maxing ...
#ai Preface: I think Gary Marcus contracted a bad case of engagement-maxing brainworms a while back, and a lot of his AI takes have gotten worse and worse (e.g., his latest outrage about training on train). Still, someone pointed out that he did OK on a set of predictions he made back in March 2024, so I figured I'd check the list:
✅ 7-10 GPT-4 level models
I think he gets this broadly right. Even ignoring OpenAI’s new models (o1, o1-mini, gpt-4o) we have: Anthropic Claude 3.5 Sonnet, Google Gemini 1.5 Pro/2.0 Flash/2.0 Pro, Meta Llama 3.1 405B/3.3 70B, Qwen 2.5 72B/QwQ, DeepSeek V2.5, 01.ai Yi Lightning, Mistral Large 2, xAI Grok 2, and Amazon Nova Pro - all are solidly GPT-4 class (or better!) and one could argue that this is table stakes now.
❌ No massive advance (no GPT-5, or disappointing GPT-5)
I think o1 (and even Claude Sonnet 3.5) are a meaningful leap vs og GPT-4 for hard/reasoning tasks, but the o3 announcement/benchmarks are out of this world (not just ARC-AGI, but GPQA Diamond, FrontierMath, etc - it's able to saturate almost every single traditional eval). Test-time compute is massive, and we're seeing adoption across the board (DeepSeek R1, Qwen QwQ, Gemini Flash Thinking, etc).
Multimodal omni models also now … exist. This, again, is something that simply didn't exist last year. Vision, audio, speech, video, images: it's all here, and all quickly getting better.
Even beyond that, it's important to remember that GPT-4 launched with an 8K token context (3.5 was 4K!). Table stakes now are 32-128K, and Gemini offers up to 2M tokens for everyone in AI Studio for free. This is something that would have been mind-boggling just a year ago and alone is a massive advance.
What happens when you start combining these? If you want a glimpse of the (near) future, check out Google's Multimodal Live - the future will be always-on, contextually aware copilots/assistants, and as we're seeing with the just-released QvQ, of course they will be able to think extensively about things. To me, the improvements from Anthropic's Artifacts, OpenAI's Canvas, and Google's NotebookLM are pretty big leaps as well. They are on the system/product side, but point at how much interesting/low-hanging fruit is available for actually applying/using these models.
Lastly, coding has also taken a huge leap. gpt-4-0314 scores 66.2% (vs gpt-3.5-turbo-0301 at 57.9%) on the Aider Leaderboard; claude-3.5-sonnet-20241022 scores 84.2%. On LiveCodeBench, o1-mini-2024-09-12 scores 67.2 Pass@1, over double gpt-4-0613's score of 32.5. On SWE-bench, SWE-agent + GPT-4 (1106) scored 23.2%, while the top verified entry atm, Amazon Q Developer Agent, scores 55%.
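(Aside on the metric: Pass@1 is the fraction of problems solved by a single sampled attempt; benchmarks like LiveCodeBench typically estimate it with the standard unbiased pass@k estimator from the HumanEval paper. A minimal sketch, with illustrative numbers:)

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples passes, given n total samples of which c are correct."""
    if n - c < k:
        return 1.0  # too few failures to fill k draws: guaranteed pass
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative: 10 samples per problem, 4 correct -> pass@1 is just c/n
print(pass_at_k(10, 4, 1))  # 0.4
```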
Fun fact that turned up in search: Gary Marcus speculated GPT-4 Turbo was GPT-5. Yes, a clearly smaller/faster model. Sadly, brainworms.
½ Price wars
I'll give half a point here. While prices have dropped due to stiff competition, and distillation/quantization/inference-efficiency gains have given us better-performing models at lower prices, they've come down incrementally and at regular intervals, not in the frenzy that would characterize a true price war IMO.
The 8K GPT-4 launched at a 3:1 blended token price of $37.50/million, while GPT-4o is now $4.40/M. Gemini 1.5 Pro is half that at $2.20/M, Amazon Nova Pro is $1.40/M, and Llama 3.3 70B is as cheap as $0.20-$0.27/M (Nebius, DeepInfra). DeepSeek V2.5 is at $0.175/M (or significantly less for a cached hit).
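(For anyone wanting to check the math: the "3:1 blended" figures above assume three input tokens for every output token. GPT-4 8K launched at $30/M input and $60/M output, which gives the $37.50/M figure.)

```python
def blended_price(input_per_m: float, output_per_m: float, ratio: float = 3.0) -> float:
    """Blend per-million input/output token prices at a given input:output ratio."""
    return (ratio * input_per_m + output_per_m) / (ratio + 1)

# GPT-4 8K launch pricing: $30/M input, $60/M output
print(blended_price(30.0, 60.0))  # 37.5
```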
Note, there has been a pretty cut-throat (loss-leading) price war on two fronts: in China, and for open-model inference. But those are part of why I only give a half point, since they contrast with the absolute frontier, where prices basically remain where they were in 2023 (o1-preview is $26.25/M blended; Claude 3 Opus, if you still need it, is $30/M).
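(On the "significantly less for a cached hit" point: providers with prompt caching bill cached input tokens at a discounted rate, so the effective price depends on your cache-hit rate. A hedged sketch; the rates below are placeholders, not quoted from any price sheet:)

```python
def effective_input_price(base: float, cached: float, hit_rate: float) -> float:
    """Effective per-million input-token price given a prompt-cache hit rate
    (hit_rate is the fraction of input tokens served from cache)."""
    return hit_rate * cached + (1.0 - hit_rate) * base

# Placeholder rates: $0.14/M uncached, $0.014/M cached, 50% of tokens cached
print(effective_input_price(0.14, 0.014, 0.5))  # 0.077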
½ Very little moat for anyone
On the one hand, it's not winner-take-all. The big players have all shown they can be fiercely competitive, and a bunch of scrappier (and open-source) players are competing well too: AI2, for example, has just released class-leading, completely open (weights, code, data) models (see also Tulu 3), and multiple open-source groups have reasoning models dropping.
That being said, we're seeing that gaining market traction, or staying on the leading edge, is pretty intense. The moat is in the traditional areas: capital, distribution, product development, execution.
Also gobs of compute. Gobs.
… Running long, continuing…