npub15s…e8xgu on Nostr: o3 isn't as good as I hoped, but it's still an increment in the SOTA. 69% on ...
o3 isn't as good as I hoped, but it's still an increment in the SOTA.
69% on SWE-Bench Verified! The regression line over the past 2 years still points to 100% by year end!
Frankly, I think the real story is how cheaply Gemini 2.5 is delivering 64% on SWE-Bench.
Exciting times! Coding with Gemini 2.5 is so satisfying, a big step up from DeepSeek V3.1, which is what I was using before.
#ai #llm #o3