AtlantisPleb on Nostr
Episode 120: Exploring SWE-bench Verified
We talk smack about benchmarks but conclude they may finally be worth our time.
We do a dramatic reading of OpenAI's blog post, then feed it to OpenAgents, which sets up a new repo as a benchmark workspace.
We're going for the high score!
https://stacker.news/items/647686/r/AtlantisPleb