AtlantisPleb / Christopher David
npub1tlv…7fdm
2024-08-13 20:59:32

Episode 120: Exploring SWE-bench Verified

We talk smack about benchmarks but conclude they may finally be worth our time.

We do a dramatic reading of OpenAI's blog post, then feed it to OpenAgents, which sets up a new repo as a benchmark workspace.

We're going for the high score!

https://stacker.news/items/647686/r/AtlantisPleb
Author Public Key
npub1tlv67m7xvlyplzexuynmfpguvyet0sjffce3y8vu0suuyuwgzauqjk7fdm