YawningGoat /
npub133p…hy97
2025-01-05 22:02:21
in reply to nevent1q…kw55

Some thoughts on benchmarks, if you find them useful :)
LLM Benchmarks (in my rough order of preference):

General & Multi-category:
- Vibes (use it and see for yourself)
- https://livebench.ai/
- https://x.com/aidan_mclau/status/1857576189423935976?s=46 (check the latest tweets)
- https://huggingface.co/spaces/lmarena-ai/chatbot-arena-leaderboard (easily gamed)
- https://artificialanalysis.ai (easily gamed)

Coding:
- https://aider.chat/docs/leaderboards/
- https://livebench.ai/
- lmarena & Artificial Analysis (see above)

Reasoning:
- https://huggingface.co/spaces/allenai/ZeroEval
- https://arcprize.org/2024-results
- lmarena & Artificial Analysis (see above)

Other:
- https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard (Filter for source models only. Covers lots of benchmarks on open LLMs, but it can be easily gamed.)
Author Public Key
npub133pshk4dcx3q9exaz8yxeqj5dwcs3464ud6t0yvps86n8w2wxyhq34hy97