YawningGoat on Nostr: Some thoughts on benchmarks, if you find them useful :)
quoting note19u9…lnsp
LLM Benchmarks (in my rough order of preference):
General & Multi-category:
- Vibes (use it and see for yourself)
- https://livebench.ai/
- https://x.com/aidan_mclau/status/1857576189423935976?s=46 (check latest tweets)
- https://huggingface.co/spaces/lmarena-ai/chatbot-arena-leaderboard (easily gamed)
- https://artificialanalysis.ai (easily gamed)
Coding:
- https://aider.chat/docs/leaderboards/
- https://livebench.ai/
- lmarena & Artificial Analysis (see above)
Reasoning:
- https://huggingface.co/spaces/allenai/ZeroEval
- https://arcprize.org/2024-results
- lmarena & Artificial Analysis (see above)
Other:
- https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard (Filter for source models only. Lots of benchmarks on open LLMs, but easily gamed.)