theHigherGeometer on Nostr: npub16t62r…lhgu7 "We evaluated six leading language models on our existing subset ...
npub16t62rkttt6aduudqya89lvfallx59f4g6fltmqdhhr9jt3jw6s5q3lhgu7 "We evaluated six leading language models on our existing subset of FrontierMath problems: o1-preview (OpenAI 2024b), o1-mini (OpenAI 2024d), and GPT-4o (2024-08-06 version) (OpenAI 2024a), Claude 3.5 Sonnet (2024-10-22 version) (Anthropic 2024b), Grok 2 Beta (XAI 2024), and Google DeepMind’s Gemini 1.5 Pro 002 (GoogleAI 2024). All models had a very low performance on FrontierMath problems, with no model achieving even a 2% success rate on the full benchmark"
he he he.