Dave Rahardja on Nostr:
Finally got around to reading the #Apple #LLM paper that’s been going around. tl;dr: LLMs can’t do math because they don’t actually understand concepts; they are just really fancy autocomplete engines.
We knew that already, but this paper quantifies it. Math performance is pretty dismal even with training that tries to optimize for math. The best performer was OpenAI's GPT-4o, which scored around 95% on the most basic grade-school word problems. That still means it got roughly 1 in 20 questions wrong, which makes it unusable for anything in production.
Adding (relevant or irrelevant) clauses or even changing proper names can cause model performance to rapidly collapse.
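The paper's trick, roughly, is to turn each word problem into a template and re-instantiate the names and numbers (and optionally append a distracting clause). A robust solver should score identically on every variant. A minimal sketch of the idea, with made-up names and values:

```python
import random

# Sketch of the GSM-Symbolic idea from the paper: templatize a word
# problem, then re-instantiate names and numbers. The ground-truth
# answer is recomputed, so any accuracy drop across variants is the
# model's fault, not the benchmark's.
TEMPLATE = ("{name} picks {x} apples on Monday and {y} apples on "
            "Tuesday. How many apples does {name} have?")

NAMES = ["Sophie", "Liam", "Mateo"]  # hypothetical substitutions

def make_variant(rng, with_distractor=False):
    x, y = rng.randint(2, 40), rng.randint(2, 40)
    problem = TEMPLATE.format(name=rng.choice(NAMES), x=x, y=y)
    if with_distractor:
        # Irrelevant clause, in the spirit of the paper's "GSM-NoOp" set.
        problem += " Five of the apples are slightly smaller than average."
    return problem, x + y  # the correct answer is unchanged by the clause

rng = random.Random(0)
for _ in range(3):
    question, answer = make_variant(rng, with_distractor=True)
    print(question, "->", answer)
```

The template and names here are mine, not the paper's; the actual benchmark uses many templates derived from GSM8K. The point is that nothing about the arithmetic changes across variants, yet the models' scores do.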
#ai
https://arxiv.org/pdf/2410.05229