Apple Study Reveals Critical Flaws in AI Logic

Hartley Charlton

Apple’s AI research team has found significant shortcomings in the reasoning abilities of large language models, according to a newly published study.


The study, published on arXiv, outlines Apple’s evaluation of a number of leading language models, including models from OpenAI, Meta, and other well-known developers, to determine how well these models can handle mathematical reasoning tasks. The results show that even minor changes in the wording of questions can cause large discrepancies in the models’ performance, which could undermine their reliability in scenarios that require logical consistency.

Apple highlights a persistent problem with language models: their reliance on pattern matching rather than genuine logical reasoning. In several tests, the researchers showed that adding irrelevant information to a question — details that shouldn’t affect the mathematical outcome — could cause the models to answer completely differently.

One example given in the paper involves a simple math problem about how many kiwis a person picked over a period of days. When irrelevant data about the size of some of the kiwis was introduced, models like OpenAI’s o1 and Meta’s Llama incorrectly adjusted the final total, even though the extra information had no bearing on the solution.
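To make the failure mode concrete, here is a minimal sketch in the spirit of that problem; the numbers and variable names are hypothetical and are not taken from the paper. Correct reasoning simply sums the kiwis picked, while subtracting the "smaller" kiwis is the kind of unwarranted adjustment the researchers observed.

    # Illustrative only: hypothetical numbers, not the exact figures from Apple's paper.
    # Correct reasoning ignores kiwi size entirely; the error described in the study is
    # "adjusting" the total for a detail that has no bearing on the count.
    friday, saturday = 40, 50
    sunday = 2 * friday                  # e.g. the problem says "double Friday's amount"
    smaller_kiwis = 5                    # irrelevant detail about size, not quantity

    correct_total = friday + saturday + sunday              # 40 + 50 + 80 = 170
    pattern_matched_total = correct_total - smaller_kiwis   # 165: the kind of wrong answer observed

    print(correct_total, pattern_matched_total)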

"We found no evidence of formal reasoning in the language models," the researchers wrote. "Their behavior is better explained by sophisticated pattern matching, so fragile that changing names can change the results by about 10%."

This fragility in reasoning led the researchers to conclude that the models don’t use real logic to solve the problems, but instead rely on sophisticated pattern recognition learned during training. They found that “simply changing names can change the results,” a potentially worrying sign for the future of AI applications that require consistent, accurate reasoning in real-world contexts.

According to the study, all of the models tested, from small open-source versions like Llama to proprietary models like OpenAI's GPT-4o, showed significant performance degradation when faced with seemingly minor changes in input data. Apple suggests that AI may need to combine neural networks with traditional, symbol-based reasoning, an approach known as neurosymbolic AI, to achieve more accurate decision-making and problem-solving capabilities.
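
As a rough illustration of that idea (a hypothetical sketch, not Apple's design or any specific system), a neural model could be restricted to translating a word problem into a formal expression, while a deterministic symbolic layer performs the actual arithmetic, so irrelevant wording cannot change the computed answer.

    # Hypothetical neurosymbolic sketch; the structure and names are illustrative only.
    import ast
    import operator

    OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
           ast.Mult: operator.mul, ast.Div: operator.truediv}

    def evaluate(expression: str):
        """Deterministically evaluate a basic arithmetic expression (the 'symbolic' step)."""
        def walk(node):
            if isinstance(node, ast.BinOp):
                return OPS[type(node.op)](walk(node.left), walk(node.right))
            if isinstance(node, ast.Constant):
                return node.value
            raise ValueError("unsupported expression")
        return walk(ast.parse(expression, mode="eval").body)

    # Imagine the neural model translates the kiwi problem into "40 + 50 + 2 * 40";
    # the symbolic layer then returns the same answer no matter how the problem is worded.
    print(evaluate("40 + 50 + 2 * 40"))  # 170

The division of labor is the point: any sensitivity to wording is confined to the translation step, and the final number comes from a component that cannot be swayed by irrelevant details.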

Tags: Apple Research, Artificial Intelligence