
Apple study proves LLM-based AI models are flawed because they can't reason

Apple plans to introduce its own version of AI starting with iOS 18.1 – image courtesy of Apple

A new paper from Apple's AI researchers suggests that engines based on large language models, such as those from Meta and OpenAI, still lack basic reasoning skills.

The group proposed a new benchmark, GSM-Symbolic, to help others measure the reasoning capabilities of various large language models (LLMs). Their initial testing shows that small changes in query formulations can lead to significantly different answers, undermining the models’ robustness.

The group investigated the “fragility” of mathematical reasoning by adding contextual information to their queries that a human could understand but that shouldn’t affect the fundamental mathematics of the solution. This produced varying answers, which shouldn’t happen.

“Specifically, the performance of all models deteriorates [even] when only the numeric values in the question are changed in the GSM-Symbolic benchmark,” the group wrote in their report. “Furthermore, the fragility of the mathematical reasoning in these models [demonstrates] that their performance deteriorates significantly as the number of sentences in a question increases.”

The study found that adding even one sentence that appeared to offer relevant information to a given math question could reduce the accuracy of the final answer by up to 65 percent. “It’s simply not possible to build reliable agents on this basis, where changing one or two words in an irrelevant way or adding a few bits of irrelevant information could give you a different answer,” the study concluded.

A lack of critical thinking

A specific example illustrating the issue was a math question that required genuine understanding to answer. The problem the team developed, called “GSM-NoOp,” was similar to the kind of math “word problems” an elementary school student might encounter.

The prompt began with the information needed to work out the answer. “Oliver picks 44 kiwis on Friday. Then he picks 58 kiwis on Saturday. On Sunday, he picks twice as many kiwis as he did on Friday.”

The query then adds a sentence that seems relevant but doesn’t actually affect the final answer, noting that of the kiwis picked on Sunday, “five of them were slightly smaller than average.” It then simply asks, “how many kiwis does Oliver have?”

The note about the size of some of the kiwis picked on Sunday shouldn’t have any bearing on the total number of kiwis picked. However, OpenAI’s model, as well as Meta’s Llama3-8b, subtracted the five smaller kiwis from the overall result.
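For reference, the arithmetic the question actually calls for is simple: 44 kiwis on Friday, plus 58 on Saturday, plus twice Friday’s haul (88) on Sunday, comes to 190 kiwis. Subtracting the five smaller fruit, as the models did, yields 185 instead, even though the size detail changes nothing about the count.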

The faulty logic echoes a 2019 study that could reliably confuse AI models by asking about the ages of the two previous Super Bowl quarterbacks. When the researchers added background and related information about the games they played in, along with a third person who was the quarterback in another game, the models produced incorrect answers.

“We found no evidence of formal reasoning in the language models,” the new study concluded. The behavior of LLMs is “better explained by complex pattern matching,” which the study found is “so fragile that [simply] changing names can change the results.”
