General Discussion
A very pro-AI account on both Bluesky and X posted about a "disturbing" Stanford paper on LLMs' failures at reasoning
The account is @godofprompt on both platforms. But they stopped using Bluesky quickly after getting almost no response there, despite having 233,000 followers on X.
Their post on this disturbing (for AI fans) new study is only on X. Since it's a social media post I'll quote it in full, but break the X link with spaces. The tweet below was posted at 2:40 AM ET this morning.
https:// x. com/ godofprompt/ status/2020764704130650600
@godofprompt
🚨 Holy shit Stanford just published the most uncomfortable paper on LLM reasoning I've read in a long time.
This isn't a flashy new model or a leaderboard win. It's a systematic teardown of how and why large language models keep failing at reasoning even when benchmarks say they're doing great.
The paper does one very smart thing upfront: it introduces a clean taxonomy instead of more anecdotes. The authors split reasoning into non-embodied and embodied.
Non-embodied reasoning is what most benchmarks test and it's further divided into informal reasoning (intuition, social judgment, commonsense heuristics) and formal reasoning (logic, math, code, symbolic manipulation).
Embodied reasoning is where models must reason about the physical world, space, causality, and action under real constraints.
Across all three, the same failure patterns keep showing up.
> First are fundamental failures baked into current architectures. Models generate answers that look coherent but collapse under light logical pressure. They shortcut, pattern-match, or hallucinate steps instead of executing a consistent reasoning process.
> Second are application-specific failures. A model that looks strong on math benchmarks can quietly fall apart in scientific reasoning, planning, or multi-step decision making. Performance does not transfer nearly as well as leaderboards imply.
> Third are robustness failures. Tiny changes in wording, ordering, or context can flip an answer entirely. The reasoning wasn't stable to begin with; it just happened to work for that phrasing.
One of the most disturbing findings is how often models produce unfaithful reasoning. They give the correct final answer while providing explanations that are logically wrong, incomplete, or fabricated.
This is worse than being wrong, because it trains users to trust explanations that don't correspond to the actual decision process.
Embodied reasoning is where things really fall apart. LLMs systematically fail at physical commonsense, spatial reasoning, and basic physics because they have no grounded experience.
Even in text-only settings, as soon as a task implicitly depends on real-world dynamics, failures become predictable and repeatable.
The authors don't just criticize. They outline mitigation paths: inference-time scaling, analogical memory, external verification, and evaluations that deliberately inject known failure cases instead of optimizing for leaderboard performance.
But they're very clear that none of these are silver bullets yet.
The takeaway isn't that LLMs can't reason.
Its more uncomfortable than that.
LLMs reason just enough to sound convincing, but not enough to be reliable.
And unless we start measuring how models fail, not just how often they succeed, we'll keep deploying systems that pass benchmarks, fail silently in production, and explain themselves with total confidence while doing the wrong thing.
That's the real warning shot in this paper.
Paper: Large Language Model Reasoning Failures
Link to the paper: https://arxiv.org/abs/2602.06176
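
To make the tweet's "robustness failure" and "external verification" points a bit more concrete, here is a rough sketch of my own. It is illustrative only, not code from the paper, and ask_model / toy_model are just placeholders for whatever LLM you would actually call. The idea: rephrase the same question a few ways and see whether the answer flips, and recompute anything checkable deterministically instead of trusting the model's explanation.

from typing import Callable

def answer_is_stable(ask_model: Callable[[str], str], phrasings: list[str]) -> bool:
    # Robustness check: ask the same question several ways and see
    # whether the model's final answer stays the same.
    answers = {ask_model(p).strip().lower() for p in phrasings}
    return len(answers) == 1

def verify_numeric(claimed: str, expected: float) -> bool:
    # External verification: don't trust the explanation, recompute
    # the quantity deterministically and compare.
    try:
        return abs(float(claimed) - expected) < 1e-9
    except ValueError:
        return False

if __name__ == "__main__":
    # Toy stand-in "model" so the sketch runs end to end (placeholder, not a real LLM).
    def toy_model(prompt: str) -> str:
        return "120" if "5!" in prompt or "factorial" in prompt else "24"

    phrasings = [
        "What is 5! ?",
        "Compute the factorial of 5.",
        "Multiply 1*2*3*4*5 and give just the number.",  # reworded -> answer flips
    ]
    print("stable answer across phrasings:", answer_is_stable(toy_model, phrasings))
    print("final answer verified:", verify_numeric(toy_model("What is 5! ?"), 120))

With the toy stand-in, the reworded third phrasing flips the answer, which is exactly the kind of silent failure the paper says benchmarks miss.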
SheltieLover
(78,346 posts)
Iris
(16,861 posts)That's not my understanding of how it works at all.
EdmondDantes_
(1,538 posts)Some of the more gung ho AI people do. And it's definitely part of the vernacular around AI. There's lots of similar words used to describe what an AI is doing when an answer is being generated
cachukis
(3,756 posts)My thinking on the fallacy of AI becoming intelligent is its inability for humility.
It is working to become Spock, but without a Kirk rebuttal.
I'm thinking it is going to be a more advanced computer, building on video games for day-to-day human engagement.
It will advance and fail on practical application, but it will get better over time.
I fear the laziness of most brains will succumb to its easy solutions. We are seeing how the entertainment value of social media has taken over credentialed knowledge.
People are now creating their own Chat friends.
I fear the technological advantages pursued for business goals will cause long-term systemic disruptions in social systems that will be difficult to escape.
Can we draw guidelines to handle the dichotomy?
I worry the money runs the show.
Whiskeytide
(4,647 posts)If Spock really had no humanity or humility or compassion, he likely would have simply killed all of the crew on the Enterprise because they kept creating illogical and dramatic plot lines for the show.
rog
(929 posts)... using a LLM to help summarize, organize, simplify, etc sources that the user supplies without asking the model to reason.
hunter
(40,488 posts)I really don't understand how anyone attributes "intelligence" to these automated plagiarism machines.
There are some aspects of this paper that bother me. For example, I think it's absurd to talk about such things as "LLM Reasoning Failures" when there's no reasoning going on at all.
Are we all so conditioned by our education that we think answering questions or writing short essays for an exam is some kind of "reasoning?" It's not.
I'll give an example: Sometimes I meet Evangelical Christian physicians who tell me they don't "believe in" evolution. They might even "believe" that the earth is merely thousands of years old and not billions. They've obviously passed Biology exams to become physicians, they've witnessed the troublesome quirks of the human body that can only be explained by evolution, yet they've never applied any of that to their own internal model of reality. There's an empty space where those models ought to exist. ( Or possibly they are lying to themselves, which is the worst sort of lie. )
With AI it's all empty space. The words go in and the words come out without anything in between.
Whenever I write I'm always concerned that I'm letting the language in my head do my thinking for me; that I'm being the meat-based equivalent of an LLM. If I'm doing that, I don't really have anything to say. I want all my writing to represent my own internal models of reality as shaped by my own experiences.
LLMs don't have any experiences.
sakabatou
(45,942 posts)
odins folly
(567 posts)The Dunning Kruger effect. Its a machine. It takes in info, compares that to what it has been fed before and regurgitates an answer. But it doesnt have the ability to reason and speculate a correct answer, and it shows confidence that that is the correct answer.
And its programming doesnt allow it to believe it could be wrong. It doesnt and cant know what it doesnt know.
purr-rat beauty
(1,105 posts)...half the SB commercials looked cheap
Plus they were boooooooring
