
highplainsdem

(60,876 posts)
Mon Feb 9, 2026, 09:01 PM 11 hrs ago

A very pro-AI account on both Bluesky and X posted about a "disturbing" Stanford paper on LLMs' failures at reasoning

The account is @godofprompt on both platforms. But they quickly stopped using Bluesky after getting almost no response there, despite having 233,000 followers on X.

Their post on this disturbing (for AI fans) new study is only on X. Since it's a social media post, I'll quote it in full, but break the X link with spaces. The tweet below was posted at 2:40 AM ET this morning.

https:// x. com/ godofprompt/ status/2020764704130650600

God of Prompt
@godofprompt

🚨 Holy shit… Stanford just published the most uncomfortable paper on LLM reasoning I’ve read in a long time.

This isn’t a flashy new model or a leaderboard win. It’s a systematic teardown of how and why large language models keep failing at reasoning even when benchmarks say they’re doing great.

The paper does one very smart thing upfront: it introduces a clean taxonomy instead of more anecdotes. The authors split reasoning into non-embodied and embodied.

Non-embodied reasoning is what most benchmarks test and it’s further divided into informal reasoning (intuition, social judgment, commonsense heuristics) and formal reasoning (logic, math, code, symbolic manipulation).

Embodied reasoning is where models must reason about the physical world, space, causality, and action under real constraints.

Across all three, the same failure patterns keep showing up.

> First are fundamental failures baked into current architectures. Models generate answers that look coherent but collapse under light logical pressure. They shortcut, pattern-match, or hallucinate steps instead of executing a consistent reasoning process.

> Second are application-specific failures. A model that looks strong on math benchmarks can quietly fall apart in scientific reasoning, planning, or multi-step decision making. Performance does not transfer nearly as well as leaderboards imply.

> Third are robustness failures. Tiny changes in wording, ordering, or context can flip an answer entirely. The reasoning wasn’t stable to begin with; it just happened to work for that phrasing.

One of the most disturbing findings is how often models produce unfaithful reasoning. They give the correct final answer while providing explanations that are logically wrong, incomplete, or fabricated.

This is worse than being wrong, because it trains users to trust explanations that don’t correspond to the actual decision process.

Embodied reasoning is where things really fall apart. LLMs systematically fail at physical commonsense, spatial reasoning, and basic physics because they have no grounded experience.

Even in text-only settings, as soon as a task implicitly depends on real-world dynamics, failures become predictable and repeatable.

The authors don’t just criticize. They outline mitigation paths: inference-time scaling, analogical memory, external verification, and evaluations that deliberately inject known failure cases instead of optimizing for leaderboard performance.

But they’re very clear that none of these are silver bullets yet.

The takeaway isn’t that LLMs can’t reason.

It’s more uncomfortable than that.

LLMs reason just enough to sound convincing, but not enough to be reliable.

And unless we start measuring how models fail, not just how often they succeed, we’ll keep deploying systems that pass benchmarks, fail silently in production, and explain themselves with total confidence while doing the wrong thing.

That’s the real warning shot in this paper.

Paper: Large Language Model Reasoning Failures


Link to the paper: https://arxiv.org/abs/2602.06176
11 replies
A very pro-AI account on both Bluesky and X posted about a "disturbing" Stanford paper on LLMs' failures at reasoning (Original Post) highplainsdem 11 hrs ago OP
Kick SheltieLover 11 hrs ago #1
Is it accepted that generative AI reasons? Iris 10 hrs ago #2
Depends on the person EdmondDantes_ 40 min ago #8
The reasoning aspect is key. cachukis 10 hrs ago #3
This message was self-deleted by its author Whiskeytide 2 min ago #10
I like your Spock/Kirk analogy, but then I thought ... Whiskeytide 1 min ago #11
I wonder how this affects ... rog 8 hrs ago #4
The most clueless dogs I've met have better internal models of reality than any AI. hunter 6 hrs ago #5
I wonder how Neuro-sama would do on the test sakabatou 5 hrs ago #6
In a way, this is a computerized version of odins folly 1 hr ago #7
This explains why... purr-rat beauty 25 min ago #9

EdmondDantes_

(1,538 posts)
8. Depends on the person
Tue Feb 10, 2026, 07:39 AM
40 min ago

Some of the more gung-ho AI people do. And it's definitely part of the vernacular around AI. There are lots of similar words used to describe what an AI is doing when an answer is being generated.

cachukis

(3,756 posts)
3. The reasoning aspect is key.
Mon Feb 9, 2026, 09:53 PM
10 hrs ago

My thinking on the fallacy of AI becoming intelligent is its lack of humility.
It is working to become Spock, but without a Kirk rebuttal.
I'm thinking it is going to be a more advanced computer, building on video games for day-to-day human engagement.
It will advance and fail on practical application, but it will get better over time.
I fear the laziness of most brains will succumb to its easy solutions. We are seeing how the entertainment value of social media has taken over credentialed knowledge.
People are now creating their own Chat friends.
I fear the technological advances made for business goals will cause long-term systemic disruptions in social systems that will be difficult to escape.
Can we draw guidelines to handle the dichotomy?
I worry the money runs the show.

Response to cachukis (Reply #3)

Whiskeytide

(4,647 posts)
11. I like your Spock/Kirk analogy, but then I thought ...
Tue Feb 10, 2026, 08:19 AM
1 min ago

… if Spock really had no humanity or humility or compassion, he likely would have simply killed all of the crew on the Enterprise because they kept creating illogical and dramatic plot lines for the show.

rog

(929 posts)
4. I wonder how this affects ...
Mon Feb 9, 2026, 11:24 PM
8 hrs ago

... using an LLM to help summarize, organize, simplify, etc., sources that the user supplies, without asking the model to reason.

hunter

(40,488 posts)
5. The most clueless dogs I've met have better internal models of reality than any AI.
Tue Feb 10, 2026, 02:03 AM
6 hrs ago

I really don't understand how anyone attributes "intelligence" to these automated plagiarism machines.

There are some aspects of this paper that bother me. For example, I think it's absurd to talk about such things as "LLM Reasoning Failures" when there's no reasoning going on at all.

Are we all so conditioned by our education that we think answering questions or writing short essays for an exam is some kind of "reasoning?" It's not.

I'll give an example: Sometimes I meet Evangelical Christian physicians who tell me they don't "believe in" evolution. They might even "believe" that the earth is merely thousands of years old and not billions. They've obviously passed biology exams to become physicians, they've witnessed the troublesome quirks of the human body that can only be explained by evolution, yet they've never applied any of that to their own internal model of reality. There's an empty space where those models ought to exist. (Or possibly they are lying to themselves, which is the worst sort of lie.)

With AI it's all empty space. The words go in and the words come out without anything in between.

Whenever I write, I'm always concerned that I'm letting the language in my head do my thinking for me; that I'm being the meat-based equivalent of an LLM. If I'm doing that, I don't really have anything to say. I want all my writing to represent my own internal models of reality as shaped by my own experiences.

LLMs don't have any experiences.

odins folly

(567 posts)
7. In a way, this is a computerized version of
Tue Feb 10, 2026, 07:01 AM
1 hr ago

The Dunning-Kruger effect. It's a machine. It takes in info, compares that to what it has been fed before, and regurgitates an answer. But it doesn't have the ability to reason and speculate its way to a correct answer, yet it presents whatever it produces with confidence that it is the correct one.

And its programming doesn’t allow it to believe it could be wrong. It doesn’t and can’t know what it doesn’t know.
