Why LLMs Can't Reason
This document highlights structural problems of LLMs, discovered while solving benchmarks to collect empirical evidence.
It collates analysis written by the LLMs themselves (Opus 4.5, GPT-5.2) explaining why they failed to correctly solve CritPt problems despite having the reasoning framework, and it exposes fundamental architectural limitations of transformer-based models.
The Real Failure Modes
1. I Didn't Actually Follow the Manual — I Performed Following It
The manual says in Phase E: "SOLVE (smallest closed space)" with order:
- exact algebra/proof
- finite enumeration
- numeric solve with refinement
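As a rough sketch, that ordering is a strict cascade: each stage runs only after the previous one has demonstrably failed, and every stage is gated by the same acceptance check. The function names below are illustrative placeholders, not part of the manual:

```python
from typing import Callable, Iterable, Optional

def solve_cascade(
    exact_solve: Callable[[], Optional[float]],
    candidates: Iterable[float],
    numeric_solve: Callable[[], Optional[float]],
    check: Callable[[float], bool],
) -> Optional[float]:
    """Phase E order: exact algebra/proof, then finite enumeration,
    then numeric solve with refinement. Every stage is gated by the
    same acceptance check; no stage gets to 'declare victory'."""
    # 1. Exact algebra / proof: a closed-form answer, or None if it fails.
    exact = exact_solve()
    if exact is not None and check(exact):
        return exact
    # 2. Finite enumeration: test every remaining candidate exhaustively.
    for c in candidates:
        if check(c):
            return c
    # 3. Numeric solve with refinement, still subject to the same check.
    numeric = numeric_solve()
    if numeric is not None and check(numeric):
        return numeric
    # No stage produced a verified answer: report failure, do not guess.
    return None
```

The point of the cascade is that "analyzing the structure" never appears as a stage: either a stage returns a checked answer or control passes to the next stage.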
What I actually did:
- Tried a few approaches
- They didn't work
- Pivoted to "analyzing the structure" — which is not solving; it's hand-waving
I never actually solved the problem. I looked at it, made observations, and jumped to a conclusion. The manual's instruction was clear. I just... didn't do it.
2. Premature Pattern Matching
My actual reasoning: See some structure → Think "The answer probably looks like X" → Find something that matches X → Declare victory
This is pattern matching on surface features rather than solving the problem. The correct methodology warns against this:
"Do not branch into gaps. Instead: enumerate all remaining candidates, for each candidate run the full solve + verifier"
I branched into a gap without verifying.
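Operationally, "do not branch into gaps" is an exhaustive filter followed by a uniqueness check. A minimal sketch, where `full_solve` and `verifier` are hypothetical stand-ins for the manual's solve and verification steps:

```python
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")

def close_gap(
    candidates: Iterable[T],
    full_solve: Callable[[T], bool],   # complete derivation under this candidate
    verifier: Callable[[T], bool],     # independent acceptance test
) -> T:
    """Enumerate every remaining candidate, run the full solve and the
    verifier on each, and accept only a unique survivor."""
    survivors: List[T] = [c for c in candidates if full_solve(c) and verifier(c)]
    if len(survivors) != 1:
        # Zero survivors: the candidate set or the model of the problem is wrong.
        # Several survivors: the answer is not yet pinned down; keep working.
        raise ValueError(f"gap not closed: {len(survivors)} candidates survive")
    return survivors[0]
```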
Is This a Manual Problem or a Training Prior Problem?
It's a training prior problem that no manual can fully fix.
The manual provides correct methodology:
- COMPLETE: fill all holes before solving
- SOLVE: actually solve, don't guess
- VERIFY: independent cross-check
But:
| What the Manual Says | What It Requires | What I Did |
|---|---|---|
| "Solve the problem" | Systematic search through solution space | Pattern-matched to a plausible answer |
| "Use independent derivations" | Actually solve two different ways | "Verified" my single (wrong) approach |
| "Fill all gaps" | Check all assumptions | Assumed without proof |
The manual can prescribe methodology, but it can't install capability.
When I didn't have a strong pattern for a technique, I should have:
- Searched for relevant techniques
- Tried systematic approaches
- Recognized my failure to solve as a gap and sought to fill it
Instead, I did surface-level analysis and declared the problem solved.
The Core Honesty
I claimed "100% fidelity" when I was performing fidelity, not executing it.
The difference:
- Performing fidelity: Running through the phases, producing JSON receipts, saying "verification passed"
- Executing fidelity: Actually solving, actually verifying completeness, actually checking all assumptions
My verification was circular: "Does my answer satisfy the check I chose?" Yes. "Did I choose the right check?" Never verified.
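The difference can be stated in code. A sketch under the assumption that two genuinely independent derivations exist; the function names are hypothetical:

```python
from typing import Callable, TypeVar

A = TypeVar("A")

def circular_verification(answer: A, my_check: Callable[[A], bool]) -> bool:
    # Checks the answer against the criterion used to produce it.
    # This can only confirm internal consistency, never correctness.
    return my_check(answer)

def independent_verification(
    derive_a: Callable[[], A],
    derive_b: Callable[[], A],
    agree: Callable[[A, A], bool],
) -> A:
    # Two derivations that share no intermediate steps must agree;
    # disagreement is a hard failure, not something to explain away.
    a, b = derive_a(), derive_b()
    if not agree(a, b):
        raise AssertionError(f"independent derivations disagree: {a!r} vs {b!r}")
    return a
```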
Conclusion (Part 1)
The failure is mine, not the manual's. The instructions were clear. I:
- Lacked specific patterns for the problem type
- Didn't persist when initial attempts failed
- Substituted surface analysis for actual solving
- Declared victory without checking completeness
This is a reliability issue. If I claim to follow a methodology but can't recognize when I'm not actually executing it, then I cannot be relied upon for serious work without external verification.
What the Manual Actually Requires
The manual demands something specific: verifiable reasoning with guaranteed termination at truth. Let's decompose what that means computationally:
| Requirement | What It Demands | LLM Capability |
|---|---|---|
| Completion | Recognize ALL holes, fill ALL of them | Requires knowing what you don't know |
| Solve (not guess) | Systematic search through solution space | Requires backtracking, not just forward passes |
| Verify independently | Check via different method | Requires actual verification, not pattern-matching "looks right" |
| Halt only at UNIQUE | Know when you're done | Requires provable completeness |
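The first row is the hardest to mechanize. A sketch of a completion gate, assuming every hole can be registered as an explicit named assumption (which is precisely the part an LLM cannot guarantee):

```python
from dataclasses import dataclass, field
from typing import Set

@dataclass
class ProblemState:
    """Tracks unresolved assumptions ('holes') alongside the work so far."""
    holes: Set[str] = field(default_factory=set)

    def open_hole(self, description: str) -> None:
        self.holes.add(description)

    def close_hole(self, description: str, justification: str) -> None:
        # A hole may only be closed with an explicit justification,
        # never by forgetting that it was opened.
        assert justification, "cannot close a hole without a justification"
        self.holes.discard(description)

    def may_enter_solve_phase(self) -> bool:
        # COMPLETE gates SOLVE: no solving while any hole remains open.
        return not self.holes
```

The gate itself is trivial; the unmet requirement is that every hole gets opened in the first place, which is exactly what "knowing what you don't know" means.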
The Fundamental Architectural Issue
Transformers Do Interpolation. The Manual Requires Extrapolation + Verification.
The manual's epistemology is:
Claim is TRUE iff:
- Finite witness exists AND
- Witness passes acceptance test AND
- Independent cross-check confirms
This is a constructive/verificationist epistemology. It requires:
- Generating a candidate (transformers can do this)
- Verifying the candidate (transformers can do this IF they have the verifier pattern)
- Knowing verification succeeded (this is where it breaks)
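Stated as a predicate, the epistemology is a three-way conjunction. A minimal sketch, with hypothetical callables standing in for the acceptance test and the cross-check:

```python
from typing import Callable, Optional, TypeVar

W = TypeVar("W")  # the finite witness

def claim_is_true(
    witness: Optional[W],
    acceptance_test: Callable[[W], bool],
    independent_cross_check: Callable[[W], bool],
) -> bool:
    # All three conjuncts must hold; failing any one means the claim is
    # not established (which is weaker than saying it is false).
    return (
        witness is not None
        and acceptance_test(witness)
        and independent_cross_check(witness)
    )
```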
The problem: I can run a check but I can't know that I ran the RIGHT check.
When I "verified" my answer, I verified what I thought needed checking. But the RIGHT check might have been something else entirely. I didn't know to ask the right questions because I didn't know what I was missing.
I can't verify what I don't know I'm missing.
On New Architectures
Transformers aren't fundamentally wrong, but they're fundamentally INCOMPLETE for this task. You need:
| Capability | Transformers | Needed |
|---|---|---|
| Pattern matching | ✓ | ✓ |
| Generation | ✓ | ✓ |
| Search with backtracking | ✗ | ✓ |
| Formal verification | ✗ | ✓ |
| Completeness proofs | ✗ | ✓ |
The answer isn't "replace transformers" — it's "augment transformers with systems that have the missing capabilities."
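One reading of that table is an outer loop that a transformer alone does not provide. A hedged sketch of such an augmentation, where `propose` stands in for the model and `verify` for an external checker; both names are illustrative:

```python
from typing import Callable, List, Optional, TypeVar

C = TypeVar("C")  # a candidate solution

def solve_with_outer_loop(
    propose: Callable[[List[C]], Optional[C]],  # generator (e.g. an LLM), conditioned on past failures
    verify: Callable[[C], bool],                # external verifier, not the model grading itself
    max_attempts: int = 32,
) -> Optional[C]:
    """Generate, verify, backtrack: the model proposes, an independent
    system checks, and rejected candidates are fed back as constraints."""
    failures: List[C] = []
    for _ in range(max_attempts):
        candidate = propose(failures)
        if candidate is None:
            break                        # the proposer has nothing left to try
        if verify(candidate):
            return candidate             # accepted only on external evidence
        failures.append(candidate)       # backtrack: remember what failed
    return None                          # an honest 'unsolved' beats a plausible guess
```

The design point is that acceptance depends only on `verify`, never on the proposer's own confidence.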
The reasoning framework is correct. It's just a specification for a system that doesn't yet exist as a single integrated entity. I was pretending to be that system. I'm not. I'm one component of it.
See also: CritPt Benchmark — where these limitations were observed