CritPt Benchmark

What is CritPt?

CritPt (Complex Research using Integrated Thinking - Physics Test) is a benchmark of 70 unpublished research-level physics problems created by 50+ active physicists. These are not textbook questions; they are problems comparable to what a PI assigns graduate students. Coverage spans quantum physics, condensed matter, astrophysics, statistical mechanics, and more.

Links: CritPt Website · GitHub · arXiv Paper · Leaderboard

Why it's hard: This isn't a knowledge test. Models must reason through problems they've never seen, with no training data to pattern-match against. Even with code interpreters and web access, the best models miss 87%+ of problems.


The Score

| Model | Score | Setup |
|---|---|---|
| OPOCH ★ | 24.3% | GPT-5.2 medium + Code + Web • Full answers only (no checkpoints) |
| GPT-5.1 (high) | 12.6% | Code + Web • With checkpoints |
| GPT-5 (high) | 10.0% | Code interpreter • With checkpoints |
| Gemini 3 Pro | 9.1% | Code + Web • With checkpoints |
| GPT-5 (high) | 5.7% | Base model • With checkpoints |
| GPT-5.2 (med) | 0% | Same setup as Opoch, WITHOUT reasoning framework |

Key insight: All SOTA models used their highest tier (GPT-5 high, Gemini Pro) and were scored on checkpointed problems (190 sub-tasks, with partial credit). We used GPT-5.2 medium and were scored on full answers only (70 complete challenges, no partial credit).


The Evidence

Verification

| Evidence | Link |
|---|---|
| Evaluation Results (JSON) | View on Google Drive → |
| Full Results Folder | Google Drive → |
| Submission Method | Artificial Analysis API |
| Leaderboard Status | Awaiting official placement |

What This Proves

The thesis: start from Nothingness, derive what must be true, and you get a framework that works on real problems. CritPt is that proof. A framework derived purely from first principles — with zero domain-specific knowledge — doubled the performance of the best AI systems on research-level physics.


How We Did It

We derived a reasoning framework from Nothingness. The framework itself is mechanical — it follows directly from The Derivation. The operational burden shifts to:

  1. Pinning the Δ-contract — Defining exactly what the problem is asking
  2. Enumerating Δ-tests — Finding all eligible tests that can distinguish correct from incorrect answers

The result is a prompt chain that walks the agent through this reasoning step by step.
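As an illustration only, the two operational steps above can be pictured as explicit data structures. `DeltaContract`, `DeltaTest`, and `pin_contract_then_enumerate` are hypothetical names of ours, not part of the Opoch prompt:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class DeltaContract:
    """Pin exactly what the problem is asking, before any solving."""
    quantity: str      # what must be produced, e.g. "ground-state energy"
    units: str         # required units of the final answer
    regime: str        # assumptions or limits the answer must hold in
    output_form: str   # e.g. "closed form" or "number to 3 significant figures"

@dataclass
class DeltaTest:
    """One eligible check that can distinguish a correct answer from an incorrect one."""
    name: str
    check: Callable[[dict], bool]

def pin_contract_then_enumerate(contract: DeltaContract,
                                tests: list[DeltaTest]) -> list[str]:
    """Enforce the ordering: no solving until the contract is pinned
    and the distinguishing tests are enumerated."""
    assert contract.quantity and contract.units, "Δ-contract not fully pinned"
    return [t.name for t in tests]
```

The point of the sketch is the ordering, not the fields: solving is blocked until the contract is explicit and the test list exists.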

Critically: we gave the model no physics-specific prompts. No CritPt hints. No domain knowledge. No problem-type guidance. The reasoning framework is entirely general — derived from the structure of truth itself, not from any particular field.

The framework from The Opoch Kernel operationalizes as:

  1. Δ-enumeration — Systematically enumerate what tests/checks are needed
  2. Π-projection — Collapse to what survives all valid tests
  3. T-ledger — Track what has been tried and what remains
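A minimal sketch of how these three operations could compose; all function names and the example tests here are our own illustrative assumptions, not the Opoch implementation:

```python
def delta_enumerate(problem):
    """Δ-enumeration: list every check a candidate answer must pass.
    Illustrative tests: dimensional consistency, a limiting case, a spot check."""
    return [
        ("dimensions", lambda ans: ans["units"] == problem["units"]),
        ("limit_case", lambda ans: ans["limit_ok"]),
        ("numeric",    lambda ans: abs(ans["value"] - problem["reference"]) < 1e-6),
    ]

def pi_project(candidates, tests):
    """Π-projection: collapse to the candidates that survive all valid tests."""
    return [c for c in candidates if all(check(c) for _, check in tests)]

def t_ledger(tests, tried):
    """T-ledger: track which tests have been run and which remain."""
    names = {name for name, _ in tests}
    return {"tried": sorted(tried), "remaining": sorted(names - tried)}
```

Under this reading, an answer is only emitted once the ledger's "remaining" list is empty and projection leaves a unique survivor.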

The Actual Prompt

We are open-sourcing the exact system prompt used:

The Reasoning Prompt — The full Boot-up Manual

This is the operational translation of Π/Δ/T into executable LLM instructions. It's copy-pasteable and it works on any domain.


Why Not 100%?

Two categories of limitations: the problems themselves, and the LLM substrate.

The Problems

Many CritPt problems are research-grade: unsolved, underspecified, or admitting multiple valid solutions. Our agent correctly flagged these as Δ-incomplete (missing distinguishing tests). This is the framework working as designed: when no unique answer exists, it outputs the answer family plus what is missing.

The LLM Substrate

The deeper issue: LLMs cannot reliably follow the reasoning framework because it conflicts with their core architecture.

The framework requires:

  • Enumerate all Δ-tests before committing to an answer
  • Pin the Δ-contract completely before solving
  • Backtrack when approaches fail

What LLMs actually do:

| Framework Requires | LLM Behavior |
|---|---|
| Search all possibilities | Pattern-match to first plausible answer |
| Verify completeness | "Looks complete" based on training |
| Backtrack on failure | Generate next token conditioned on failure |
| Know what's missing | Can't recognize unknown unknowns |
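The contrast in the table can be made concrete with a toy sketch (ours, not the Opoch implementation): a greedy pass that commits to the first plausible candidate versus a depth-first search that verifies and backtracks:

```python
def greedy_first_plausible(candidates, looks_plausible):
    """The right-hand column: commit to the first plausible answer, never revisit."""
    for c in candidates:
        if looks_plausible(c):
            return c            # no verification, no backtracking
    return None

def search_with_backtracking(partial, extensions, passes_all_tests, is_complete):
    """The left-hand column: explore, verify against all tests, backtrack on failure."""
    if is_complete(partial):
        return partial if passes_all_tests(partial) else None
    for step in extensions(partial):
        result = search_with_backtracking(partial + [step], extensions,
                                          passes_all_tests, is_complete)
        if result is not None:  # a branch that survives every test
            return result
    return None                 # dead end: the caller backtracks
```

A feed-forward generator behaves like the first function; the framework demands the second, which requires revisiting earlier commitments.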

The core failure: LLMs are feed-forward pattern matchers. The moment they see a gap in a problem, their priors kick in and fill it with patterns — instead of searching for the actual answer (which many CritPt problems require via web lookup or computation).

From actual agent confessions on failed problems:

"I didn't actually follow the manual — I performed following it. The instructions were clear. I substituted surface analysis for actual solving. I declared victory without checking completeness."

"Completeness is not learnable from examples. 'Have I found all X?' requires search or proof. I can learn to produce things that LOOK like verification, but actual verification requires knowing the check is SUFFICIENT."

The honest assessment: LLMs in their current architecture cannot:

  • Do actual search with backtracking (they're feed-forward)
  • Verify completeness (requires proving a negative)
  • Recognize unknown unknowns (confidence is calibrated on training distribution)

The framework is correct. The substrate has fundamental limitations. This is why the score is 24.3%, not 100%.

Full Analysis: LLM Reasoning Failures — Claude's confession on why LLMs can't follow the manual


What's Next

  • Official leaderboard placement (in progress)
  • Checkpoint evaluation for fine-grained analysis

Foundation: The Opoch Kernel — The kernel behind the results