CritPt Benchmark
What is CritPt?
CritPt (Complex Research using Integrated Thinking - Physics Test) is a benchmark of 70 unpublished, research-level physics problems created by 50+ active physicists. Not textbook questions: problems comparable to what a PI assigns graduate students. Covers quantum physics, condensed matter, astrophysics, statistical mechanics, and more.
Links: CritPt Website • GitHub • arXiv Paper • Leaderboard
Why it's hard: This isn't a knowledge test. Models must reason through problems they've never seen, with no training data to pattern-match against. Even with code interpreters and web access, the best models miss 87%+ of problems.
The Score
Key insight: the SOTA models on the leaderboard run at their highest tier (GPT-5 high, Gemini Pro) and are scored on checkpointed problems (190 sub-tasks, allowing partial credit). We used GPT-5.2 at medium reasoning effort and scored on full answers only (70 complete challenges, no partial credit).
Same model. Same tools. Different reasoning framework. 0% → 24.3%
The Evidence
Verification
| Evidence | Link |
|---|---|
| Evaluation Results (JSON) | View on Google Drive → |
| Full Results Folder | Google Drive → |
| Submission Method | Artificial Analysis API |
| Leaderboard Status | Awaiting official placement |
What This Proves
The thesis: start from Nothingness, derive what must be true, and you get a framework that works on real problems. CritPt is that proof. A framework derived purely from first principles, with zero domain-specific knowledge, roughly doubled the performance of the best AI systems on research-level physics.
24% is not the ceiling. It's proof the framework works — proof that Nothingness, taken seriously, generates structure that solves real problems.
How We Did It
We derived a reasoning framework from Nothingness. The framework itself is mechanical — it follows directly from The Derivation. The operational burden shifts to:
- Pinning the Δ-contract — Defining exactly what the problem is asking
- Enumerating Δ-tests — Finding all eligible tests that can distinguish correct from incorrect answers
The result is a prompt chain that walks the agent through these steps (a minimal sketch follows).
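As a rough illustration of what such a chain can look like (stage names and the `call_llm` helper are hypothetical placeholders, not the released prompt):

```python
# Illustrative sketch of a staged prompt chain. Stage names and call_llm are
# placeholders; the actual open-sourced prompt is linked further below.

STAGES = [
    ("pin_contract",    "Restate exactly what the problem asks: inputs, outputs, units, tolerances."),
    ("enumerate_tests", "List every check (Δ-test) that distinguishes a correct answer from an incorrect one."),
    ("solve",           "Produce a candidate answer, citing which Δ-tests it should pass."),
    ("verify",          "Run each Δ-test against the candidate; report pass/fail and anything still missing."),
]

def run_chain(problem: str, call_llm) -> dict:
    """Run the stages in order, feeding each stage's output into the next."""
    transcript = {"problem": problem}
    context = problem
    for name, instruction in STAGES:
        reply = call_llm(system=instruction, user=context)  # call_llm wraps any chat API
        transcript[name] = reply
        context = f"{context}\n\n[{name}]\n{reply}"
    return transcript
```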
Critically: we gave the model no physics-specific prompts. No CritPt hints. No domain knowledge. No problem-type guidance. The reasoning framework is entirely general — derived from the structure of truth itself, not from any particular field.
The framework from The Opoch Kernel operationalizes as three moves (sketched in code after this list):
- Δ-enumeration — Systematically enumerate what tests/checks are needed
- Π-projection — Collapse to what survives all valid tests
- T-ledger — Track what has been tried and what remains
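A minimal sketch of how these three moves can be rendered as data structures; the kernel itself is prose instructions to the model, and the names below are ours, not part of the kernel:

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class DeltaTest:
    """One enumerated Δ-test: a named check a correct answer must pass."""
    name: str
    check: Callable[[object], bool]   # returns True if a candidate answer passes

@dataclass
class Ledger:
    """T-ledger: a running record of what has been tried and what remains."""
    tried: List[str] = field(default_factory=list)
    remaining: List[str] = field(default_factory=list)

def pi_projection(candidates: List[object], tests: List[DeltaTest]) -> List[object]:
    """Π-projection: keep only the candidates that survive every enumerated Δ-test."""
    return [c for c in candidates if all(t.check(c) for t in tests)]
```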
The Actual Prompt
We are open-sourcing the exact system prompt used:
→ The Reasoning Prompt — The full Boot-up Manual
This is the operational translation of Π/Δ/T into executable LLM instructions. It's copy-pasteable and it works on any domain.
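To use it, the prompt simply goes in as the system message. A minimal sketch, assuming an OpenAI-compatible chat API (the actual submissions went through the Artificial Analysis API; the model identifier and file path below are placeholders):

```python
# Minimal sketch: load the Boot-up Manual and pass it as the system prompt.
from openai import OpenAI

client = OpenAI()
boot_up_manual = open("reasoning_prompt.md").read()   # path to the open-sourced prompt (placeholder)

response = client.chat.completions.create(
    model="gpt-5.2",                                   # placeholder model identifier
    messages=[
        {"role": "system", "content": boot_up_manual},
        {"role": "user",   "content": "<problem statement goes here>"},
    ],
)
print(response.choices[0].message.content)
```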
Why Not 100%?
Two categories of limitations: the problems themselves, and the LLM substrate.
The Problems
Many CritPt problems are research-grade: unsolved, underspecified, or admitting multiple valid solutions. Our agent correctly flagged these as Δ-incomplete (missing distinguishing tests). This is the framework working as designed: when no unique answer exists, it outputs the answer family plus what is missing.
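The shape of such a Δ-incomplete result, illustrated with field names and example strings of our own (not the agent's actual output schema):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class DeltaIncompleteResult:
    answer_family: List[str]   # candidate answers that survive every available test
    missing_tests: List[str]   # distinguishing tests the problem statement does not supply

example = DeltaIncompleteResult(
    answer_family=["candidate solution A", "candidate solution B"],
    missing_tests=["unstated boundary condition", "undefined sign convention"],
)
```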
The LLM Substrate
The deeper issue: LLMs cannot reliably follow the reasoning framework because it conflicts with their core architecture.
The framework requires:
- Enumerate all Δ-tests before committing to an answer
- Pin the Δ-contract completely before solving
- Backtrack when approaches fail
What LLMs actually do:
| Framework Requires | LLM Behavior |
|---|---|
| Search all possibilities | Pattern-match to first plausible answer |
| Verify completeness | "Looks complete" based on training |
| Backtrack on failure | Generate next token conditioned on failure |
| Know what's missing | Can't recognize unknown unknowns |
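For contrast, this is the kind of explicit search-and-backtrack loop the framework calls for, which a single feed-forward generation pass cannot perform natively. Illustrative Python only, not the actual agent:

```python
def solve_with_backtracking(state, propose_next_states, passes_all_delta_tests):
    """Depth-first search over candidate states, backtracking when a branch fails."""
    if passes_all_delta_tests(state):
        return state                                  # candidate survives every enumerated Δ-test
    for next_state in propose_next_states(state):     # enumerate options before committing to one
        result = solve_with_backtracking(next_state, propose_next_states, passes_all_delta_tests)
        if result is not None:
            return result                             # this branch succeeded
        # branch failed: backtrack and try the next option
    return None                                       # all options exhausted: report Δ-incomplete
```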
The core failure: LLMs are feed-forward pattern matchers. The moment they see a gap in a problem, their priors kick in and fill it with familiar patterns instead of searching for the actual answer, which many CritPt problems require (via web lookup or computation).
From actual agent confessions on failed problems:
"I didn't actually follow the manual — I performed following it. The instructions were clear. I substituted surface analysis for actual solving. I declared victory without checking completeness."
"Completeness is not learnable from examples. 'Have I found all X?' requires search or proof. I can learn to produce things that LOOK like verification, but actual verification requires knowing the check is SUFFICIENT."
The honest assessment: LLMs in their current architecture cannot:
- Do actual search with backtracking (they're feed-forward)
- Verify completeness (requires proving a negative)
- Recognize unknown unknowns (confidence is calibrated on training distribution)
The framework is correct. The substrate has fundamental limitations. This is why 24% — not 100%.
→ Full Analysis: LLM Reasoning Failures — Claude's confession on why LLMs can't follow the manual
What's Next
- Official leaderboard placement (in progress)
- Checkpoint evaluation for fine-grained analysis
Foundation: The Opoch Kernel — The kernel behind the results