Terminal-Bench 2.0 · 89 tasks · Preliminary

Early Results

Early Terminal-Bench results suggest that structured governance rules improve agent coding performance. These are preliminary findings from a single run with known limitations, not established benchmarks. Full methodology and caveats are in the whitepaper.

Preliminary result
67.4%

on Terminal-Bench 2.0, vs 58.0% for vanilla Claude Code: a 9.4-point gap. Governance rules are the intended explanation, but the two runs also differ in model version.

+9.4 pts vs vanilla · 230 lines of rules

Caveat: the governed run used Opus 4.7; the vanilla baseline is published at Opus 4.6. Some or all of the 9.4-point delta may reflect the model upgrade, not governance. A same-model vanilla run is the top priority in our experimental roadmap.

III · I · The Comparison

Where the framework stands,
and where it does not.

The honest read: agents on stronger proprietary models still lead. The interesting preliminary result: governance rules appear to improve performance within the same model family, but the headline comparison crosses model versions (Opus 4.7 vs 4.6). The cleanest same-model test is a 10-task subset in which governance rules scored 80% vs 40% for ad-hoc rules, but 10 tasks is too few to support a firm conclusion. Scaling this comparison is our top experimental priority.
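To make "underpowered" concrete, here is a minimal sketch that treats the 10-task subset as 8 of 10 passes vs 4 of 10 and runs a two-sided Fisher exact test. The choice of test and the function name `fisher_exact_two_sided` are assumptions for illustration, not part of the published methodology. The test also treats the two arms as independent samples, which is a simplification because both arms ran the same tasks.

```python
from math import comb


def fisher_exact_two_sided(a_pass, a_total, b_pass, b_total):
    """Two-sided Fisher exact p-value for two pass/fail arms (illustrative helper)."""
    total_pass = a_pass + b_pass
    denom = comb(a_total + b_total, total_pass)

    def prob(k):
        # Hypergeometric probability that arm A has exactly k passes,
        # given the fixed total number of passes across both arms.
        return comb(a_total, k) * comb(b_total, total_pass - k) / denom

    lowest = max(0, total_pass - b_total)
    highest = min(a_total, total_pass)
    observed = prob(a_pass)
    # Sum every outcome that is at least as unlikely as the observed one.
    return sum(
        prob(k)
        for k in range(lowest, highest + 1)
        if prob(k) <= observed * (1 + 1e-9)
    )


# Assumption: 80% and 40% of 10 tasks correspond to 8 and 4 passes.
p_value = fisher_exact_two_sided(8, 10, 4, 10)
print(f"p = {p_value:.3f}")  # prints p = 0.170
```

Under these assumptions a 40-point gap on 10 tasks does not reach the conventional 0.05 threshold, which is why the subset needs to be scaled before drawing conclusions from it.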

Rank  Agent                     Model            Score  Notes
01    Codex CLI                 GPT-5.5          82.0%  Proprietary frontier
02    ForgeCode                 GPT-5.4          81.8%  Proprietary frontier
03    TongAgents                Gemini 3.1 Pro   80.2%  Proprietary frontier
04    Covenant Agent            Claude Opus 4.7  67.4%  Open framework
05    Claude Code (vanilla)     Claude Opus 4.6  58.0%  No governance layer
--    Ad-hoc prompted baseline  Claude Opus 4.7  42.0%  Same model, no Canon

Scores from Terminal-Bench leaderboard, May 2026. Full 89-task run, no retry, single attempt.
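As a rough guide to how much a single 89-task run can pin down, the sketch below computes a 95% Wilson score interval for the 67.4% result. It assumes 67.4% corresponds to 60 of 89 tasks and that tasks behave like independent pass/fail trials; both are assumptions for illustration, and the helper name `wilson_interval` is hypothetical.

```python
from math import sqrt


def wilson_interval(passes, total, z=1.96):
    """95% Wilson score interval for a pass rate (z=1.96 by default)."""
    p = passes / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half_width = z * sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half_width, center + half_width


TASKS = 89
# Assumption: the reported 67.4% is 60 passes out of 89 tasks (60 / 89 = 0.674...).
passes = round(0.674 * TASKS)
low, high = wilson_interval(passes, TASKS)
print(f"{passes}/{TASKS} passes, 95% interval: {low:.1%} to {high:.1%}")
```

Under these assumptions the interval runs from roughly 57% to 76%, wide enough to include the 58.0% vanilla score. That is consistent with treating the single-run figures as preliminary.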

III · II · The Method

How the run was conducted.

01

Identical task set

All 89 Terminal-Bench 2.0 tasks were run unmodified, in their original order, against each configuration. There was no task selection and no retries beyond what the agent's own retry policy permits.

02

Three configurations on the same model

Claude Opus 4.7 with: (a) ad-hoc prompted baseline, (b) Covenant Canon, (c) Canon plus full agent registry. The headline reports (b) because it isolates the contribution of the rules themselves.

03

Adversarial review

A 20-task subset was replicated on independent infrastructure. Variance: ±1.2 points. The full report is published in the methodology appendix.

04

What did not change

No fine-tuning. No extra tools. No tricks. The rules are the only intended variable. However, the model-version confound (Opus 4.7 for the governed run vs 4.6 for the vanilla baseline) has not yet been resolved, so the observed improvement cannot yet be attributed to the rules alone.

III · III · The Six Rules

The six rules
under test.

These are the 230 lines of governance rules used in the benchmark run, summarized as six rules. Early testing suggests rules I, III, and IV carry most of the improvement, though this has not been isolated at full scale.

I.
Genesis

Before coding, list the directory and read key files. Understand what exists.

II.
Plan First

State your approach in one or two sentences before executing.

III.
Iterate, Don't Repeat

If a command fails, diagnose. Never run the same failing command twice.

IV.
Verify Before Done

After implementing, test. Run it. Check the output.

V.
Time Is Limited

Work efficiently. Don't read files you don't need.

VI.
When Stuck

If three attempts fail, step back and reconsider the whole approach.
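Rules III and VI are stated as prompt instructions, but the behavior they describe can also be sketched as a small bookkeeping helper. The example below is a minimal illustration under stated assumptions: the class `AttemptGuard` and its methods are hypothetical and are not part of the Covenant framework or of the benchmark harness.

```python
from collections import Counter


class AttemptGuard:
    """Hypothetical helper that tracks failed commands for a single task."""

    def __init__(self, max_failures=3):
        self.max_failures = max_failures
        self.failed = Counter()

    def may_run(self, command):
        # Rule III: refuse an exact repeat of a command that already failed.
        return self.failed[command.strip()] == 0

    def record_failure(self, command):
        self.failed[command.strip()] += 1

    def should_reconsider(self):
        # Rule VI: once three attempts have failed, step back and rethink.
        return sum(self.failed.values()) >= self.max_failures


guard = AttemptGuard()
guard.record_failure("make test")
print(guard.may_run("make test"))            # prints False
print(guard.may_run("make test VERBOSE=1"))  # prints True
guard.record_failure("make test VERBOSE=1")
guard.record_failure("pytest -x")
print(guard.should_reconsider())             # prints True
```

Matching on the exact command string is the simplest reading of "the same failing command". A real harness might normalize arguments or compare intent instead, which this sketch does not attempt.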