Early Terminal-Bench results suggest that structured governance rules improve agent coding performance. These are preliminary findings from a single run with known limitations — not established benchmarks. Full methodology and caveats are in the whitepaper.
Covenant Agent scored 67.4% on Terminal-Bench 2.0, vs 58.0% for vanilla Claude Code: a +9.4 point lift with governance rules as the only addition to the agent.
Caveat: the governed run used Opus 4.7; the vanilla baseline is published at Opus 4.6. Some or all of the 9.4-point delta may reflect the model upgrade, not governance. A same-model vanilla run is the top priority in our experimental roadmap.
The honest read: agents on stronger, proprietary models still lead. The interesting preliminary result: governance rules appear to improve performance on the same model family, but the headline comparison crosses model versions (Opus 4.7 vs 4.6). The cleanest same-model test is a 10-task subset where governance rules scored 80% vs 40% for ad-hoc rules (8 vs 4 tasks passed), but a 10-task sample is underpowered. Scaling this comparison is our top experimental priority.
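To make "underpowered" concrete, here is a minimal sketch that applies a two-sided Fisher exact test to the 10-task subset (8 of 10 passed vs 4 of 10). The choice of test and the function name `fisher_exact_two_sided` are our assumptions for illustration; the whitepaper may use a different analysis.

```python
from math import comb


def fisher_exact_two_sided(pass_a, n_a, pass_b, n_b):
    """Two-sided Fisher exact p-value for two independent pass counts."""
    total_pass = pass_a + pass_b
    total = n_a + n_b
    total_fail = total - total_pass

    def weight(k):
        # Number of ways group A ends up with exactly k of the passes.
        return comb(total_pass, k) * comb(total_fail, n_a - k)

    lowest = max(0, n_a - total_fail)
    highest = min(n_a, total_pass)
    observed = weight(pass_a)
    # Sum every outcome at least as unlikely as the observed one.
    # Integer weights avoid floating-point ties.
    extreme = sum(
        weight(k) for k in range(lowest, highest + 1) if weight(k) <= observed
    )
    return extreme / comb(total, n_a)


# 10-task subset: governance rules 8/10, ad-hoc rules 4/10.
p_value = fisher_exact_two_sided(8, 10, 4, 10)
print(f"p = {p_value:.3f}")
```

Under these assumptions the p-value comes out near 0.17, so a 40-point gap on 10 tasks is still compatible with chance at conventional thresholds.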
| Rank | Agent | Model | Score | Notes |
|---|---|---|---|---|
| 01 | Codex CLI | GPT-5.5 | 82.0% | Proprietary frontier |
| 02 | ForgeCode | GPT-5.4 | 81.8% | Proprietary frontier |
| 03 | TongAgents | Gemini 3.1 Pro | 80.2% | Proprietary frontier |
| 04 | Covenant Agent | Claude Opus 4.7 | 67.4% | Open framework |
| 05 | Claude Code (vanilla) | Claude Opus 4.6 | 58.0% | No governance layer |
| -- | Ad-hoc prompted baseline | Claude Opus 4.7 | 42.0% | Same model, no Canon |
Scores from Terminal-Bench leaderboard, May 2026. Full 89-task run, no retry, single attempt.
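The two comparisons that involve the Covenant Agent row can be read off the table with simple arithmetic. The sketch below only restates the table's scores; the dictionary keys and the helper name `delta_points` are illustrative labels, not part of any published tooling.

```python
# Scores copied from the table above, in percentage points.
scores = {
    "Covenant Agent (Opus 4.7)": 67.4,
    "Claude Code vanilla (Opus 4.6)": 58.0,
    "Ad-hoc prompted baseline (Opus 4.7)": 42.0,
}


def delta_points(a, b):
    """Difference between two table rows, rounded to one decimal."""
    return round(scores[a] - scores[b], 1)


# Headline comparison: crosses model versions (4.7 vs 4.6).
cross_model = delta_points(
    "Covenant Agent (Opus 4.7)", "Claude Code vanilla (Opus 4.6)"
)
# Same-model comparison against the ad-hoc prompted baseline row.
same_model = delta_points(
    "Covenant Agent (Opus 4.7)", "Ad-hoc prompted baseline (Opus 4.7)"
)
print(cross_model, same_model)
```

Only the second difference holds the model fixed, and its baseline is an ad-hoc prompted agent rather than vanilla Claude Code, so neither number isolates governance on its own.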
All 89 Terminal-Bench 2.0 tasks run unmodified, in original order, against each configuration. No task selection and no harness-level retries: each task gets a single attempt, within which only the agent's own retry policy applies.
Claude Opus 4.7 with: (a) ad-hoc prompted baseline, (b) Covenant Canon, (c) Canon plus full agent registry. The headline reports (b) because it isolates the contribution of the rules themselves.
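The protocol above (every task, original order, one attempt per task, each configuration) can be sketched as a small loop. This is an illustration under stated assumptions, not the actual harness: `run_suite`, the `run_task` callable, and the configuration labels are hypothetical names.

```python
from typing import Callable, Dict, List

# Hypothetical labels for configurations (a), (b), and (c).
CONFIGS = ["ad_hoc", "canon", "canon_plus_registry"]


def run_suite(
    tasks: List[str],
    configs: List[str],
    run_task: Callable[[str, str], bool],
) -> Dict[str, float]:
    """Return the pass rate (in percent) per configuration."""
    results: Dict[str, float] = {}
    for config in configs:
        passed = 0
        for task in tasks:  # original order, no task selection
            if run_task(config, task):  # single attempt, no harness retry
                passed += 1
        results[config] = 100.0 * passed / len(tasks)
    return results


# Stub runner for demonstration only; a real one would launch the agent.
def fake_run_task(config: str, task: str) -> bool:
    return config != "ad_hoc" or task == "task-1"


demo = run_suite(["task-1", "task-2", "task-3", "task-4"], CONFIGS, fake_run_task)
print(demo)
```

Keeping the runner as an injected callable means the same loop serves all three configurations, so any score difference comes from the configuration rather than the harness.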
A 20-task subset was replicated on independent infrastructure. Run-to-run variation: plus or minus 1.2 points. The full report is published in the methodology appendix.
No fine-tuning. No extra tools. No tricks. The only change to the agent is the rules. Whether the observed improvement comes from the rules alone is not yet settled, because the model version confound (Opus 4.7 vs 4.6 for the vanilla baseline) has not been resolved.
The 230 lines of governance rules used in the benchmark run. Early testing suggests rules 1, 3, and 4 carry most of the improvement, though this has not been isolated at full scale.
1. Before coding, list the directory and read key files. Understand what exists.
2. State your approach in one or two sentences before executing.
3. If a command fails, diagnose. Never run the same failing command twice.
4. After implementing, test. Run it. Check the output.
5. Work efficiently. Don't read files you don't need.
6. If three attempts fail, step back and reconsider the whole approach.
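Two of the rules above, never repeating a failing command and stepping back after three failed attempts, are mechanical enough to sketch in code. In the benchmark they are prose instructions given to the model, not enforced logic; the `AttemptGuard` class below is a hypothetical illustration of how an agent loop might track them.

```python
class AttemptGuard:
    """Tracks failed commands so an agent loop can follow the two rules."""

    def __init__(self, max_failures: int = 3):
        self.max_failures = max_failures
        self.failed_commands = set()
        self.failure_count = 0

    @staticmethod
    def _normalize(command: str) -> str:
        # Collapse whitespace so trivially re-spaced commands count as equal.
        return " ".join(command.split())

    def may_run(self, command: str) -> bool:
        """False if this exact command has already failed once."""
        return self._normalize(command) not in self.failed_commands

    def record(self, command: str, succeeded: bool) -> None:
        """Record the outcome of a command that was just executed."""
        if not succeeded:
            self.failed_commands.add(self._normalize(command))
            self.failure_count += 1

    def should_reconsider(self) -> bool:
        """True once the failure budget is spent: rethink the approach."""
        return self.failure_count >= self.max_failures


guard = AttemptGuard()
guard.record("make test", succeeded=False)
print(guard.may_run("make  test"))   # same command, re-spaced
print(guard.may_run("make test V=1"))  # a changed command is allowed
```

The guard only blocks exact repeats, which leaves the diagnosis step to the agent: it has to change something before it can try again.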