Skip to content

Self-Assessment

Three rubrics, one for each scope. Each places the subject on a Spine level (L1–L5) and scores each Axis (0–4). The Spine level is capped at L3 if Verification is at the bottom — the core rule: no real agentic maturity without verification.


Spine placement — pick the highest statement that is reliably true under pressure (not your best day):

  • L1 — I type requests into a chat and use what comes back; I rarely read diffs.
  • L2 — I keep reusable prompts (roles, examples, format constraints) and review diffs before accepting.
  • L3 — I maintain a personal/project CLAUDE.md, curate my tools/MCPs, and manage context deliberately.
  • L4 — I run agents in a loop against a written spec + a check they can run, and I review every diff. A meaningful share of my work is delegated-and-verified.
  • L5 — I orchestrate multiple/parallel agents, run my own evals, and feed learnings back across sessions.

Axis scores (0 = absent, 4 = strong):

Axis024
VerificationEyeballs outputRuns tests/build manually before mergeAgent runs its own check; personal eval harness
Context hygieneOne long session for everythingClears between tasksCurated CLAUDE.md + subagents isolate exploration + compaction
Autonomy / leashApproves everything, or blindly auto-acceptsAuto-accepts low-risk behind a checkDials leash per risk; earns it via proven checks; sandboxes the rest
LearningRepeats the same correctionsAdds fixes to CLAUDE.md sometimesCorrections + memory feed back systematically; builds skills
Cost & governanceNo cost senseAware of token costToken-efficient tooling; respects tier/permission discipline

A useful north star: the delegation ratio — the share of your work that is delegated-and-verified to agents. The “and-verified” is what separates L4 from fast L1.


2. Codebase (repo) — the automatable one

Section titled “2. Codebase (repo) — the automatable one”

This scope is mostly binary, file-existence checks. Score = the % of checks passed, mapped to a level. A repo must pass ~80% of a level’s checks to claim it.

PillarConcrete signal (pass/fail)
Agent instructionsA CLAUDE.md / AGENTS.md exists at root, is non-trivial, references example files
TestingA test suite exists; CI runs it on PRs (the agent’s feedback loop)
Build / validationReproducible build; linter + formatter + type-checker configured (guardrails)
DocsREADME + setup/contributing docs present
Dev environmentOne-command setup (a Makefile, devcontainer, or script)
Code qualityPre-commit / pre-push hooks enforce style
ObservabilityLogging/monitoring so agent actions are inspectable
Security / governanceSecret scanning; ignore-hygiene; access conventions
Evals (unlocks L4+)An eval suite or LLM-as-judge harness lives in the repo

Level mapping:

LevelMeaning
L1 FunctionalA human can build and run it
L2 Structured+ CLAUDE.md, linter/formatter, basic tests
L3 Agent-ready (target)+ CI-on-PR, type-checker, one-command setup, secret scanning, docs
L4 Measured+ eval harness, observability, pre-commit gates
L5 Self-improving+ a compounding CLAUDE.md, shared skills/commands, agent-readable docs

A maturity radar, not a single number. Dimensions borrowed from DORA’s 2025 AI Capabilities Model:

DimensionSignal
AI stanceA documented, communicated policy on permitted tools/usage
Data ecosystemInternal data/docs are high-quality, unified, and AI-accessible
Version control & batch sizeStrong VCS discipline; small, frequent changes
Internal platformShared agent harnesses/skills; a quality internal platform
Throughput & stabilityDelivery metrics tracked — watch change-fail rate as throughput rises
TrustThe share of developers who trust AI-generated code (closing the gap is maturity)
Adoption depth% of repos at L3+; % of workflows with eval coverage
Org learning loopDo evals/learnings feed back into shared prompts, skills, and conventions?

  1. Place the subject (Spine level + Axis scores), applying the Verification cap.
  2. Prescribe — take the weakest axis and pick the matching practices from the Pattern Library.
  3. Re-assess over time — quarterly for people, per-PR for repos, per-quarter for teams — and track the progression.