ninety-graph · part II

A CLI that guides the AI.

How we design command-line tools that act as guard rails for an agent — and close the loop so the agent retrospects its own work and rewrites the skill that handles the next ticket.

Three ideas: the CLI owns the operations, the LLM owns the judgment · a gated state machine the agent can't skip · and a self-improving loop where every improvement leaves a trace you can audit.

the design choice

MCP hands the model a toolbox. We hand it a procedure.

MCP · a toolbox

A flat list of tools — the model picks which, and in what order.
Ordering rules live in prose descriptions the model is free to ignore.
Every tool's schema sits in the context window, every turn.
Needs a host and a live connection that speaks the protocol.

The model orchestrates — so it can skip a step, repeat one, or write before it's allowed. Correctness rides on the model's discretion.

gated CLI · a procedure

One command per step — each refuses unless the state is right.
The order is enforced by the tool, not described to the model.
Discovered on demand (--help, next) — no context tax.
Runs anywhere there's a shell: the agent, cron, CI, or a human.

The correct path is the only path — and every caller, LLM or not, drives the exact same rail.

Not anti-MCP — it's about where the guard rails live. A tool list exposes capabilities; a gated CLI encodes a process. When an agent acts on production, we want the process to be the thing it can't get wrong.

the inversion

Don't hand the agent the raw system. Hand it a guide.

The CLI owns every operation. All the data wiring lives in one known-good code path, never improvised in a prompt.
The LLM owns the judgment. It decides what happened and passes flat, validated facts: --reporter, --company, --defect.
The agent never touches the store directly. It says "I'm classifying this ticket." The CLI makes that true, atomically, or refuses and says why.

# the agent's whole vocabulary is verbs, not queries
$ ninety-graph ticket classify t-... \
    --reporter yohanes@acme.com \
    --company acme --system order-version \
    --defect status-desync::order-shows-lost
Classified t-... (state: classified).
Next: ninety-graph ticket resolve t-... --type ...

the toolbelt

One agent, many systems — each behind a guided tool.

triage · where work lands

Slack · Linear

Tickets arrive in #platform-feedback; the work is tracked as Linear issues.

diagnose · read the truth

Datadog · BigQuery

Logs · traces · metrics, plus production data replicas (invoices, payments, orgs, videos…).

act · change the system

ninety CLI

Drives the services: VAAS · Postgres · Xero · Stripe · SendGrid · Sendbird · Typesense.

capture & improve

ninety-graph

The gated CLI — ticket lifecycle, defects, and retrospections. The one we're zooming into.

ninety-graph is one tool in the belt. The pattern that makes it safe for an agent to drive — gate · validate · audit — is the same pattern we want behind every system on this slide.

technique 1 · the rail

A gated state machine the agent can't skip.

open →[classify]→ classified →[resolve]→ resolved →[retrospect]→ done

The correct path becomes the only path. An LLM skips steps, repeats them, or tries to do everything at once. A gate makes a half-classified or out-of-order ticket unrepresentable, not just discouraged.
The error is the next instruction. A command run in the wrong state writes nothing and exits 5 with the exact command that advances the ticket, so the agent recovers without guessing and without a human.
The agent never has to infer where it is. ticket next <id> always prints the current state and the next command, removing a whole class of hallucinated next steps.
Safe to run in parallel. Each step is conditional on the current state, so two agents at once can't corrupt each other: the second of two concurrent moves finds the state already changed and writes nothing.

$ ninety-graph ticket resolve t-42 --type fix --summary "..."
ERROR Ticket 't-42' is in state 'open' but ticket resolve requires state 'classified'.
      Nothing was written.
Next: ninety-graph ticket classify t-42 --reporter ...
exit 5
# the error IS the instruction. the agent just follows it.
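A minimal Python sketch of the gate described above, under stated assumptions: the transition table, the Result type, and the function names are hypothetical, and the "retrospect" verb is taken from the state diagram rather than from a documented command. It shows only the shape of the idea: refuse in the wrong state, write nothing, and return the command that advances the ticket.

```python
from dataclasses import dataclass

# Hypothetical transition table: verb -> (required state, resulting state).
TRANSITIONS = {
    "classify": ("open", "classified"),
    "resolve": ("classified", "resolved"),
    "retrospect": ("resolved", "done"),
}
EXIT_OK, EXIT_WRONG_STATE = 0, 5


@dataclass
class Result:
    exit_code: int
    message: str


def next_command(ticket_id: str, state: str) -> str:
    """Return the command that advances a ticket out of `state`, if any."""
    for verb, (required, _target) in TRANSITIONS.items():
        if required == state:
            return f"ninety-graph ticket {verb} {ticket_id}"
    return "(nothing to do: ticket is done)"


def run_verb(tickets: dict, ticket_id: str, verb: str) -> Result:
    """Apply `verb` only when the ticket is in the state the verb requires."""
    required, target = TRANSITIONS[verb]
    current = tickets[ticket_id]
    if current != required:
        # Refuse without writing, and name the command that would advance.
        return Result(
            EXIT_WRONG_STATE,
            f"Ticket '{ticket_id}' is in state '{current}' but ticket {verb} "
            f"requires state '{required}'. Nothing was written. "
            f"Next: {next_command(ticket_id, current)}",
        )
    tickets[ticket_id] = target
    return Result(EXIT_OK, f"Next: {next_command(ticket_id, target)}")
```

Because the refusal carries the next command, a caller that simply runs whatever follows "Next:" converges on the correct path without knowing the machine in advance.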

technique 2 · garbage never leaves the CLI

Validate before the network. A grammar per flag.

Bad data should be impossible, not just unlikely. An LLM will confidently pass a hallucinated id or an enum that doesn't exist. We reject it locally, so the store's integrity never rides on the model being careful. Checked before any call: UUIDs, ISO-8601, kebab-case keys, bounded enums, fingerprints.
An error should be a correction the model can act on. Each message names the flag and shows a valid example, so the agent fixes itself and retries, with no human in the loop and no stall.
Failure has to be legible before the agent can recover from it. A small exit-code grammar lets the agent branch: fix my input, back off, do the prerequisite first. One opaque error would just strand it.

0 ok 2 validation 3 API error 4 not found 5 wrong state

$ ninety-graph ticket link-linear t-42 --key eng-1234
ERROR Invalid --key 'eng-1234' — must be a Linear issue key like ENG-1234.
exit 2 · nothing left the process
# bad input dies locally — it never reaches the backend,
# so the data can never hold something malformed.
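One way to picture "a grammar per flag" is a table of patterns checked before any network call. The sketch below is an assumption-heavy illustration, not the tool's actual rules: the regular expressions, the GRAMMAR table, and the second --type value ("workaround") are all hypothetical. Only the flag names, the ENG-1234 example, and exit code 2 come from the slides.

```python
import re

EXIT_OK, EXIT_VALIDATION = 0, 2

# Hypothetical grammar: flag -> (pattern, human-readable hint with an example).
GRAMMAR = {
    "--key": (re.compile(r"[A-Z]+-\d+"), "a Linear issue key like ENG-1234"),
    "--company": (re.compile(r"[a-z0-9]+(-[a-z0-9]+)*"), "a kebab-case key like acme"),
    # Bounded enum; "workaround" is an invented second value for illustration.
    "--type": (re.compile(r"fix|workaround"), "one of: fix, workaround"),
}


def validate(flag: str, value: str) -> tuple:
    """Check one flag value locally; return (exit code, message)."""
    pattern, hint = GRAMMAR[flag]
    if pattern.fullmatch(value):
        return EXIT_OK, ""
    # Name the flag and show a valid example so the caller can self-correct.
    return EXIT_VALIDATION, f"Invalid {flag} '{value}': must be {hint}."


def validate_all(flags: dict) -> tuple:
    """Validate every flag before anything leaves the process."""
    for flag, value in flags.items():
        code, message = validate(flag, value)
        if code != EXIT_OK:
            return code, message
    return EXIT_OK, ""
```

The point of returning on the first failure is that nothing is sent until every flag has passed, so a rejected call has no side effects to undo.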

technique 3 · provenance by construction

One command = one transaction = one audit record.

Autonomy demands accountability. If the agent acts on its own, you must be able to ask "what did it do?" later, so every change writes its own audit record in the same transaction it commits.
The trail can't drift from reality. What, when, by whom, to what: all captured automatically. Nobody has to remember to log, so the record is always complete and always true.
Cheap review is what makes acting safe. "What happened, when, by whom?" is one command, ninety-graph log, so auditing the agent costs seconds, not an investigation.

$ ninety-graph log --since P7D
2026-06-16T09:12Z  ticket.classify  witek   t-42
2026-06-16T09:30Z  ticket.resolve   witek   t-42 (fix)
2026-06-16T11:02Z  retro.start      ai-bot  10 tickets
2026-06-16T11:40Z  retro.finalize   ai-bot  skillver-...
# the audit trail isn't bolted on — it's the same write.
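The "same write" idea can be sketched with any transactional store. The example below uses SQLite purely as a stand-in (the slides do not say what the real store is), and the table layout and function names are hypothetical. The state change is a conditional UPDATE, so it also illustrates why a concurrent second move writes nothing.

```python
import sqlite3
from datetime import datetime, timezone


def open_store() -> sqlite3.Connection:
    """Create an in-memory stand-in store with a ticket table and an audit table."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE tickets (id TEXT PRIMARY KEY, state TEXT NOT NULL)")
    db.execute("CREATE TABLE audit (at TEXT, action TEXT, actor TEXT, target TEXT)")
    db.commit()
    return db


def transition(db, ticket_id, action, required, target, actor) -> bool:
    """Move a ticket and record the audit row in one transaction."""
    now = datetime.now(timezone.utc).isoformat(timespec="minutes")
    with db:  # commits on success, rolls back if anything inside raises
        cur = db.execute(
            # Conditional write: only succeeds if the state is still `required`.
            "UPDATE tickets SET state = ? WHERE id = ? AND state = ?",
            (target, ticket_id, required),
        )
        if cur.rowcount != 1:
            return False  # wrong state or lost a race: nothing written, nothing logged
        db.execute(
            "INSERT INTO audit VALUES (?, ?, ?, ?)",
            (now, action, actor, ticket_id),
        )
    return True
```

Since the audit row lives inside the same transaction as the state change, there is no window in which one exists without the other.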

technique 4 · no side doors

Reads are open. Writes have exactly one path.

A capable agent finds the shortest path to its goal, including a raw write that skips your rules. So the free-form read/inspect surface is read-only by construction: it can explore anything and mutate nothing.
The rules you encoded actually hold. Every change goes through a gated verb, so the guard rails are an invariant, not a suggestion the model can route around.
When there's a real gap, widen the safe surface; don't open a backdoor. "Link a defect after classifying" was missing, so we added a gated verb (ticket link-defect), not a way around the gate.

$ ninety-graph query "<read-only lookup>"       # explore freely
→ results …
$ ninety-graph query "<an attempted change>"
ERROR this command is read-only — writes are refused.
      change things through a gated verb.
exit 3
# discovery is open; mutation is walled.
# the only door to a write is the verb that guards the rule.
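"Read-only by construction" means the refusal comes from the connection, not from inspecting the query text. A sketch of that idea, again with SQLite standing in for the unnamed real store: PRAGMA query_only makes the connection itself reject writes. The function names and the mapping of any store-side failure to exit 3 are assumptions for illustration.

```python
import sqlite3

EXIT_OK, EXIT_API_ERROR = 0, 3


def make_read_only(db: sqlite3.Connection) -> sqlite3.Connection:
    """Switch the connection so the store itself refuses every write."""
    db.execute("PRAGMA query_only = ON")
    return db


def query(db: sqlite3.Connection, sql: str) -> tuple:
    """Run a free-form statement on the read-only surface."""
    try:
        rows = db.execute(sql).fetchall()
    except sqlite3.OperationalError as exc:
        # The store rejected the statement; surface it as a refusal.
        return EXIT_API_ERROR, f"refused: {exc}. Change things through a gated verb."
    return EXIT_OK, rows
```

The design choice is that the query surface never has to parse or classify what the agent sent: a write fails at the store no matter how it is phrased.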

from rails → to a loop

Guard rails make the agent safe. The loop makes it better.

Every ticket the agent closes is now a clean, audited record. That record is the raw material for the next idea: have the agent review a batch of its own work, distil what it learned, and rewrite the skill it uses — leaving a trace at every step.

The ticket layer is the foundation. The review / improve layer — shown next — runs on the same gated, audited verbs, so improvements are as traceable as the work that produced them.

the self-improving loop

Resolve → retrospect → improve → resolve better.

Each box is a real CLI command that records what it did — the loop runs on the same gated, audited verbs as the ticket lifecycle.

step 1 · gather

A session reviews a focused batch — and connects it.

retro start atomically attaches every eligible ticket — resolved/done, not yet reviewed by any session.
Bounded to 10 at a time; it reports how many remain so nothing is silently dropped.
Decoupled from the lifecycle: reviewing a ticket never moves its state — a session is a learning layer beside the machine, not part of it.

$ ninety-graph retro start --slug june-batch \
    --skill platform-feedback-triage --summary "June retro"
Started retro-session-2026-06-16-june-batch (open).
Attached 10 tickets: t-31 t-33 t-37 t-40 t-42 ...
7 more eligible tickets remain — run retro start again.
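The selection rule behind retro start (eligible means resolved or done and not yet reviewed, capped at 10, with the remainder reported) can be sketched in a few lines. The data shapes and the function name here are hypothetical; the real command attaches tickets atomically in the store, which this in-memory version does not model.

```python
BATCH_LIMIT = 10
ELIGIBLE_STATES = {"resolved", "done"}


def start_session(tickets: dict, reviewed: set, limit: int = BATCH_LIMIT) -> tuple:
    """Pick the next batch to review; return (attached ids, eligible ids left over).

    `tickets` maps ticket id -> lifecycle state and is never modified:
    reviewing is a layer beside the state machine, not a transition in it.
    """
    eligible = sorted(
        ticket_id
        for ticket_id, state in tickets.items()
        if state in ELIGIBLE_STATES and ticket_id not in reviewed
    )
    attached = eligible[:limit]
    reviewed.update(attached)  # a ticket is reviewed by at most one session
    return attached, len(eligible) - len(attached)
```

Returning the leftover count is what keeps the bound honest: the caller always learns whether another session is needed.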

step 2 · learn (with receipts)

Capture lessons and skill gaps — each tied to its evidence.

capture-lesson records the lesson against this session's view of the ticket — never changing the ticket itself.
capture-improvement --from-ticket records a skill gap and links it to the ticket and its resolution.
So every improvement is traceable to the resolution that motivated it — not a free-floating opinion.

$ ninety-graph retro capture-improvement june-batch \
    --summary "add a 5-systems tabulation checklist" \
    --from-ticket t-42
Recorded improvement improvement-9f… (from t-42).
  linked to ticket t-42
  linked to its resolution

step 3 · improve

Finalize: a proposed skill version + a draft the human reviews.

retro finalize records a new skill version linked to every improvement the session captured.
It writes the improved skill to a draft file — <skill>.<session>.proposed.md. The live skill is never overwritten.
The CLI persists the trace; the human promotes the draft. Self-improving, not self-deploying.

$ ninety-graph retro finalize june-batch \
    --content-file ./triage.improved.md
Finalized session (state: closed).
Recorded skill version, addressing 4 improvements.
Wrote → platform-feedback-triage.june-batch.proposed.md
The LIVE skill was NOT modified — review & promote.
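The "never overwrite the live skill" rule comes down to where the file is written. A small sketch, assuming the live skill sits in the same directory as <skill>.md (that filename and the function names are assumptions; only the <skill>.<session>.proposed.md naming comes from the slides):

```python
import tempfile
from pathlib import Path


def write_proposed(skill_dir: Path, skill: str, session: str, content: str) -> Path:
    """Write the regenerated skill beside the live one, under a draft name."""
    draft = skill_dir / f"{skill}.{session}.proposed.md"
    draft.write_text(content, encoding="utf-8")
    return draft


def demo() -> tuple:
    """Show that writing a draft leaves the (assumed) live file untouched."""
    with tempfile.TemporaryDirectory() as tmp:
        skill_dir = Path(tmp)
        live = skill_dir / "platform-feedback-triage.md"  # assumed live filename
        live.write_text("live skill v1\n", encoding="utf-8")
        draft = write_proposed(
            skill_dir, "platform-feedback-triage", "june-batch", "proposed skill v2\n"
        )
        return draft.name, live.read_text(encoding="utf-8")
```

Promotion is deliberately absent from the sketch: replacing the live file stays a human action outside this code path.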

storing the traces · versioning

A new version is the active skill, fully rewritten.

The LLM regenerates the whole skill; it doesn't patch it. It reads the current active version and the session's list of improvements, then rewrites the entire skill into a new version.
One coherent document, not a stack of diffs. A regenerated skill reads as one voice, with nothing for the next agent to reconcile.
It lands as a draft, on purpose. Written beside the live skill for a human to promote, so the active version never changes by accident.

Not a diff — a regeneration: the model rewrites the whole skill from the active version plus the improvements.

storing the traces · provenance

Every version reads back to its evidence.

The new version isn't a black box. It records the improvements it answers — and each of those points to the ticket and resolution that surfaced it. Follow the arrows backwards and any line in the skill traces to the work that justified it.

Same audit spine as the ticket lifecycle — so an improvement is as accountable as the resolution that prompted it.
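Following the arrows backwards is a plain lookup once each record carries its links. The record shapes and ids below are hypothetical, chosen only to show the chain version → improvement → ticket and resolution:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Improvement:
    id: str
    summary: str
    ticket_id: str       # the ticket that surfaced the gap
    resolution_id: str   # the resolution that motivated it


@dataclass(frozen=True)
class SkillVersion:
    id: str
    improvement_ids: tuple  # every improvement this version answers


def trace(version: SkillVersion, improvements: dict) -> list:
    """Walk a skill version back to (improvement, ticket, resolution) triples."""
    return [
        (imp_id, improvements[imp_id].ticket_id, improvements[imp_id].resolution_id)
        for imp_id in version.improvement_ids
    ]
```

A version whose improvement ids do not resolve raises a KeyError here, which is the behavior you would want from an audit walk: a broken link should be loud.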

why improvements stay separate

Generate the skill from concepts — never bake them in.

Edit the prompt directly and it's a one-way door. Hand-merge a lesson into the skill text and the concept is compressed into prose: you can no longer see which line came from which lesson, or pull it back out. Backtracing is gone.
So every improvement stays a separate record, tied to its retrospection. Discrete, provenance-carrying units, not diffs smeared into the document. You always know what an improvement was and where it came from.
The skill is regenerated from that set: apply or pull out any of them. The source of truth is the list of improvements, not the prose. Include one, exclude another, regenerate, and diff the drafts. Every change stays reversible and testable.

imp A · imp B · imp C · imp D → draft-v1.proposed.md
imp A · imp C · imp D → draft-v2.proposed.md (B dropped)

Diff the two to see exactly what imp B was worth.

Concepts live as records, not prose — so the skill is always re-derivable, and any improvement can be applied or pulled back out. Bake them in and you lose that forever.
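The include/exclude/diff workflow can be sketched with the standard library. The regenerate function below is a deterministic stand-in for the LLM rewrite (the real step is a model call, and its output would not be a simple bullet list); all names and the placeholder improvement texts are hypothetical.

```python
import difflib

# Placeholder improvement records; real ones would carry provenance links.
IMPROVEMENTS = {"A": "imp A", "B": "imp B", "C": "imp C", "D": "imp D"}


def regenerate(base: str, improvements: dict, include: list) -> str:
    """Stand-in for the LLM rewrite: build the skill from the chosen set."""
    lines = [base] + [f"* {improvements[key]}" for key in include]
    return "\n".join(lines) + "\n"


def removed_lines(old: str, new: str) -> list:
    """Return the lines present in `old` but gone from `new`."""
    diff = difflib.unified_diff(
        old.splitlines(),
        new.splitlines(),
        fromfile="draft-v1.proposed.md",
        tofile="draft-v2.proposed.md",
        lineterm="",
    )
    # Skip the '---' file header; keep only removed content lines.
    return [line[1:] for line in diff if line.startswith("-") and not line.startswith("---")]


draft_v1 = regenerate("Skill base", IMPROVEMENTS, ["A", "B", "C", "D"])
draft_v2 = regenerate("Skill base", IMPROVEMENTS, ["A", "C", "D"])
```

With a real model in place of regenerate the diff would be noisier, since a full rewrite can rephrase unrelated lines; the sketch only shows why keeping improvements as separate records makes the comparison possible at all.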

the payoff

A skill that rewrites itself — with receipts.

guided

The agent moves through a gated machine it can't skip or corrupt. Every error is the next instruction.

self-improving

Batches of resolved tickets become lessons → skill gaps → a proposed new skill version, on a loop.

auditable

Every improvement traces back to the exact ticket, resolution, and reasoning that justified it.

The trail isn't just memory — it's a benchmark accumulating underneath. Each resolution is a frozen eval item; each improvement is a hypothesis with provenance. That's the substrate for grading change, not guessing at it.

over to you

Discussion & questions.

Where should the line between the agent's judgment and the tool's guard rails sit — and how far does this generalize beyond support tickets?

trust & autonomy

How much resolution runs unattended — and where does the human stay in the loop?

measuring "better"

What's the right benchmark when you can't re-run the past against today's world?

beyond tickets

Which of these guard rails generalize to other agent workflows we run?

the north star

Agents move from running our tools to operating inside the ticketing system itself — turning context into execution, while humans keep the intent, judgment, and taste. Our gated CLI is how we get there safely: the procedure stays the thing the agent can't get wrong. linear.app/next ↗