Orchestrating an Agentic Crew at Scale
A year ago, I proposed that engineering teams should “start slow” to build the foundations for AI integration. Since then, my own limit-testing (pushing roughly 60,000 lines of code in a two-week sprint) has forced a total reassessment of the engineering “inner loop”.
Before this system, I was spending half my day context-switching, manually re-explaining architectural constraints to agents that had drifted from the project’s original intent. What I’ve landed on is a shift away from “pairing” toward managing an orchestrated system. I call this the hybrid v-team model: a framework where the human is the architect of intent, and a specialized crew of agents manages the mechanical labor of implementation.
The Consolidated Roster and Coach Cam
In this system, I’ve moved past the generalist chatbot. I now operate with a consolidated roster of 18 specialized agents across four priority tiers: P0 Core, P1 Essential, P2 Regular, and Cloud Specialists.
I started with a roster of 32 agents, which proved unreliable and unwieldy. The real breakthrough was the “Absorbs” pattern: I merged those 32 personas into 18 capability-based agents to reduce coordination overhead, keeping a prioritized core set active and calling in the rest only on demand. The roster starts with Coach Cam (P0), my human interface agent. Cam’s job is to pressure-test my vision via 5 Whys, inversion, and constraint surfacing before a single line of code is written. Only once the intent is clarified do I trigger the build-phase agents:
Archie (Architecture + Data + API): Absorbed Schema Sam and Contract Cass to own system boundaries, data modeling, and migration safety.
Ines (DevOps + SRE + Chaos): Owns everything between git push and production traffic.
Pierrot (Security + Compliance): Carries a dark-humored lens for vulnerabilities and SOC 2 compliance, with full veto power.
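The “Absorbs” pattern can be pictured as a small registry in which each agent records the retired personas it now covers. This is a minimal sketch, not the actual configuration: the `Agent` dataclass, the `veto` flag, and the `resolve` helper are hypothetical names, only four of the 18 agents are shown, and tiers are left as `None` wherever the text above does not state them.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Agent:
    name: str
    capabilities: tuple
    tier: Optional[str] = None   # None = tier not stated in the article
    absorbs: tuple = ()          # retired personas folded into this agent
    veto: bool = False


# Illustrative subset of the roster (assumption: field names are invented here).
ROSTER = [
    Agent("Cam", ("Human interface",), tier="P0"),
    Agent("Archie", ("Architecture", "Data", "API"),
          absorbs=("Schema Sam", "Contract Cass")),
    Agent("Ines", ("DevOps", "SRE", "Chaos")),
    Agent("Pierrot", ("Security", "Compliance"), veto=True),
]


def resolve(persona: str) -> Agent:
    """Route a request for any persona, current or retired, to a live agent."""
    for agent in ROSTER:
        if persona == agent.name or persona in agent.absorbs:
            return agent
    raise KeyError(f"no agent owns persona {persona!r}")
```

With this shape, a prompt that still mentions “Schema Sam” resolves to Archie, so old references keep working while the number of agents to coordinate shrinks.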
What this looks like in practice...
You: I want to build a CLI tool that converts markdown to PDF.
Cam: Let me explore that with you. Who’s the target user: developers or non-technical users? And what’s your priority: fidelity to the markdown spec, or speed of conversion?
You: Developers, and speed matters more than perfect rendering.
Cam: Got it. Let me probe a few more things before we commit to an approach...
The TDD Pipeline: Separating the Test from the Code
I use a strict Tara-to-Sato pipeline for implementation. Instead of one agent doing red-green-refactor, I separate the test-writer from the implementer to prevent “happy path” bias:
Red Phase (Tara): Tara writes the failing tests first, and only tests. She has veto power on test coverage and can block a merge if critical paths are untested.
Green/Refactor Phase (Sato): Sato, our principal SDE workhorse, receives Tara’s failing tests and writes the minimum production code to make them pass.
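The separation of roles above can be sketched as a pipeline that refuses to start the green phase unless the tests genuinely fail first. This is an illustrative sketch under stated assumptions: `tara` and `sato` are plain Python stand-ins for the two agents, the `slugify` example is invented, and an implementation is modeled as a dictionary of functions.

```python
from typing import Callable, Dict, List

Impl = Dict[str, Callable]
Test = Callable[[Impl], None]   # a test raises on failure


def run(tests: List[Test], impl: Impl) -> bool:
    """Return True only if every test passes against the implementation."""
    for test in tests:
        try:
            test(impl)
        except Exception:
            return False
    return True


def pipeline(write_tests: Callable[[], List[Test]],
             write_code: Callable[[List[Test]], Impl]) -> Impl:
    tests = write_tests()                      # red phase: tests only
    if run(tests, {}):
        raise RuntimeError("red phase invalid: tests pass with no implementation")
    impl = write_code(tests)                   # green phase: production code only
    if not run(tests, impl):
        raise RuntimeError("green phase failed: tests are still failing")
    return impl


# Hypothetical stand-ins for the two agents.
def tara() -> List[Test]:
    def test_slugify(impl: Impl) -> None:
        assert impl["slugify"]("Hello World") == "hello-world"
    return [test_slugify]


def sato(tests: List[Test]) -> Impl:
    return {"slugify": lambda s: s.lower().replace(" ", "-")}


result = pipeline(tara, sato)
```

The check before the green phase is the point of the design: a test that passes against an empty implementation proves nothing, so it is rejected before any production code is written.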
Engineering Through Conflict: The Adversarial Debate
For high-stakes decisions, I don’t just ask for an opinion; I trigger the Adversarial Debate Protocol.
The Workflow:
Round 1 — Opening: Archie (Architecture) and Wei (Devil’s Advocate) are invoked in parallel. Archie proposes a design while Wei attempts to break its assumptions.
Round 2 — Response: Archie provides a point-by-point response to Wei’s challenges.
Round 3 — Rebuttal: Wei provides a final rebuttal before the decision is recorded in an ADR and the debate tracked alongside it.
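The three rounds above could be orchestrated roughly as follows. This is a sketch, not the actual protocol implementation: `AgentFn` is a hypothetical stand-in for an LLM-backed agent call, the prompt wording is invented, and because Round 1 runs in parallel the challenger is assumed to attack the question's likely assumptions without seeing the proposal.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable

AgentFn = Callable[[str], str]   # hypothetical: prompt in, reply out


@dataclass
class DebateRecord:
    """Transcript to be stored alongside the ADR."""
    question: str
    proposal: str
    challenge: str
    response: str
    rebuttal: str


def run_debate(question: str, proposer: AgentFn, challenger: AgentFn) -> DebateRecord:
    # Round 1: both agents are invoked in parallel on the same question.
    with ThreadPoolExecutor(max_workers=2) as pool:
        proposal_future = pool.submit(proposer, f"Propose a design: {question}")
        challenge_future = pool.submit(
            challenger, f"Attack the likely assumptions: {question}")
        proposal = proposal_future.result()
        challenge = challenge_future.result()
    # Round 2: the proposer answers each challenge point by point.
    response = proposer(f"Respond point by point to: {challenge}")
    # Round 3: the challenger gets the last word before the decision is recorded.
    rebuttal = challenger(f"Final rebuttal to: {response}")
    return DebateRecord(question, proposal, challenge, response, rebuttal)
```

Returning the whole transcript as one record keeps the debate attached to the decision, so a later reader of the ADR can see which objections were raised and how they were answered.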
I saw this in action during ADR-0011, which covered structured output in our Rust (Tauri + Svelte) stack. Wei identified that a permissive JSON schema, using additionalProperties: { type: "number" }, allowed the LLM to misspell parameter names (e.g., grain_sine instead of grain_size) while the validation layer silently stripped them and used default values. Archie countered by correcting Wei’s latency estimate for grammar cache recompilation (100-300ms, not 1-3s) and successfully argued against a non-strict fallback tool. This friction transforms “vibe coding” into rigorous engineering.
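The failure mode Wei found is easy to reproduce in miniature. The sketch below is not the project's validation layer and does not use a real JSON Schema library: `validate` is a hand-rolled stand-in for a flat object schema, `apply_params` mimics a downstream layer that drops unknown keys, and the default value 1.0 and the input 0.25 are invented for illustration.

```python
DEFAULTS = {"grain_size": 1.0}   # hypothetical default, chosen for illustration
KNOWN = {"grain_size"}


def validate(params: dict, allow_extra_numbers: bool) -> bool:
    """Tiny stand-in for schema validation of a flat parameter object."""
    for key, value in params.items():
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if not is_number:
            return False
        if key not in KNOWN and not allow_extra_numbers:
            return False   # strict: behaves like additionalProperties: false
    return True


def apply_params(params: dict) -> dict:
    """Downstream layer: silently drop unknown keys, then fill in defaults."""
    kept = {k: v for k, v in params.items() if k in KNOWN}
    return {**DEFAULTS, **kept}


llm_output = {"grain_sine": 0.25}   # misspelled parameter name

permissive_ok = validate(llm_output, allow_extra_numbers=True)    # True
strict_ok = validate(llm_output, allow_extra_numbers=False)       # False
effective = apply_params(llm_output)                              # {"grain_size": 1.0}
```

Under the permissive rule the typo passes validation and the requested 0.25 is replaced by the default with no error anywhere. The strict rule turns the same typo into a visible rejection that the model can be asked to correct.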
[Figure: The seven phases of agentic delivery]
The 13-Item Done Gate and Failure-Pattern Flywheel
In a high-throughput environment, quality drift creeps in fast. My system enforces a mandatory 13-item Done Gate for every work item.
This checklist covers everything from accessibility reviews by Dani to migration safety verified by Archie. The real power, however, is the self-improving loop. At every sprint boundary, Grace (Coordination) triggers a mandatory retrospective. Agents analyze the sprint to identify recurring failure patterns. These findings generate process-improvement issues that literally block the next sprint from starting until they are resolved.
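The gate and the blocking retrospective can be sketched in a few lines. Everything named here is an assumption made for illustration: the 13 checklist items are not reproduced (callers supply their own), the failure tags are invented, and the rule that a pattern counts as “recurring” after two occurrences is a placeholder threshold.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Issue:
    title: str
    resolved: bool = False


def passes_done_gate(checklist: Dict[str, bool], required_items: int = 13) -> bool:
    """A work item is done only if all required items are present and checked."""
    return len(checklist) == required_items and all(checklist.values())


def retrospective(failure_tags: List[str], min_occurrences: int = 2) -> List[Issue]:
    """Turn failure tags that recur within a sprint into process-improvement issues."""
    counts = Counter(failure_tags)
    return [Issue(f"Process fix: {tag}")
            for tag, n in sorted(counts.items()) if n >= min_occurrences]


def can_start_next_sprint(blocking: List[Issue]) -> bool:
    """The next sprint stays blocked until every improvement issue is resolved."""
    return all(issue.resolved for issue in blocking)


# Hypothetical sprint log: one pattern recurs, one is a one-off.
tags = ["missing-migration-test", "stale-docs", "missing-migration-test"]
issues = retrospective(tags)
blocked = not can_start_next_sprint(issues)
```

Treating the retrospective output as a hard precondition, and not as a backlog suggestion, is what makes the loop self-improving: a recurring failure cannot be carried into the next sprint unaddressed.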
Agent-Notes: Engineering for Legibility
To operate at speed, agents must bootstrap context instantly. I’ve implemented an Agent-Notes Protocol: structured metadata at the top of every file.
// agent-notes: {
//   ctx: "auth middleware for JWT validation", // < 10 words
//   deps: ["lib/crypto", "models/user"],
//   state: "stable",
//   last: "archie@2026-02-12"
// }
These headers provide a ~50-word summary, allowing an agent to understand the file’s purpose and dependencies without reading the source. It is the “map of the territory” that prevents agents from wandering in the dark.
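An agent or a tool can read such a header without touching the rest of the file. The parser below is a minimal sketch, not part of the protocol itself: `read_agent_notes` is a hypothetical helper, and it assumes `//` line comments, straight ASCII quotes, one field per line, and values that are either a quoted string or a bracketed list.

```python
import json
import re

# Matches `key: "string"` or `key: [list]`; a trailing comma or comment is ignored.
FIELD = re.compile(r'^(\w+):\s*("(?:[^"\\]|\\.)*"|\[[^\]]*\])')


def read_agent_notes(source: str) -> dict:
    """Parse the agent-notes header from the leading `//` lines of a file."""
    notes, inside = {}, False
    for raw in source.splitlines():
        line = raw.strip()
        if not line.startswith("//"):
            break                       # the header must sit at the very top
        body = line[2:].strip()
        if body.startswith("agent-notes:"):
            inside = True
        elif inside and body.startswith("}"):
            return notes
        elif inside:
            match = FIELD.match(body)
            if match:
                notes[match.group(1)] = json.loads(match.group(2))
    return {}                           # missing or unterminated header


SAMPLE = '''// agent-notes: {
//   ctx: "auth middleware for JWT validation", // < 10 words
//   deps: ["lib/crypto", "models/user"],
//   state: "stable",
//   last: "archie@2026-02-12"
// }
fn main() {}
'''

notes = read_agent_notes(SAMPLE)
```

Because the loop stops at the first non-comment line, the cost of reading a header stays constant no matter how large the file is, which is what lets an agent scan a whole repository cheaply.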
Cospa and Context
Living in Tokyo, I often reflect on コスパ (cospa, or cost/performance). In this new era, cospa isn’t just about saving money; it’s about the efficiency of human judgment. By moving human effort to the “Planning Phase,” we ensure our judgment is applied where it has the highest leverage.
This 18-agent orchestra isn’t the starting point; it’s the destination. The bridge to this scale is built on the same fundamentals I’ve shared throughout this series: start slow, document your bottlenecks, and treat your repository as the system of record. That’s the real job now: building the environments that build the software.
What’s the dumbest thing your AI agents keep doing that you had to build a rule for? I’d love to hear how your teams are codifying their own “Constitutions” in CLAUDE.md and beyond.


