May 31, 2026 · 9 min read

Six Primitives for a Code Factory

Ramp’s background agent writes roughly 30% of the pull requests the company merges. Stripe merges more than a thousand PRs a week that contain no human-written code at all. WorkOS built a system that notices a Sentry error, opens a branch, debugs it, and hands you a PR, without anyone asking it to.

I don’t have Stripe’s pre-warmed devbox fleet or a platform team to run a factory. What a regulated shop has instead is a small team and a change advisory board.

The factory is still buildable, but only if you stop seeing it as a product you buy and start seeing it as a small number of primitives you assemble, and only if you understand why each one costs more inside a regulated bank than it did at the three companies that already shipped it.

The Harness Is the Moat, Not the Model

Michael Grinich said it plainly: the product is the harness, not the model. The LLM is the engine. The harness (sandbox, context, feedback loops, review gate) is the chassis and wheels that actually move you. The engine is rented: you don’t own Opus or GPT-5.5, you call them, and next quarter you call something better. What you own is the harness. And every time the engine changes, you rebuild the harness. You can’t just drop a new model into old scaffolding and expect it to hold.

That’s the argument from “Prompting Split Into Four Skills. Infrastructure Is the Fifth.” The 10x gap was never better prompts; it was encoding the disciplines as infrastructure that runs without you. The prompt is disposable; the harness compounds.

Ramp, Stripe, and WorkOS are the existence proof. Ramp’s Inspect reached ~30% of merged PRs in a couple of months, without a mandate. Stripe’s Minions clears more than a thousand merged PRs a week. WorkOS’s Horizon reacts to Linear tickets, GitHub merges, and Sentry errors and opens its own PRs. None of them built a smarter model. They built a better harness around the same models everyone else can call.

Why a Factory, and Why a Small Team Can Build One

The factory doesn’t automate typing. Writing code was never the bottleneck. In a regulated bank it’s roughly 25–30% of an engineer’s day, the point I made in “The Bottleneck Was Never the Code.” The value is closing the loop: running the tests, reading the telemetry, proving the change works before a human looks at it. Ramp describes Inspect as having “all the context and tools needed to prove it.” That proof step is the work.

Which is why a small team can own one: the harness is leverage. You build the loop once and it runs on every ticket, every error, forever. But a small team is not a small effort. Each primitive below is real engineering (sandbox orchestration, a context server, a verification pipeline, an identity model), not a prompt.

The Six Primitives

The diagram is the whole post in one frame: five primitives arranged as a loop, with the sixth, identity and delegation, as the governance band wrapping all of them. Everything below annotates it.

Figure: The six primitives of a regulated code factory. Trigger Surface, Execution Environment, Context Engine, Feedback Loop, and Human Review are arranged as a closed loop, wrapped by an Identity & Delegation governance ring.

Primitive 1: The Execution Environment

The agent needs somewhere to run that looks exactly like where a human engineer runs. Stripe runs “devbox” machines identical to the ones humans use, pre-warmed to spin up in 10 seconds, isolated from production and the internet so hundreds run in parallel safely. WorkOS chose Cloudflare Containers to run “the full monorepository stack, not a partial environment,” with “strong egress controls via Worker proxies to prevent prompt injection/exfiltration.”

In a bank, that isolation is not an optimization; it’s the starting posture. The regulated version is Azure-hosted, network-isolated, data-classified, with a prebuilt image of the full stack and no open path to the internet. The cost is structural. At Stripe, internet isolation is a safety feature they chose; in a regulated shop it’s a control you have to prove to security, to the network team, and to the examiner who asks where the agent’s traffic goes. The sandbox isn’t harder to build. It’s harder to clear.
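
One way to make an egress control provable rather than asserted is to express it as a default-deny rule that can be tested. This is a minimal sketch, assuming a hypothetical allowlist of internal hosts; the host names and the function name are illustrative and do not describe what Stripe, WorkOS, or any bank actually runs.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist: the only hosts the sandbox is permitted to reach.
ALLOWED_HOSTS = frozenset({"git.internal.example", "artifacts.internal.example"})


def egress_allowed(url: str, allowed_hosts: frozenset = ALLOWED_HOSTS) -> bool:
    """Default-deny: permit a request only if its host is explicitly allowlisted."""
    host = urlsplit(url).hostname  # lowercased hostname, or None if absent
    return host is not None and host in allowed_hosts


if __name__ == "__main__":
    for candidate in (
        "https://git.internal.example/repo.git",
        "https://example.com/upload",
    ):
        print(candidate, "->", "allow" if egress_allowed(candidate) else "deny")
```

The point of the exact-match lookup is that anything not named is denied, including look-alike hosts that merely start with an allowed name. In practice the enforcement would sit in the network layer or a proxy; the testable rule is what you hand to the reviewer.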

Primitive 2: The Context Engine

A model with no context fumbles. Stripe built a central server called Toolshed hosting more than 400 MCP tools across internal systems; WorkOS built a custom MCP context server that codifies engineer workflows for “faster convergence on the right workflow.”

The context surface usually already half-exists: codebase, Confluence, Jira, ServiceNow, Datadog. Teams have often built working skills against most of these: a Datadog investigation skill, Jira and Confluence connectors. Those are the raw material of a context engine. The factory is what happens when you stop running them by hand and wire them into the loop.

The cost is per-connector. At Stripe, adding a tool to Toolshed is an engineering task. At a bank, every connector touching a system of record is a data-handling review and an access-scope decision. A logs-based skill only works if PII is masked at the application layer before logs ever land there (not a convenience, a precondition), and that holds for every source you want in context.
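
To make the masking precondition concrete, here is a minimal sketch of application-layer masking applied to a log line before it is written anywhere. The three patterns and the replacement tokens are assumptions chosen for illustration; a real deployment would derive its patterns from its own data-classification policy and would not rely on regular expressions alone.

```python
import re

# Illustrative patterns only; order matters because each runs on the prior result.
_PATTERNS = [
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),
    (re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"), "[EMAIL]"),
    (re.compile(r"\b\d{13,16}\b"), "[CARD]"),
]


def mask_pii(line: str) -> str:
    """Replace anything matching a known PII pattern with a fixed token."""
    for pattern, token in _PATTERNS:
        line = pattern.sub(token, line)
    return line


if __name__ == "__main__":
    raw = "user jane.doe@example.com ssn 123-45-6789 card 4111111111111111"
    print(mask_pii(raw))  # user [EMAIL] ssn [SSN] card [CARD]
```

Because the masking runs before the log leaves the application, every downstream consumer, including an agent's context connector, only ever sees the masked form.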

Primitive 3: The Feedback Loop

This is the primitive that separates a factory from a code generator: the agent verifies its own work before handing it over. Stripe runs heuristic lint in under five seconds, then CI selectively against the relevant slice of more than three million tests, and caps itself at “at most two CI runs” per task. WorkOS verifies with linting, building, and automated tests. Both treat green tests as the gate, not a suggestion. Infrastructure engineers have run this exact pattern for a decade (declare desired state, let an executor converge, reconcile the drift), which is the argument I made in “The Reconciliation Loop.” The loop is not new. Only the executor is.
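
The shape of that loop (a cheap check first, then a bounded number of expensive runs) can be sketched in a few lines. This is an illustration under assumptions, not Stripe's or WorkOS's implementation: `lint`, `run_ci`, and `fix` are hypothetical callables standing in for the real tools, and what happens once the budget is spent (here, escalation to a human) is a guess.

```python
from dataclasses import dataclass
from typing import Callable

MAX_CI_RUNS = 2  # mirrors the "at most two CI runs" cap quoted above


@dataclass
class Outcome:
    passed: bool
    ci_runs: int
    reason: str


def verify(
    lint: Callable[[], bool],
    run_ci: Callable[[], bool],
    fix: Callable[[], None],
    max_ci_runs: int = MAX_CI_RUNS,
) -> Outcome:
    """Run the cheap gate first, then CI at most max_ci_runs times."""
    if not lint():
        return Outcome(False, 0, "lint failed")
    for attempt in range(1, max_ci_runs + 1):
        if run_ci():
            return Outcome(True, attempt, "green")
        if attempt < max_ci_runs:
            fix()  # let the agent attempt a repair before the next run
    return Outcome(False, max_ci_runs, "CI budget exhausted; escalate to a human")


if __name__ == "__main__":
    ci_results = iter([False, True])  # first run fails, second passes
    print(verify(lambda: True, lambda: next(ci_results), lambda: None))
```

The cap is the design choice worth copying: a bounded loop converges or hands off, where an unbounded one can burn compute indefinitely on a change it cannot fix.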

In a bank, “done” has more gates. Add SAST and SCA scans. Add change-record validation. Add the CAB submission artifact (risk assessment, rollback plan) as part of the definition of done, not paperwork that happens later.

And not every feedback loop is a machine. A regulated change still runs through formal QA cycles and User Acceptance Testing: humans exercising the change against business requirements and edge cases the test suite never encoded. The agent’s automated verification feeds those loops; it doesn’t close them. The factory can get a change to the door of QA far faster, but QA and UAT stay human-driven gates, and they’re where a large share of the calendar still goes.
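
The distinction between gates the agent can close and gates only people can close can be written down as a definition of done. A minimal sketch, assuming hypothetical gate names drawn from the two paragraphs above; the status strings and the evidence dictionary are illustrative, not a real change-management schema.

```python
# Hypothetical gate names; a real shop would map these to its own SDLC artifacts.
AUTOMATED_GATES = (
    "tests", "sast", "sca", "change_record", "risk_assessment", "rollback_plan",
)
HUMAN_GATES = ("qa", "uat")


def missing_gates(evidence: dict, gates: tuple) -> list:
    """Return the gates that have no passing evidence, in declared order."""
    return [gate for gate in gates if not evidence.get(gate, False)]


def status(evidence: dict) -> str:
    """The agent can reach 'awaiting human sign-off'; only people reach 'done'."""
    if missing_gates(evidence, AUTOMATED_GATES):
        return "not ready for QA"
    if missing_gates(evidence, HUMAN_GATES):
        return "awaiting human sign-off"
    return "done"


if __name__ == "__main__":
    automated = {gate: True for gate in AUTOMATED_GATES}
    print(status({"tests": True}))
    print(status(automated))
    print(status({**automated, "qa": True, "uat": True}))
```

Treating the CAB artifacts as automated gates is what moves them into the definition of done; keeping QA and UAT in a separate tuple makes it explicit that the factory delivers to those gates and does not pass them.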

Primitive 4: The Trigger Surface

Something has to start the work. Horizon is webhook-driven: Linear status changes, GitHub merges, Datadog and Sentry signals. Ramp triggers from Slack, a Chrome extension, and PR comments. In a regulated shop the obvious triggers are simple in shape: a Jira ticket moves to ready, a ServiceNow incident opens, a monitor fires.

The cost is that in a bank, every trigger is also a control point. What an agent is permitted to react to autonomously is a risk decision, not a config toggle. An agent that opens a PR off a Jira ticket is one posture; an agent that reacts to a production Sentry error and starts changing a payment flow is a completely different one. At Stripe the trigger surface is about coverage; at a bank it’s about scope of authority. You design the triggers and the authorization model in the same breath, or you don’t ship it.
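
Designing triggers and authorization together can be as simple as refusing to register a trigger without an authority level attached. The sketch below assumes a hypothetical policy table keyed on the trigger source and on whether the change touches a sensitive path; the trigger names and authority levels are illustrative, and the real decision belongs to risk owners, not to code.

```python
from enum import Enum


class Authority(Enum):
    IGNORE = "ignore the event"
    PROPOSE = "open a PR for human review"
    ASK_FIRST = "require human approval before starting"


# Hypothetical policy: (trigger source, touches a sensitive path) -> authority.
POLICY = {
    ("jira_ticket_ready", False): Authority.PROPOSE,
    ("jira_ticket_ready", True): Authority.ASK_FIRST,
    ("production_error", False): Authority.PROPOSE,
    ("production_error", True): Authority.ASK_FIRST,
}


def authority_for(trigger: str, sensitive: bool) -> Authority:
    """Default-deny: a trigger with no explicit policy entry is ignored."""
    return POLICY.get((trigger, sensitive), Authority.IGNORE)


if __name__ == "__main__":
    print(authority_for("jira_ticket_ready", sensitive=False).value)
    print(authority_for("production_error", sensitive=True).value)
    print(authority_for("slack_message", sensitive=False).value)
```

A table like this is also a reviewable artifact: each row is a scope-of-authority decision someone signed off on, and an unlisted trigger does nothing.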

Primitive 5: Human Review, The Legacy Dual-Track

WorkOS is blunt: “a human is always in the loop when the agent hands off the pull request,” and review is “frequently the bottleneck” even after everything else is automated. Ramp creates PRs with a human’s GitHub token, not an app token, so nothing merges without a person in the path.

Here’s where the bank diverges hardest. At those companies the review bottleneck is friction to optimize down. In a regulated bank, the bottleneck is the product: the SDLC, the CAB, the architecture review board, the SOX change boundary, the SR 11-7 model-risk obligations. Those aren’t inefficiencies nobody fixed; they’re the controls that let the institution operate at all.

So the factory doesn’t replace the legacy way of building software. It feeds it. The agent’s output has to arrive in a shape the existing SDLC can consume: flowing into CAB with its risk assessment attached, into the ARB with its design rationale, into the audit trail with its provenance intact. The factory runs on one track, the regulated SDLC on the other, and the whole engineering problem is the reconciliation between them. That’s where most of the time goes, not the model, not the sandbox.

Primitive 6: Identity and Delegation

Grinich’s sharpest line: “agents are the new users.” Once an agent is opening PRs and reacting to errors, it stops being a tool and becomes an actor, and actors need identity. Ramp’s user-token choice was exactly this instinct: the agent acts as someone, and that someone is accountable.

In a bank this is a present obligation, not a future one. An agent that opens a PR needs a governed non-human identity, an audit trail that survives an examiner’s questions, and a clear human it delegates for. I wrote about this gap in “Who Issued the Agent?” Your Okta federation governs humans, your Copilot agents got Entra Agent IDs, and the two planes don’t talk. You can’t run autonomous agents against systems of record without first answering: who is this agent, who issued it, who is accountable, and how do I prove it after the fact.
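
Those four questions map onto a record shape. The sketch below is an assumption about what such a record could hold, not a description of any product: the field names are hypothetical, and the hash chain is a technique swapped in here to illustrate tamper evidence. A real system would more likely rely on signed records in an append-only store.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class AgentAction:
    agent_id: str      # who is this agent
    issued_by: str     # who issued it
    on_behalf_of: str  # which human is accountable
    action: str        # what it did


def _digest(prev_hash: str, entry: dict) -> str:
    body = json.dumps(entry, sort_keys=True)
    return hashlib.sha256((prev_hash + body).encode()).hexdigest()


def append(log: list, entry: AgentAction) -> list:
    """Return a new log with the entry chained to the previous record's hash."""
    prev = log[-1]["hash"] if log else ""
    record = asdict(entry)
    return log + [{"entry": record, "prev": prev, "hash": _digest(prev, record)}]


def verify_chain(log: list) -> bool:
    """Recompute every hash; any edited or reordered record breaks the chain."""
    prev = ""
    for record in log:
        if record["prev"] != prev or record["hash"] != _digest(prev, record["entry"]):
            return False
        prev = record["hash"]
    return True


if __name__ == "__main__":
    log = append([], AgentAction("agent-7", "platform-team", "jane", "opened PR"))
    log = append(log, AgentAction("agent-7", "platform-team", "jane", "pushed fix"))
    print(verify_chain(log))
```

The first three fields answer who the agent is, who issued it, and who is accountable; the chain is one way to answer the fourth, proving after the fact that the record was not rewritten.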

Grinich’s auth.md, letting agents register for services with trust riding on a single verified-identity assertion, points at where this goes. But be exact: it’s a spec, not a standard. Grinich says so himself: it isn’t ratified, and only a handful of providers have adopted it. In the enterprise no agentic-identity standard has won yet, and a bank doesn’t bet its identity layer on a contender that may not survive the year. It builds the governance first, because that’s the part the regulator already cares about.

Why It Takes Longer Here

Every primitive Ramp, Stripe, and WorkOS built as an engineering task, a regulated shop builds as engineering plus controls:

The sandbox. Its egress isolation has to be proven to security before it runs.
The context engine. Every source is a data-classification and access-review decision.
The feedback loop. It must emit a reviewable change package with SAST, SCA, and CAB artifacts, and still pass human QA and UAT.
The trigger surface. Every trigger is a scope-of-authority decision.
Human review. Not a bottleneck to optimize away; it’s the regulated SDLC, and it’s mandatory.
Identity. Not a token choice; it’s non-human identity governance the examiner will ask about.

And the engine itself. When Stripe ships a better model into Minions, they rebuild the harness in days. Swapping engines in a regulated shop is not just a rebuild, it’s a re-clear: SR 11-7 treats the model as a model risk, so a new engine means a new risk assessment plus a security re-clearance. The aircraft carrier doesn’t turn fast, but slower is not the same as never.

The State of the Build, and the Fork

This is a blueprint, not a victory lap. Most of these primitives tend to exist as point tools first: a Datadog skill, Jira and Confluence connectors, a GitHub Actions workflow that runs Claude against every diff. What’s unfinished is the integration that turns them into a loop and the governance wrapper that makes it safe to run autonomously. And the harder half isn’t the primitives; it’s the organization. Sol Rashidi’s data is the warning label: 88% of enterprise AI proofs of concept never reach production, and 70% of the failures are organizational, not technical. You can build all six correctly and still fail if the institution can’t absorb the change.

Which leaves a fork. The factory is coming to regulated industries the way every platform shift did: late, carefully, on the institution’s terms. The bank that builds these six primitives, even slowly, ends up owning its harness, and the harness is the thing that compounds. The one that waits for a vendor to sell it a finished factory ends up running someone else’s, shaped to someone else’s workflow, re-priced on someone else’s schedule. The engine is rented either way. The only question is who owns the chassis.

If you’re assembling something like this inside a regulated shop, or you think I’ve got one of the six primitives wrong, I’d genuinely like to compare notes. The companion read is “The Bottleneck Was Never the Code,” on why the factory’s value is the loop, not the typing.

Find me on X @orestesgarcia or LinkedIn /in/setsero.