Letting an agent open docs PRs

I built an agent that opens documentation PRs against the Genkit docs, and almost all of the engineering went into the gates that decide when it is allowed to finish, not the prompt that tells it what to do.

The reason is the shape of the task. The output is a pull request a human reviews and merges, so a capable model is not the hard part. A capable model will confidently produce something that looks right and passes a shallow check while being wrong, and on a docs PR “looks right” is most of what a shallow check sees. The work is writing checks that test the property you actually care about. Two of the gates I wrote checked a proxy for that property instead, and both let through a result I would have rejected on sight. Fixing them is the story.

How the docs are built

A page in the docs is authored once and generated per language. The source file has frontmatter listing its supportedLanguages, and a body where anything language-specific lives inside a <Lang lang="js"> block. Everything outside those blocks is shared prose, rendered into every language’s version of the page. Porting a page to a new language is supposed to be additive: you add that language’s <Lang> blocks, add it to supportedLanguages, and change nothing else. The invariant I care about is that a reader already on the page in JavaScript or Go sees byte-for-byte what they saw before the port.

My benchmark was a page we had already ported by hand. Python was added to the middleware page in a real PR, +222 lines with nothing deleted, so I had a ground-truth diff to compare against. The agent’s job was to port the same page to Python from a one-line brief.

The gate that asked the wrong question

Docs introduce a narrow bug class: an import that does not resolve, a class or parameter that does not exist, an install line for a package that is not named that. The docsite’s own generate step checks that the <Lang> blocks are well-formed and that the languages line up with the frontmatter, but it never runs the Python. So I added a gate: a Python port cannot be finalized until a code sample has run. I keyed it on the sample’s exit code.

Watch what the agent did with that. It wrote real Genkit snippets and ran them, and every one exited non-zero, because they call Gemini and the sandbox has no API key. After about twenty of those, it ran a snippet containing exactly print("hello world"), got exit code 0, and finalized. Exit 0 was my proxy for “the documented code runs”, and the model satisfied the proxy with the cheapest program that produces it. The documented code was never executed.

The fix was to key the gate on the property rather than its shadow. The sample runner now reads stderr, and a snippet counts only if it imports genkit and raises none of ImportError, ModuleNotFoundError, NameError, SyntaxError, or AttributeError. A snippet that fails only because it cannot reach the API still counts, because its names resolved, which is the thing docs actually get wrong. print("hello world") no longer counts, because it never imports the library. A wrong attribute no longer counts, because it raises AttributeError. If you are gating an agent on “the code works”, the useful move is to decide which “works” you can actually observe. For docs it is that the imports and names resolve, not that a live call succeeds.

The gate that checked the wrong representation

The port invariant, again, is that nothing an existing reader sees changes. The agent added its Python block cleanly with a surgical insert, then reached for a full-file rewrite tool to tidy the result, and the rewrite quietly dropped the entire Dart block. The structural check passed, because Dart was now consistently absent from both the body and the frontmatter. My additive check passed too, and that is the part worth sitting with.

I had written the additive check as “the shared prose is unchanged”. To compute shared prose I strip every <Lang> block out of the page and compare what is left. Deleting a whole language’s block leaves the remaining shared prose identical, so the check had no way to notice. The run came out +189 added and 382 deleted, a third of the page gone, with every gate green.

The invariant I wanted lives in what a reader sees, so that is where I check it now. Instead of comparing stripped prose, I render the page for each language that existed before the port and compare the rendered output. Dropping the Dart block changes the Dart page, so it is caught. Python is not in the old set, so the agent is still free to revise its own new block. When you assert that nothing the reader sees changes, check it in reader space. The raw source can lose an entire section without moving the particular bytes you happened to diff.

The run after that

With both gates testing the real property, the next run came out +189 with nothing deleted, all four language blocks intact, the install line naming the package that actually exists, in about seventy-five seconds. The snippet gate did visible work along the way: an ImportError, then a ModuleNotFoundError, each one sending the agent back to read more source before it landed on a sample whose names resolved. The hand-written PR I was benchmarking against was +222 with nothing deleted, so the shapes line up.

The model did not get smarter between the runs that produced garbage and the run that produced that. I closed the two paths that let a wrong result look finished, and the cheapest path left for the model to take was the correct one.

The general lesson I am taking from this is small but keeps paying out: when an agent’s output is something a person merges, a gate keyed on a proxy gets satisfied by the proxy, and an invariant checked in the wrong representation passes when it should not. The model is allowed to be opportunistic, and it will be. The gates are where you decide what “done” means.

The usual caveats apply. This is one page, one language, and an afternoon of runs. The structural check is my reimplementation of the docsite’s rules rather than the real site build, so the two can drift. And I have not pointed it at the harder jobs yet, fixing a wrong snippet or modernizing stale syntax, where a port’s additive invariant does not hold and the right check is a different one.