Field guide · agent testing

Your agent tested the page. Did it learn anything?

It found the right login, the right selector, the seeded state, and the one button that would've fired an expensive job. Then it said "tested and working," the chat ended, and all of it evaporated, so the next agent starts from zero. This guide is about fixing that: turning a guided page test into durable, layered memory instead of a disappearing transcript. The whole protocol is packaged as a free skill you can download for your agent below.

If the page test only lives in the chat transcript, it isn't testing infrastructure. It's evaporating knowledge.

01

The problem: agents forget what they learned

A coding agent is genuinely good at the mechanical part of testing a page. Point it at a running app and it will discover the working account, the exact selectors, the state that needs seeding, and the actions that are dangerous. The trouble isn't capability. It's that none of it is written down anywhere durable.

So the knowledge lives for exactly one conversation. The final message reads "tested and working" — which is the least useful possible summary, because it records the verdict and throws away the method. Tomorrow a different session opens the same page and rediscovers everything from scratch, including the mistakes.

A test that doesn't update durable docs is half a test.

02

The good move: walk the agent through the page

Anthropic's verification-loop advice gets the hard part right. Give the agent the same tools a human tester uses — a dev server, a browser, logs, the database, auth, seeded state, tests — and let it run a loop: act, observe what failed, debug, repeat until the page actually works. In their demo, a person hand-holds the agent through one workflow, then asks it to fold what it learned into a skill, with a standing instruction to keep improving that skill when it hits a blocker.

That guided first pass is the crucial part, and it's worth saying why it works:

You bring the product intent and the risk model — what the page is for, and what must never be clicked casually.
The agent is excellent at recording the literal commands, selectors, state transitions, and success signals.
Together that's a clean capture of how to test the page — far better than either of you produces alone.

None of this guide rejects that. It's a refinement. The piece the demo leaves underspecified is where each lesson should live once you've captured it.
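The act, observe, debug, repeat loop described above can be sketched in a few lines. This is a minimal illustration, not Anthropic's implementation: `verification_loop` and its three callbacks are hypothetical names, and the "page" here is a toy dictionary standing in for a real dev server and browser.

```python
from typing import Callable, Optional


def verification_loop(
    act: Callable[[], None],
    observe: Callable[[], Optional[str]],
    debug: Callable[[str], None],
    max_attempts: int = 5,
) -> int:
    """Act, observe, debug, repeat. Returns the attempt number that passed.

    observe() returns None when the page works, otherwise a failure description.
    """
    for attempt in range(1, max_attempts + 1):
        act()
        failure = observe()
        if failure is None:
            return attempt
        debug(failure)
    raise RuntimeError(f"still failing after {max_attempts} attempts")


# Toy stand-in for a real page: it "works" once the missing seed data exists.
state = {"seeded": False, "visits": 0}


def act() -> None:
    state["visits"] += 1


def observe() -> Optional[str]:
    return None if state["seeded"] else "list is empty: seed data missing"


def debug(failure: str) -> None:
    # A real agent would work out the fix from the failure text and the logs.
    state["seeded"] = True


attempts = verification_loop(act, observe, debug)
print(attempts)  # 2
```

The loop only returns a verdict. Everything the agent learned inside `debug` is lost unless it is written somewhere durable, which is the gap the rest of this guide addresses.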

03

The trap: "just put it all in the skill"

Here's the failure mode that creeps in. A global testing skill starts clean — a tidy process. Then the agent appends a page-specific warning. Then a route-specific selector. Then one repo's account names. Then one app's single most dangerous button. Each addition feels reasonable in the moment.

A few sessions later the skill is bloated, no longer portable to your other projects, and quietly wrong everywhere — full of one app's trivia that doesn't apply to the next. The thing that was supposed to make every project better now drags a single project's weirdness into all of them.

The agent should learn globally how to document a page test. It should not globally remember your app's weirdest button.

04

The fix: global protocol, local recipes

Classify the lesson before you store it. Three layers, each with one job:

Global skill — the reusable behavior. "When testing a page: find or create the repo's runbook, do a guided first pass, preserve safety constraints, then update the right layer." Protocol only.
Repo runbook — this app's facts. Its roles, routes, dev-server command, seed data, cleanup conventions, and the per-page recipes.
Page recipe — exactly how to test one route without setting anything on fire.

The test of where something belongs is simple: would it still be true in a different repo? If yes, it's global. If it's true only for this app, it's the runbook. If it's true only for one route, it's the recipe.

Where each lesson belongs
The lesson	Where it lives
How to find or create a page-test runbook	Global skill
How to classify a starter vs. tested recipe	Global skill
A reusable cleanup convention for smoke records	Global skill
Which role tests `/admin/jobs`	Repo runbook
The dev-server command for this repo	Repo runbook
Test account / setup details	Repo runbook · secure docs
Which button submits an expensive job	Page recipe
A route-specific selector or flow quirk	Page recipe
Seed data that avoids slow setup	Fixture · page recipe
A one-off failure from a transient outage	Final report only

Default to the narrower layer when you're unsure. That one habit is what keeps the global skill from becoming a repo-dependent junk drawer.
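The classification test can be written down as a small decision function. This is a sketch under assumptions: `Layer` and `classify_lesson` are hypothetical names, and "unsure" is modelled as `None` so that the default-to-narrower rule is explicit.

```python
from enum import Enum
from typing import Optional


class Layer(Enum):
    GLOBAL_SKILL = "global skill"
    REPO_RUNBOOK = "repo runbook"
    PAGE_RECIPE = "page recipe"
    FINAL_REPORT = "final report only"


def classify_lesson(
    true_in_other_repos: Optional[bool],
    true_on_other_routes: Optional[bool],
    transient: bool = False,
) -> Layer:
    """Pick the storage layer for one lesson.

    Each question is answered True, False, or None (unsure).
    Unsure answers fall through to the narrower layer.
    """
    if transient:
        return Layer.FINAL_REPORT
    if true_in_other_repos is True:
        return Layer.GLOBAL_SKILL
    if true_on_other_routes is True:
        return Layer.REPO_RUNBOOK
    return Layer.PAGE_RECIPE


# How to find or create a runbook: true everywhere.
print(classify_lesson(True, True).value)      # global skill
# Which role tests /admin/jobs: true for this app only.
print(classify_lesson(False, True).value)     # repo runbook
# Which button submits an expensive job: true for one route.
print(classify_lesson(False, False).value)    # page recipe
# Not sure it generalizes: stay narrow.
print(classify_lesson(None, None).value)      # page recipe
# One-off failure from a transient outage.
print(classify_lesson(False, False, transient=True).value)  # final report only
```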

05

The guided first pass

When a page has no recipe yet and you're available, spend ten minutes driving. This is the part that turns into everything else.

Step 01 / 08

Open the page together

Start on a local or staging server, never production. You want somewhere a mistake is cheap and a real submission can't reach customers, queues, or paid APIs.
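One way to make "never production" more than a habit is a guard that runs before the browser opens. A minimal sketch, assuming an allowlist of hosts: `SAFE_HOSTS`, `assert_safe_target`, and the example hostnames are all hypothetical and should be replaced with whatever is actually safe for your app.

```python
from urllib.parse import urlparse

# Hypothetical allowlist: replace with hosts that are genuinely safe to test against.
SAFE_HOSTS = {"localhost", "127.0.0.1", "staging.example.test"}


def assert_safe_target(url: str) -> str:
    """Return the URL unchanged if its host is allowlisted, otherwise raise."""
    host = urlparse(url).hostname
    if host not in SAFE_HOSTS:
        raise ValueError(
            f"refusing to test {url!r}: host {host!r} is not on the local/staging allowlist"
        )
    return url


assert_safe_target("http://localhost:3000/admin/jobs")  # passes

try:
    assert_safe_target("https://app.example.com/admin/jobs")
except ValueError as err:
    print(err)
```

An allowlist is used instead of a "not production" blocklist so that an unrecognized host fails closed.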

Step 02 / 08

Say what the page is for

Give the agent one sentence of intent, and name the role or account that owns the page. That single line is what lets it tell a routine control apart from a dangerous one.

Step 03 / 08

Direct one action at a time

Don't hand over the whole flow and walk away — narrate it action by action. The agent learns the real path by observing each transition, not by guessing the sequence.

Step 04 / 08

Mark the danger as you go

As you reach each control, say whether it's safe, expensive, destructive, production-sensitive, or externally visible. This running commentary is the raw material for the "unsafe actions" list the next agent will depend on.

Step 05 / 08

Name the success signal

Before moving on, make the agent state the specific visible thing that proves the step worked — a row that appears, a status that flips, a toast that fires. "It worked" is not a signal; a named element is.

Step 06 / 08

Record selectors and state

Have it capture the actual selectors and state changes as they happen, not reconstruct them from memory afterward. These are the details that make a recipe repeatable instead of roughly accurate.
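Capturing as you go is easier when each action has a fixed shape to fill in. The sketch below is one possible structure, not part of the skill itself: `StepRecord`, `PassLog`, and the selectors in the example are hypothetical. The risk labels are the ones named in Step 04.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Risk labels taken from the guided pass: safe, expensive, destructive,
# production-sensitive, externally visible.
RISK_LABELS = {"safe", "expensive", "destructive", "production-sensitive", "externally visible"}


@dataclass
class StepRecord:
    action: str                 # what was done, in plain words
    selector: str               # the literal selector used
    success_signal: str         # the named, visible thing that proves it worked
    risk: str = "safe"
    state_change: Dict[str, str] = field(default_factory=dict)


class PassLog:
    """Collects steps while the guided pass is happening, not afterward."""

    def __init__(self) -> None:
        self.steps: List[StepRecord] = []

    def record(self, step: StepRecord) -> None:
        if step.risk not in RISK_LABELS:
            raise ValueError(f"unknown risk label: {step.risk!r}")
        if not step.selector.strip():
            raise ValueError("a step needs the literal selector, not a description")
        if not step.success_signal.strip():
            raise ValueError('"it worked" is not a signal; name the visible element')
        self.steps.append(step)


log = PassLog()
log.record(
    StepRecord(
        action="create a small list",
        selector="[data-testid='new-list']",  # hypothetical selector
        success_signal="new row appears in the saved-list stack",
        state_change={"saved lists": "n -> n + 1"},
    )
)
print(len(log.steps))  # 1
```

Rejecting a step with no selector or no success signal at record time is the point: the gap is caught while the page is still open.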

Step 07 / 08

Stop before dangerous submissions

Confirm the dangerous controls render and behave correctly, then stop — don't submit them unless you explicitly mean to. Wired-and-correct is a complete result; you don't have to fire the batch job to prove the button works.
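The "stop before submitting" rule can be enforced in code by wrapping whatever click function the agent uses. This is a sketch under assumptions: `make_guarded_click` and `UnsafeActionError` are hypothetical names, and controls are identified by their visible label for simplicity.

```python
from typing import Callable, Iterable


class UnsafeActionError(Exception):
    """Raised when a smoke test tries to trigger a control on the don't-click list."""


def make_guarded_click(
    click: Callable[[str], None],
    unsafe_labels: Iterable[str],
) -> Callable[..., None]:
    """Wrap a click function so unsafe controls need explicit intent."""
    blocked = {label.lower() for label in unsafe_labels}

    def guarded_click(label: str, *, explicit_intent: bool = False) -> None:
        if label.lower() in blocked and not explicit_intent:
            raise UnsafeActionError(
                f"{label!r} is marked unsafe; verify it renders, but do not submit"
            )
        click(label)

    return guarded_click


clicked = []  # stand-in for a real browser driver's click
guarded = make_guarded_click(clicked.append, ["run factory", "queue ready"])

guarded("fact lists")  # safe: goes through
try:
    guarded("run factory")  # unsafe: blocked by default
except UnsafeActionError as err:
    print(err)

print(clicked)  # ['fact lists']
```

The keyword-only `explicit_intent` flag means a dangerous submission can still happen, but only when someone has deliberately written it into the call.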

Step 08 / 08

Convert the pass into a recipe

Before the session ends, write what you learned into the repo runbook using the page-recipe template. If it only lives in the chat, the next agent starts from zero — which is the whole problem this solves.

First-class documentation

"Unsafe actions" are not a footnote. The button that launches a batch factory or sends a real email is exactly what the next agent most needs to be warned about. Write the don't-click list as carefully as the do-this list.

06

The page-recipe template

Every maintained page gets a section like this in the repo runbook. Keep the Status field honest: starter means the route exists but has no real recipe yet; tested means a path was actually exercised. Never claim tested for something you only looked at.

## `<route-or-workflow>`

Status: starter | guided | tested | needs repair

Purpose:        what the page is for, in one line
Role/account:   who owns it
Preconditions:  what must be true first (seed, auth, flags)

Safe actions:    what an agent may do during a smoke test
Unsafe actions:  what it must NOT click without explicit intent

Verification steps:
  1. exact, repeatable steps another agent can follow

Cleanup:          how to remove anything the test created
Checkpoint seed:  durable state that avoids slow/flaky setup
Selector lessons: the quirks future agents will trip on
Last verified:    date · environment · evidence

This template is the missing artifact in most setups. It's agent memory with a source-of-truth path — something a future session can read, trust, and repeat.
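The honesty rule for Status can be checked mechanically. The sketch below models a subset of the template fields and flags a recipe that claims tested without steps or evidence. `PageRecipe` and `problems` are hypothetical names, and the specific checks are one reasonable reading of the rule, not a definitive linter.

```python
from dataclasses import dataclass, field
from typing import List

# The four status values listed in the template.
STATUSES = ("starter", "guided", "tested", "needs repair")


@dataclass
class PageRecipe:
    route: str
    status: str
    purpose: str = ""
    role: str = ""
    unsafe_actions: List[str] = field(default_factory=list)
    verification_steps: List[str] = field(default_factory=list)
    last_verified: str = ""  # date, environment, evidence

    def problems(self) -> List[str]:
        """Return human-readable issues; an empty list means the recipe is consistent."""
        issues = []
        if self.status not in STATUSES:
            issues.append(f"unknown status {self.status!r}")
        if self.status == "tested":
            if not self.verification_steps:
                issues.append("status is tested but there are no verification steps")
            if not self.last_verified:
                issues.append("status is tested but there is no last-verified evidence")
        return issues


claimed = PageRecipe(route="/admin/jobs", status="tested")
print(claimed.problems())
# ['status is tested but there are no verification steps',
#  'status is tested but there is no last-verified evidence']
```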

07

The self-improvement rule, narrowed

"Let the skill improve itself" is good advice with one missing clause: improve which layer. Without that, self-improvement just becomes a faster way to pollute the global skill. Use the same classification every time a blocker shows up:

Reusable across projects → update the global skill.
Specific to this repo → update the repo runbook.
Specific to one page → update that page recipe.
Temporary or speculative → leave it in the final report; don't persist it yet.

Self-improvement without classification becomes skill pollution.
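As a sketch of what the narrowed rule looks like at the end of a session, the function below sorts a batch of blockers into per-layer update queues and keeps temporary ones out of persistence entirely. `route_blockers` and the scope strings are hypothetical, and the example lessons are illustrative only.

```python
from typing import Dict, List, Tuple

# Scope string -> the layer that gets updated.
PERSISTED_LAYERS = {
    "global": "global skill",
    "repo": "repo runbook",
    "page": "page recipe",
}


def route_blockers(
    blockers: List[Tuple[str, str]],
) -> Tuple[Dict[str, List[str]], List[str]]:
    """Split (lesson, scope) pairs into per-layer updates and report-only notes.

    scope is "global", "repo", "page", or "temporary".
    """
    updates: Dict[str, List[str]] = {layer: [] for layer in PERSISTED_LAYERS.values()}
    report_only: List[str] = []
    for lesson, scope in blockers:
        if scope == "temporary":
            report_only.append(lesson)  # mention it in the final report, persist nothing
        elif scope in PERSISTED_LAYERS:
            updates[PERSISTED_LAYERS[scope]].append(lesson)
        else:
            raise ValueError(f"unclassified lesson {lesson!r}: scope {scope!r} is not recognized")
    return updates, report_only


updates, report_only = route_blockers(
    [
        ("check for an existing runbook before creating one", "global"),
        ("dev server needs the seed script run first", "repo"),
        ("batch form is hidden while drafts are planning", "page"),
        ("staging returned 502 for a few minutes", "temporary"),
    ]
)
print(updates["global skill"])  # ['check for an existing runbook before creating one']
print(report_only)              # ['staging returned 502 for a few minutes']
```

Raising on an unrecognized scope, instead of defaulting to the global skill, keeps an unclassified lesson from being persisted by accident.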

08

What it looks like in a real repo

Here is a working example from RISTOR3D, a production app with auth, queues, and expensive media generation. Its runbook opens with global rules (which role owns which surface, prefer local verification, never submit batch factories during a smoke test), then a starter section for every route, and real recipes for the pages that have been tested. This is the shape of the recipe for its source-material importer, `/plan/lesson/import`:

## `/plan/lesson/import`
Status: tested · Role: producer

Safe actions:
  - open the page; click the `fact lists` tab
  - create one small list titled `ZZ smoke <timestamp>`
  - confirm it appears in the saved-list stack
  - verify `run factory` / `queue ready` render — but don't submit

Unsafe actions:
  - do NOT click `run factory`, `queue ready`, or any
    overnight/batch action unless explicitly intended
  - do NOT overwrite an existing list's facts

Selector lessons:
  - active tab is `.lesson-import-layout[data-active-tab]`
    — don't assume the initial tab
  - while drafts are planning, `[data-processing="true"]`
    hides the batch form; the fact-list panel stays available

Notice what's captured: the cleanup convention (ZZ smoke prefixes so test records are obvious and removable), the exact controls to leave alone, and the selector quirk that would otherwise cost the next agent twenty minutes. That's the difference between "tested and working" and a recipe someone can actually run.
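The `ZZ smoke <timestamp>` convention is simple enough to encode in two helpers. A minimal sketch: the function names are hypothetical, and the compact UTC timestamp format is an assumption, since the recipe does not say which format RISTOR3D uses.

```python
from datetime import datetime, timezone
from typing import Iterable, List, Optional

SMOKE_PREFIX = "ZZ smoke "


def smoke_title(now: Optional[datetime] = None) -> str:
    """Build a title that sorts last and is obviously a test record.

    The timestamp format is an assumption for illustration.
    """
    moment = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    return SMOKE_PREFIX + moment.strftime("%Y%m%dT%H%M%SZ")


def smoke_records(titles: Iterable[str]) -> List[str]:
    """Select the records a cleanup step is allowed to remove."""
    return [title for title in titles if title.startswith(SMOKE_PREFIX)]


fixed = datetime(2024, 1, 2, 3, 4, 5, tzinfo=timezone.utc)
print(smoke_title(fixed))  # ZZ smoke 20240102T030405Z

existing = ["Volcano facts", "ZZ smoke 20240102T030405Z", "Plate tectonics"]
print(smoke_records(existing))  # ['ZZ smoke 20240102T030405Z']
```

Cleanup selects by prefix only, so it never touches a list the test did not create, which matches the recipe's rule against overwriting existing facts.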

Ship it · the skill

09

Get the skill

Below is the same page-testing-memory protocol packaged for each major agent, in that tool's native format. Same behavior everywhere; only the file layout changes. Grab the one you use — or take them all.

Codex (Codex App skill, .zip): Full skill folder with the storage-boundary reference and Codex agent manifest. Install as a Codex skill folder.
Claude Code (project skill, .zip): Unzips to .claude/skills/ at a repo root for project scope. Unzip at your repo root.
Claude Desktop (uploadable skill, .zip): The skill folder, ready to upload as a custom skill in Claude Desktop / Claude.ai. Upload the .zip as a skill.
Antigravity (workspace skill, .zip): Unzips to .agents/skills/ in your Antigravity workspace root. Unzip at your workspace root.
Antigravity CLI (Markdown variant, .md): Single-file skill for the CLI. Drop into .agents/skills/ or your global skills dir.
Everything (all bundles, .zip): Every format above plus a README mapping each bundle to its placement. One zip, all agents.

The point was never the test. It was the memory.

A guided first pass is training-data capture. You drive once, the agent records the method, and the method goes into the layer where it stays true. After that, the attention you were spending on routine QA goes back to judgment — and the next agent walks in already knowing how to test the page. Build the runbook, keep the global skill clean, and let the recipe carry what was learned.