Your agent tested the page. Did it learn anything?
It found the right login, the right selector, the seeded state, and the one button that would've fired an expensive job. Then it said "tested and working," the chat ended, and all of it evaporated — so the next agent starts from zero. This is a guide to fixing that: turning a guided page-test into durable, layered memory instead of a disappearing transcript — and the whole protocol is packaged as a free skill you can download for your agent below.
If the page test only lives in the chat transcript, it isn't testing infrastructure. It's evaporating knowledge.
The problem: agents forget what they learned
A coding agent is genuinely good at the mechanical part of testing a page. Point it at a running app and it will discover the working account, the exact selectors, the state that needs seeding, and the actions that are dangerous. The trouble isn't capability. It's that none of it is written down anywhere durable.
So the knowledge lives for exactly one conversation. The final message reads "tested and working" — which is the least useful possible summary, because it records the verdict and throws away the method. Tomorrow a different session opens the same page and rediscovers everything from scratch, including the mistakes.
A test that doesn't update durable docs is half a test.
The good move: walk the agent through the page
Anthropic's verification-loop advice gets the hard part right. Give the agent the same tools a human tester uses — a dev server, a browser, logs, the database, auth, seeded state, tests — and let it run a loop: act, observe what failed, debug, repeat until the page actually works. In their demo, a person hand-holds the agent through one workflow, then asks it to fold what it learned into a skill, with a standing instruction to keep improving that skill when it hits a blocker.
That guided first pass is the clutch part, and it's worth saying why it works:
- You bring the product intent and the risk model — what the page is for, and what must never be clicked casually.
- The agent is excellent at recording the literal commands, selectors, state transitions, and success signals.
- Together that's a clean capture of how to test the page — far better than either of you produces alone.
None of this guide rejects that. It's a refinement. The piece the demo leaves underspecified is where each lesson should live once you've captured it.
The trap: "just put it all in the skill"
Here's the failure mode that creeps in. A global testing skill starts clean — a tidy process. Then the agent appends a page-specific warning. Then a route-specific selector. Then one repo's account names. Then one app's single most dangerous button. Each addition feels reasonable in the moment.
A few sessions later the skill is bloated, no longer portable to your other projects, and quietly wrong everywhere — full of one app's trivia that doesn't apply to the next. The thing that was supposed to make every project better now drags a single project's weirdness into all of them.
The agent should learn globally how to document a page test. It should not globally remember your app's weirdest button.
The fix: global protocol, local recipes
Classify the lesson before you store it. Three layers, each with one job:
- Global skill — the reusable behavior. "When testing a page: find or create the repo's runbook, do a guided first pass, preserve safety constraints, then update the right layer." Protocol only.
- Repo runbook — this app's facts. Its roles, routes, dev-server command, seed data, cleanup conventions, and the per-page recipes.
- Page recipe — exactly how to test one route without setting anything on fire.
The test of where something belongs is simple: would it still be true in a different repo? If yes, it's global. If it's true only for this app, it's the runbook. If it's true only for one route, it's the recipe.
| The lesson | Where it lives |
|---|---|
| How to find or create a page-test runbook | Global skill |
| How to classify a starter vs. tested recipe | Global skill |
| A reusable cleanup convention for smoke records | Global skill |
| Which role tests /admin/jobs | Repo runbook |
| The dev-server command for this repo | Repo runbook |
| Test account / setup details | Repo runbook · secure docs |
| Which button submits an expensive job | Page recipe |
| A route-specific selector or flow quirk | Page recipe |
| Seed data that avoids slow setup | Fixture · page recipe |
| A one-off failure from a transient outage | Final report only |
Default to the narrower layer when you're unsure. That one habit is what keeps the global skill from becoming a repo-dependent junk drawer.
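The placement test above is small enough to write down as code. The following is a minimal sketch, not part of the packaged skill: the `Layer` enum and `classify_lesson` function are hypothetical names, and the two yes/no/unsure questions are one way to encode "would it still be true in a different repo?" with the narrower layer as the fallback.

```python
from enum import Enum


class Layer(Enum):
    GLOBAL_SKILL = "global skill"
    REPO_RUNBOOK = "repo runbook"
    PAGE_RECIPE = "page recipe"


def classify_lesson(true_in_other_repos, true_across_routes):
    """Return the layer a lesson belongs in.

    Each argument is True, False, or None (None means "not sure").
    Anything short of a confident True falls through to a narrower layer.
    """
    if true_in_other_repos is True:
        return Layer.GLOBAL_SKILL
    if true_across_routes is True:
        return Layer.REPO_RUNBOOK
    return Layer.PAGE_RECIPE


# A lesson you are unsure about lands in the narrowest layer.
print(classify_lesson(None, None).value)
```

The design choice is that only a confident "yes" promotes a lesson upward. Being unsure never widens its scope.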
The guided first pass
When a page has no recipe yet and you're available, spend ten minutes driving. This is the part that turns into everything else.
"Unsafe actions" are not a footnote. The button that launches a batch factory or sends a real email is exactly what the next agent most needs to be warned about. Write the don't-click list as carefully as the do-this list.
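If your smoke tests run through a script, the don't-click list can be enforced rather than just documented. This is a rough sketch under assumptions: `check_action`, `UnsafeActionError`, and the action labels are hypothetical, and matching on a normalized label is only one possible way to identify a control.

```python
class UnsafeActionError(Exception):
    """Raised when a smoke test tries an action on the don't-click list."""


def check_action(action, unsafe_actions, explicit_intent=False):
    """Return the action if it is allowed; raise if it is on the unsafe list.

    Labels are compared case-insensitively with surrounding whitespace removed.
    """
    normalized = action.strip().lower()
    blocked = {name.strip().lower() for name in unsafe_actions}
    if normalized in blocked and not explicit_intent:
        raise UnsafeActionError(f"refusing unsafe action: {action!r}")
    return action


# Hypothetical labels for illustration only.
unsafe = ["launch batch", "send email"]
check_action("open page", unsafe)  # allowed, returns "open page"
check_action("send email", unsafe, explicit_intent=True)  # allowed on purpose
```

Calling `check_action("launch batch", unsafe)` without `explicit_intent=True` raises `UnsafeActionError`, so the default is to refuse.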
The page-recipe template
Every maintained page gets a section like this in the repo runbook. Status is honest: starter means the route exists but has no real recipe yet; tested means a path was actually exercised. Never claim tested for something you only looked at.
    ## `<route-or-workflow>`
    Status: starter | guided | tested | needs repair
    Purpose: what the page is for, in one line
    Role/account: who owns it
    Preconditions: what must be true first (seed, auth, flags)
    Safe actions: what an agent may do during a smoke test
    Unsafe actions: what it must NOT click without explicit intent
    Verification steps:
    1. exact, repeatable steps another agent can follow
    Cleanup: how to remove anything the test created
    Checkpoint seed: durable state that avoids slow/flaky setup
    Selector lessons: the quirks future agents will trip on
    Last verified: date · environment · evidence
This template is the missing artifact in most setups. It's agent memory with a source-of-truth path — something a future session can read, trust, and repeat.
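The "never claim tested for something you only looked at" rule can also be checked mechanically. Here is a minimal sketch, assuming you model a recipe as a small Python object: `PageRecipe` and `recipe_problems` are hypothetical names, only a subset of the template's fields is modeled, and the two requirements for a tested recipe are an illustrative reading of the rule rather than part of the skill.

```python
from dataclasses import dataclass, field

ALLOWED_STATUSES = ("starter", "guided", "tested", "needs repair")


@dataclass
class PageRecipe:
    route: str
    status: str = "starter"
    purpose: str = ""
    verification_steps: list = field(default_factory=list)
    unsafe_actions: list = field(default_factory=list)
    last_verified: str = ""  # date · environment · evidence


def recipe_problems(recipe):
    """Return a list of reasons the recipe's status cannot be trusted."""
    problems = []
    if recipe.status not in ALLOWED_STATUSES:
        problems.append(f"unknown status: {recipe.status!r}")
    if recipe.status == "tested":
        # A tested claim needs both the method and the evidence.
        if not recipe.verification_steps:
            problems.append("tested recipe has no verification steps")
        if not recipe.last_verified:
            problems.append("tested recipe has no last-verified evidence")
    return problems


# A recipe that claims "tested" with nothing behind it gets flagged twice.
for problem in recipe_problems(PageRecipe(route="/example", status="tested")):
    print(problem)
```

A starter recipe with empty fields passes, which matches the template's intent: an honest placeholder is fine, an unsupported "tested" is not.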
The self-improvement rule, narrowed
"Let the skill improve itself" is good advice with one missing clause: improve which layer. Without that, self-improvement just becomes a faster way to pollute the global skill. Use the same classification every time a blocker shows up:
- Reusable across projects → update the global skill.
- Specific to this repo → update the repo runbook.
- Specific to one page → update that page recipe.
- Temporary or speculative → leave it in the final report; don't persist it yet.
Self-improvement without classification becomes skill pollution.
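One way to make the four-way rule concrete is to plan the writes before making any. This is a sketch only: the file paths, the `scope` labels, and `plan_updates` are all assumptions for illustration, and it places page recipes as sections inside the repo runbook, as the template section describes.

```python
# Hypothetical locations; point these at wherever your layers actually live.
GLOBAL_SKILL = "skills/page-testing/SKILL.md"
REPO_RUNBOOK = "docs/page-test-runbook.md"


def plan_updates(lessons):
    """Group lessons by (file, section) and set aside what should not persist.

    Each lesson is a dict with "scope" and "note"; page-scoped lessons
    also carry a "route". Any other scope stays in the final report.
    """
    updates = {}
    report_only = []
    for lesson in lessons:
        scope = lesson["scope"]
        if scope == "global":
            key = (GLOBAL_SKILL, "protocol")
        elif scope == "repo":
            key = (REPO_RUNBOOK, "repo facts")
        elif scope == "page":
            key = (REPO_RUNBOOK, lesson["route"])
        else:
            # Temporary or speculative: mention it in the report, nothing more.
            report_only.append(lesson["note"])
            continue
        updates.setdefault(key, []).append(lesson["note"])
    return updates, report_only


updates, report_only = plan_updates([
    {"scope": "page", "route": "/admin/jobs", "note": "submit button is expensive"},
    {"scope": "temporary", "note": "upstream outage during this run"},
])
print(sorted(updates), report_only)
```

Keeping the plan separate from the write means a human or the agent can review which layer each lesson is about to touch before anything is persisted.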
What it looks like in a real repo
A working example from RISTOR3D, a production app with auth, queues, and expensive media generation. Its runbook opens with global rules — which role owns which surface, prefer local verification, never submit batch factories during a smoke test — then a starter section for every route, and real recipes for the pages that have been tested. Here's the shape of the recipe for its source-material importer, /plan/lesson/import:
    ## `/plan/lesson/import`
    Status: tested · Role: producer
    Safe actions:
    - open the page; click the `fact lists` tab
    - create one small list titled `ZZ smoke <timestamp>`
    - confirm it appears in the saved-list stack
    - verify `run factory` / `queue ready` render, but don't submit
    Unsafe actions:
    - do NOT click `run factory`, `queue ready`, or any overnight/batch action unless explicitly intended
    - do NOT overwrite an existing list's facts
    Selector lessons:
    - active tab is `.lesson-import-layout[data-active-tab]`; don't assume the initial tab
    - while drafts are planning, `[data-processing="true"]` hides the batch form; the fact-list panel stays available
Notice what's captured: the cleanup convention (ZZ smoke prefixes so test records are obvious and removable), the exact controls to leave alone, and the selector quirk that would otherwise cost the next agent twenty minutes. That's the difference between "tested and working" and a recipe someone can actually run.
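The `ZZ smoke` convention is easy to support with two small helpers. This is a sketch under assumptions: the function names are hypothetical, and the compact UTC timestamp format is a guess, since the recipe only says `ZZ smoke <timestamp>`.

```python
from datetime import datetime, timezone

SMOKE_PREFIX = "ZZ smoke "


def smoke_title(now=None):
    """Build a title such as 'ZZ smoke 20240101T120000Z' for a test record."""
    now = now or datetime.now(timezone.utc)
    return SMOKE_PREFIX + now.strftime("%Y%m%dT%H%M%SZ")


def smoke_records(titles):
    """Return only the titles created by smoke tests, ready for cleanup."""
    return [title for title in titles if title.startswith(SMOKE_PREFIX)]


# Real records are left alone; only prefixed ones are selected for removal.
print(smoke_records(["Unit 3 facts", smoke_title()]))
```

The prefix match is deliberately case-sensitive, so a real record would have to be named exactly like a test record to be selected by mistake.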
Get the skill
Below is the same page-testing-memory protocol packaged for each major agent, in that tool's native format. Same behavior everywhere; only the file layout changes. Grab the one you use — or take them all.
Codex App skill
Full skill folder with the storage-boundary reference and Codex agent manifest.
Project skill
Unzips to .claude/skills/ at a repo root for project scope.
Uploadable skill
The skill folder, ready to upload as a custom skill in Claude Desktop / Claude.ai.
Workspace skill
Unzips to .agents/skills/ in your Antigravity workspace root.
Markdown variant
Single-file skill for the CLI. Drop into .agents/skills/ or your global skills dir.
All bundles
Every format above plus a README mapping each bundle to its placement.
The point was never the test. It was the memory.
A guided first pass is training-data capture. You drive once, the agent records the method, and the method goes into the layer where it stays true. After that, the attention you were spending on routine QA goes back to judgment — and the next agent walks in already knowing how to test the page. Build the runbook, keep the global skill clean, and let the recipe carry what was learned.