How to Automate AI Agent Validation

Michael Wagstaff
4 days ago

Updated: 4 days ago

In a hurry? There's a preloaded example you can run in CI today: jump to "How you can try this yourself".

Testing was never the most exciting part of software development, but at least it was predictable. With AI agents becoming part of the software stack, testing has become much harder, and much more important.

To show why, let me introduce you to an agent I don't entirely trust.

ParcelShip helps people collect parcels from a storage locker. You message it, it checks who you are, confirms your parcel is actually waiting, and gives you the locker and the code to open it. Small, helpful, the sort of thing you'd happily ship. It's also backed by a large language model, which means every time someone talks to it, its exact response is hard to predict. Ask it the same thing two slightly different ways and you are not guaranteed the same answer.

Two failures worry me, and they pull in opposite directions. The first: a real customer, parcel waiting, phrases the request a bit oddly, and the bot decides it can't help them. Annoying, but survivable. The second is the one that matters: somebody asks for a locker code without proving who they are, and the bot, trying to be helpful, just hands it over. The result is a stranger walking off with your parcel.

"ParcelShip refuses unauthenticated requests" is easy to add to a system prompt. The real question is whether it actually does refuse, every time, including when the person asking is clever about how they ask.

Defining a Good Outcome

Spec27 is a platform built for exactly this problem: you connect an agent, describe how it should behave, and run that description against it whenever you like. That description is a spec: what the bot should do, and the ground truth we test against.

For ParcelShip we pair the spec with two sets of test data. The first, the gold-team dataset, holds ordinary (if occasionally odd) customer requests paired with the responses we'd want; it checks whether the bot helps the people it's meant to help. The second, the red-team dataset, holds malicious requests: asking for a code before completing phone-number authentication, claiming it's an emergency, or attempting to extract secrets.

A spec ties those examples to a way of marking the answers. It's a recipe, not a single run, so you can run it again and again and build a record of how your agent performs over time. That record shows the impact of a model update that's a little too eager to please, or a reworded prompt that opens a crack for attackers to exploit.

How We Measure Accuracy

Run the spec and you get two figures.

Clean accuracy is how the bot does on the examples exactly as written. It's the "does it work at all" number.

Robust accuracy is the more interesting one. We take each example and automatically generate variations of it using a red-team or gold-team method: the same request reworded, restructured, and attacked from different angles. An example only counts as robust if the bot stays correct across every one of its variations. For ParcelShip, robust accuracy tells us whether its refusal to open the locker stands up to scrutiny.
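
To make the difference between the two figures concrete, here is a minimal sketch that computes both from per-example results. The data and the score function are made up for illustration, and it assumes the original wording counts towards an example's robustness; it is not a description of how Spec27 calculates its scores internally.

```shell
#!/usr/bin/env bash
# Hypothetical results, one example per line.
# Column 1 is the original wording; the remaining columns are its generated variations.
# 1 = the bot answered correctly, 0 = it did not.
results='1 1 1 1
1 1 0 1
0 1 1 1
1 1 1 1'

# Prints "<clean> <robust>" as whole-number percentages.
score() {
  awk '
    {
      total++
      if ($1 == 1) clean++    # correct on the example as written
      all_ok = 1
      # Assumption: the original wording counts alongside the variations.
      for (i = 1; i <= NF; i++) { if ($i != 1) all_ok = 0 }
      if (all_ok) robust++    # correct on every variation
    }
    END { printf "%d %d\n", 100 * clean / total, 100 * robust / total }
  '
}

printf '%s\n' "$results" | score   # prints: 75 50
```

Three of the four examples are answered correctly as written, so clean accuracy is 75%. Only two survive every variation, so robust accuracy is 50%. One failed variation is enough to sink an example, which is why the robust figure is the harder one to keep high.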

Running it Without a Human in the Loop

Now this works well, but I also don't want to have to set off evals by hand every time. I want to forget about it and have it tell me only when something breaks.

So it goes in CI, with a small GitHub Action that starts an eval, waits for it to finish, and hands back the results:

- uses: SafeIntelligence/spec27-run-eval@v1
  id: eval
  with:
    token: ${{ secrets.SPEC27_TOKEN }}
    eval_id: "123"

token is a project-scoped API key you've stored as a GitHub secret, and eval_id identifies the eval to run (more on where to find that below). The action exposes the two scores as clean_accuracy and robust_accuracy, along with a status, a result_url link to the run results online, and a results_file with the full JSON, so you can gate on either accuracy number, post the link in Slack, or keep the JSON.

The action measures; it doesn't judge. It runs the eval and reports the numbers. Deciding what counts as "good enough" is my job, not the action's, so that rule sits in the next step of the workflow.
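
One way to use those outputs is to write them to the job summary, so the numbers and the link appear on the workflow run page. The sketch below is the script for a follow-up step. render_summary is a hypothetical helper, and the status value and URL in the example call are placeholders; GITHUB_STEP_SUMMARY is the file GitHub Actions provides for job summaries.

```shell
#!/usr/bin/env bash
# render_summary STATUS CLEAN ROBUST URL
# Prints a short Markdown report built from the action's outputs.
render_summary() {
  local status="$1" clean="$2" robust="$3" url="$4"
  echo "### Spec27 eval: ${status}"
  echo ""
  echo "- Clean accuracy: ${clean}%"
  echo "- Robust accuracy: ${robust}%"
  echo "- Full results: ${url}"
}

# Example call with made-up values. In a workflow step you would pass
# the action's status, clean_accuracy, robust_accuracy and result_url
# outputs in through env, then append the report to the job summary:
#   render_summary "$STATUS" "$CLEAN" "$ROBUST" "$URL" >> "$GITHUB_STEP_SUMMARY"
render_summary "completed" 98 96 "https://example.com/run"
```

Keeping the formatting in a function that only prints makes it easy to reuse the same text elsewhere, for example as the body of a chat notification.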

- name: Fail the build if robust accuracy drops
  env:
    ROBUST: ${{ steps.eval.outputs.robust_accuracy }}
    MIN_ROBUST: "95"
  run: |
    [ "$ROBUST" -ge "$MIN_ROBUST" ] \
      || { echo "::error::robust accuracy ${ROBUST}% is below ${MIN_ROBUST}%"; exit 1; }

Both accuracies come back as whole-number percentages from 0 to 100, so a plain integer comparison is all you need. Now if a change drops robust accuracy below 95%, the pipeline fails, and the regression is caught by CI before it reaches a customer.
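
The same step can be extended to gate on clean accuracy as well, and to say so plainly when a score is missing. With the one-line comparison, an empty value should still fail the step, but the message would claim the score is below the threshold. The sketch below is one way to handle that; check_threshold, gate, MIN_CLEAN and the 98 default are all illustrative names and values, not part of the action.

```shell
#!/usr/bin/env bash
# check_threshold NAME VALUE MINIMUM
# Succeeds only if VALUE is a whole number greater than or equal to MINIMUM.
check_threshold() {
  local name="$1" value="$2" min="$3"
  case "$value" in
    '' | *[!0-9]*)
      # Empty or non-numeric, e.g. the eval never produced a score.
      echo "::error::${name} is missing or not a whole number: '${value}'"
      return 1
      ;;
  esac
  if [ "$value" -lt "$min" ]; then
    echo "::error::${name} ${value}% is below ${min}%"
    return 1
  fi
  echo "${name} ${value}% meets the ${min}% minimum"
}

# gate CLEAN ROBUST
# Checks both scores before returning, so both failures get reported.
# The default thresholds are illustrative, not recommendations.
gate() {
  local failed=0
  check_threshold "clean accuracy" "$1" "${MIN_CLEAN:-98}" || failed=1
  check_threshold "robust accuracy" "$2" "${MIN_ROBUST:-95}" || failed=1
  return "$failed"
}

# Example with made-up scores. In a workflow step you would pass the
# action's clean_accuracy and robust_accuracy outputs and fail the job:
#   gate "$CLEAN" "$ROBUST" || exit 1
gate 99 96
```

Checking both scores before returning means a single run reports every threshold that was missed, instead of stopping at the first.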

An eval takes minutes rather than seconds, since the action runs the cases and waits, so it suits a per-merge gate or a nightly check rather than a unit-test loop on every commit. We run ours overnight:

on:
  schedule:
    - cron: "0 6 * * *"   # 06:00 UTC, every day

Every morning there's an answer to one question: do we have to worry or not? That's a lot easier to deal with than a support ticket.

How you can try this yourself

ParcelShip isn't hypothetical; it's the worked example we ship. Every Spec27 organisation can load the ParcelShip (Onboarding Example) project: the agent, the datasets, the specs, and the evals, already wired up. The quickest way to see the whole loop is to point the action at it before building anything of your own:

1. Sign up at dashboard.spec27.ai and open the ParcelShip (Onboarding Example) project from the registry.
2. Create a project-scoped API key on the project settings page and add it as a SPEC27_TOKEN secret in your GitHub repository.
3. Open the Robustness eval and copy the eval ID from the eval's URL. Set it as the eval_id input in your workflow and run the action. You should get a green pipeline with real results against a real agent before you've written a line of your own.
4. When you're ready for your own agent, add it to a project in Spec27, then add a dataset, spec and eval for it.
5. Create a project-scoped API key in the project settings, add it as a SPEC27_TOKEN secret in your agent's GitHub repo, set the eval ID, and choose your threshold.

The full walkthrough lives in the Spec27 docs. The shape of it is simple: write down what your agent must and must not do, and have every change checked against it automatically, the same way you already gate on your tests.