How Skillgym Helps You Verify Agent Skills Still Work After Every Change

At Callstack, we maintain a growing set of agent skills. Things like agent-device, which gives agents a way to control mobile devices. Or agent-react-devtools, which lets them interact with React DevTools. Or Rozenite, which allows agents to interact with Rozenite plugins and inspect app state. Each skill ships as a SKILL.md file: plain markdown that tells the agent what it can do, when to do it, and how.

Maintaining them may sound simple. Rewrite a sentence here, reorder a section there, sharpen an example. But we kept running into the same quiet problem: we would make a small change, fire a prompt in the CLI, and the output would look fine. Then someone would try the same skill on a different agent, and it would pick the wrong tool, skip a step, or go off on its own. The markdown looked the same, but the behavior had changed.

The frustrating part was that we had no way to tell when behavior changed. There was no record of what the agent actually did, no way to compare runs, and no way to say with confidence that a change made things better or worse. It was just vibes and hope.

That is the problem Skillgym solves.

The questions a quick check will not answer

When you edit a SKILL.md, three questions matter:

  • Did the agent load this skill, and not reach for something else?
  • Did it follow the right steps: reading the right files, running commands in the expected order?
  • Was the final result still correct, without silently taking twice the tokens to get there?

You can try to answer these by firing a prompt and reading the output. But that only tells you what happened once, on one agent, while you were watching. It doesn't give you anything to compare against next time, and it certainly doesn't catch the regression that only shows up on Codex.

What Skillgym does

Skillgym runs an agent session against a real CLI (OpenCode, Codex, Claude Code, or Cursor Agent), collects what happened, and lets you write pass/fail assertions against the result. You describe what correct behavior looks like. Skillgym tells you whether it happened, and keeps the artifacts around when it did not.

No mocks. No simulated transcripts. The agent reads the skill the same way it would in production, in a real working directory, with real tool calls going out. If it breaks, it breaks the same way your users would see it break.

What you can check

When you inspect an agent run by hand, you are already asking the right questions. You scroll through the trace looking for the skill name, checking which files were opened, making sure commands ran in the right order. Assertions are those same checks written down, so you do not have to do them manually every time.

Did the agent use the right skill?

assert.skills.has(report, "agent-device");

Did it read the right files?

assert.fileReads.includes(report, /bootstrap-install\.md$/);

Did commands run in the right order?

assert.commands.before(
  report,
  /^agent-device\s+open\b/,
  /^agent-device\s+snapshot\b/
);
assert.commands.before(
  report,
  /^agent-device\s+snapshot\b/,
  /^agent-device\s+close\b/
);

Was the final output correct?

assert.match(ctx.finalOutput(), /Settings opened/i);

Throw, and the run fails, with the exact assertion, the full session artifact, and a path to dig into. Return, and it passes. That's the whole contract.

You do not need to write all of these at once. A single assert.skills.has check already tells you more than a one-off manual prompt. Start there and add more as you learn what your skill actually does.

And because assertions are just TypeScript functions, you are not limited to the built-in helpers. Skillgym passes you the full session data: every tool call, every file read, every command, the final output, token counts. If the built-in helpers do not cover what you need, you can write the check yourself.
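For example, you might want to fail a run whenever the agent re-executes a command it already ran, a common symptom of a skill sending the agent in circles. Here is a minimal sketch of such a custom check; the report shape (a `commands` array) is an assumption for illustration, so adapt the field names to the session data Skillgym actually hands your assert function:

```typescript
// Assumed minimal shape of the session data; the real report Skillgym
// passes in may name these fields differently.
interface SessionReport {
  commands: string[]; // shell commands the agent ran, in order
  tokens: number;     // total tokens consumed by the session
}

// Custom assertion: throw if the agent ran any command verbatim more
// than once. Throwing fails the run; returning normally passes it.
function assertNoRepeatedCommands(report: SessionReport): void {
  const seen = new Set<string>();
  for (const cmd of report.commands) {
    if (seen.has(cmd)) {
      throw new Error(`Command ran more than once: ${cmd}`);
    }
    seen.add(cmd);
  }
}
```

Because this is just a function that throws or returns, it plugs into a test case exactly like the built-in helpers do.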

One skill, every agent

Here is something we learned the hard way: a SKILL.md change that passes on Claude Code can silently break on Codex. Behavior that works with one model's context window may not survive another. The agent that uses your skill in production might not be the one you tested with.

Skillgym runs the same suite across every configured runner (OpenCode, Codex, Claude Code, Cursor Agent) so you see the full picture before you trust the change. If one runner starts behaving differently after an edit, you will know which one, and you will have the session to show why.
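In the config, that means declaring one runner per agent you care about. The sketch below assumes the same config shape as the getting-started example; the "codex" runner type string and its model name are illustrative assumptions, so check Skillgym's documentation for the identifiers your version accepts:

```typescript
import type { SkillGymConfig } from "skillgym";

// Two runners: the same suite runs once against each of them,
// so a regression that only affects one agent still surfaces.
const config: SkillGymConfig = {
  runners: {
    "claude": {
      agent: {
        type: "claude-code",
        model: "claude-sonnet-4-6",
      },
    },
    // Assumed identifiers for a second runner, shown for illustration.
    "codex": {
      agent: {
        type: "codex",
        model: "gpt-5-codex",
      },
    },
  },
};

export default config;
```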

Giving each run a clean slate

Some skills do not just read files. They create them, modify them, run install scripts, or depend on a specific directory structure being in place. If two runs share the same workspace, they can bleed into each other in ways that are hard to reproduce and even harder to debug.

Skillgym supports isolated workspaces: each case gets its own fresh directory before the agent starts. You can point it at a template directory to copy a full project tree into place, and run a bootstrap command before the agent touches anything. When a run fails, the workspace is kept as-is under the output directory so you can inspect exactly what the agent saw.
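As a sketch of what that looks like per case, the `workspace` field and its `template` and `bootstrap` option names below are hypothetical, used here only to illustrate the idea of a templated, bootstrapped, throwaway directory; consult Skillgym's documentation for the real option names:

```typescript
// Hypothetical per-case workspace isolation sketch.
const testCase = {
  id: "install-from-template",
  prompt: "Run the install script and report the result.",
  workspace: {
    template: "./fixtures/sample-project", // copied fresh before every run
    bootstrap: "npm install",              // runs before the agent starts
  },
};
```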

When a skill edit gets expensive

Some regressions do not break behavior; they inflate it. An innocent-looking restructure can make the agent loop through extra steps, re-read files it already has, or recover from confusion it should not have encountered. The result looks correct. The token bill is quietly twice what it was.

Skillgym can track token usage per case and per runner against a stored baseline. When a skill edit pushes usage past your tolerance, the run flags it. Set the baseline once, update it deliberately.
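The drift check itself is simple arithmetic: a stored baseline plus a relative tolerance. This sketch shows the idea under that assumption; the function name and parameters are illustrative, not Skillgym's API:

```typescript
// Returns true when a run's token usage drifts past the allowed
// tolerance over the stored baseline, e.g. tolerance 0.2 means
// "fail if usage exceeds the baseline by more than 20%".
function exceedsTokenBaseline(
  used: number,     // tokens consumed by this run
  baseline: number, // stored baseline for this case and runner
  tolerance: number // allowed relative drift
): boolean {
  return used > baseline * (1 + tolerance);
}
```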

Getting started

Install as a dev dependency:

npm install -D skillgym

Add a skillgym.config.ts with at least one runner:

import type { SkillGymConfig } from "skillgym";

const config: SkillGymConfig = {
  runners: {
    "my-agent": {
      agent: {
        type: "claude-code",
        model: "claude-sonnet-4-6",
      },
    },
  },
};

export default config;

Write a first test:

// skillgym/my-suite.ts
import { assert, type TestCase } from "skillgym";

const suite: TestCase[] = [
  {
    id: "agent-device-settings-check",
    prompt: [
      "Use agent-device to open Settings,",
      'verify "Privacy" is visible with snapshot,',
      "then close the session.",
    ].join(" "),
    async assert(report) {
      assert.skills.has(report, "agent-device");
      assert.commands.before(
        report,
        /^agent-device\s+open\b/,
        /^agent-device\s+snapshot\b/
      );
      assert.commands.before(
        report,
        /^agent-device\s+snapshot\b/,
        /^agent-device\s+close\b/
      );
    },
  },
];

export default suite;

Run it:

npx skillgym run ./skillgym/my-suite.ts

Start with one runner and one case. Widen the matrix once you trust the setup. It is also worth keeping individual cases focused on a single behavior rather than testing an entire workflow in one prompt. Smaller cases are cheaper to re-run, easier to debug when they fail, and give you a clearer signal about what actually changed.

Final words

Skills are how we extend what agents can do. We write them, we refine them, and we ship them to users who depend on them working correctly. But until now, there was no straightforward way to verify that a skill still behaves the way you intended after every edit. You just had to run it and hope for the best.

Skillgym is our answer to that. It is open source, under active development, and we would love to hear how it holds up against the skills you maintain.

As always, if you encounter an issue or have an idea for new functionality, don't hesitate to let us know via issues in the GitHub repository. Happy hacking!
