6/19/2026

Building Mobile QA Agents With Vercel Eve

Authors

Michał Pierzchala

Principal Engineer

Callstack

Mike Grabowski

CTO & Founder

Callstack

Vercel recently released Eve, a framework for building AI agents. They say it’s like Next.js but for agents, so let’s give it a shot.

A while back, to demonstrate the concept of “agentic QA”, which can greatly enhance manual quality assurance in mobile apps, we built a sample QA agent that runs in EAS Workflows and uses agent-device to control and access iOS simulators and Android emulators in a token-efficient way. It runs on every pull request, infers the verification plan from the PR metadata, and posts the result back to GitHub as a comment with relevant evidence, such as screenshots. This gives us an extra level of trust in the change, as the code was not only read, but also executed on a real device.

This agent is built with AI SDK (also from Vercel) and we couldn’t help but migrate it to the new framework. Let’s see it in action.

Thanks to Eve, instead of hiding the agent inside one large TypeScript loop, the implementation becomes a small project:

scripts/agent-qa/eve/
├── agent/
│   ├── agent.ts
│   ├── instructions.md
│   └── tools/
│       └── agent_device.ts
├── package.json
└── tsconfig.json

You can read that tree and understand the agent before it runs. The model is configured in one file. The operating rules live in Markdown. The one custom capability the model can call is a typed tool. The outer runner starts Eve, passes the EAS context into a session, waits for structured output, and shuts the server down.

That is the practical value here. Eve turns the agent from "some code that calls a model" into a reviewable unit of software.

What Eve is

Eve is a TypeScript framework for durable agents. Vercel describes it as filesystem-first: the agent is authored as files under an agent/ directory, and Eve discovers those files at build/runtime.

The common pieces are familiar:

agent/
├── agent.ts
├── instructions.md
├── tools/
├── skills/
├── channels/
├── schedules/
└── subagents/

The important part is not the naming convention by itself. The important part is that the convention separates concerns that usually get mixed together:

instructions.md is the always-on behavior contract.
agent.ts chooses the model and runtime config.
tools/ contains typed functions the model can call.
skills/ can hold Markdown playbooks loaded when needed.
channels/ connect the same agent to HTTP, Slack, Discord, or another entrypoint.
schedules/ let agents run recurring work.
subagents/ holds specialized agents that report to the main agent.

Eve also brings durability into the model loop. A session can span more than one request, stream progress, call tools, pause for human input or approval, and resume after a process restart or redeploy. This is one of my favorite features, as agents do not always finish in one clean request.

Eve is still beta, so this is not a "standard pattern" yet. The APIs may change. But the direction is useful: make the agent understandable from its filesystem, then run it with a runtime that treats multi-step work as a first-class problem.

Why Eve is helpful for mobile QA

Mobile QA agents have a different failure profile than code-only agents.

A code agent can often produce a useful result by reading files and running tests. A mobile QA agent has to deal with a running app. It needs to know whether the correct app is foregrounded. It needs to distinguish the app under test from automation infrastructure. It needs to choose stable selectors or element refs. It needs to capture screenshots, not just text snapshots. It needs to decide when a failure is a product issue and when it is a device/session/tooling issue.

That is why agent-device exists. It gives agents a command surface for iOS and Android automation: launch apps, inspect snapshots, press elements, fill inputs, capture screenshots, and work across simulators, emulators, and physical devices.

But a command surface is only one part of the system.

For PR QA, we also need a pipeline:

GitHub pull request
  -> EAS Workflow
  -> fingerprint and reuse or build native artifacts
  -> repack JavaScript bundle
  -> boot Android emulator and iOS simulator
  -> install and launch the app
  -> run the QA agent
  -> save report artifacts
  -> post a GitHub comment with status and screenshots

The EAS workflow in this repo already handles the platform side. It calculates fingerprints, reuses compatible builds when possible, falls back to fresh Android and iOS simulator builds, downloads the artifacts, provisions the device, runs the agent, and posts one combined PR comment.

The Eve migration simplifies the agent layer inside that pipeline without changing the platform side.

Before Eve: one large tool loop

Before this PR, the QA agent lived in scripts/agent-qa/index.ts as one large AI SDK ToolLoopAgent setup.

That file owned almost everything:

environment parsing
PR context construction
agent-device command execution
model instructions
tool schemas
screenshot listing and upload
report rendering
guardrails around missing screenshots and invalid blocked reports
final artifact writing

That worked pretty well for our simple use case. In reality, this agent may grow quickly.

The agent's behavior is harder to review because the prompt is embedded in a TypeScript array. The tools are not separate modules. The runner has to know too much about the agent's internal tool loop. Small changes to agent behavior and small changes to CI lifecycle sit next to each other, and it takes a lot of context to piece it all together. Nothing that we couldn't refactor for easier maintenance, but still.

The Eve version does not remove all glue. It still needs a runner because the agent is being executed inside EAS Workflows, not as a long-lived deployed service. But it moves the agent's operating surface into the Eve project where each concern has a home.

The Eve agent

This is where the experiment got more concrete.

The agent did get smaller, but not because Eve removed all infrastructure around it. We still need an index.ts "runner" script. In this setup the agent is not deployed as a long-lived service. It runs inside EAS Workflows, starts a local Eve server, sends one QA request, writes CI artifacts, uploads screenshots to Vercel Blob, and shuts the server down.

That is a narrower validation than "Eve solved agents", at least for QA agents running as part of a CI/CD pipeline. It is still useful and opens new possibilities to explore.

The Eve part now looks like this:

scripts/agent-qa/eve/
├── agent/
│   ├── agent.ts
│   ├── instructions.md
│   └── tools/
│       └── agent_device.ts
├── package.json
└── tsconfig.json

agent.ts selects the model. instructions.md carries the mobile QA behavior. agent_device.ts is the only custom tool. Most of the agent rules moved from TypeScript into Markdown, which is exactly where I want them when reviewing how the agent is allowed to behave.
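To make "typed tool" a bit more tangible, here is a minimal sketch of how the input to such a tool could be narrowed before anything touches a device. The action names follow the capabilities listed earlier (launch, snapshot, press, fill, screenshot), but the exact shapes, the `DeviceAction` type, and `parseDeviceAction` are assumptions made for illustration. They are not the actual contents of agent_device.ts, and they do not use Eve's tool API.

```typescript
// Hypothetical action shapes; the real agent_device.ts tool may expose a different surface.
type DeviceAction =
  | { kind: "launch"; appId: string }
  | { kind: "snapshot" }
  | { kind: "press"; ref: string }
  | { kind: "fill"; ref: string; text: string }
  | { kind: "screenshot"; label: string };

// Narrow untrusted model output into a known action, or explain why it was rejected.
function parseDeviceAction(input: unknown): DeviceAction | { error: string } {
  if (typeof input !== "object" || input === null) {
    return { error: "expected an object" };
  }
  const raw = input as Record<string, unknown>;
  // Read a non-empty string field, or undefined when it is missing or malformed.
  const str = (key: string): string | undefined => {
    const value = raw[key];
    return typeof value === "string" && value !== "" ? value : undefined;
  };

  switch (raw.kind) {
    case "launch": {
      const appId = str("appId");
      return appId ? { kind: "launch", appId } : { error: "launch needs appId" };
    }
    case "snapshot":
      return { kind: "snapshot" };
    case "press": {
      const ref = str("ref");
      return ref ? { kind: "press", ref } : { error: "press needs ref" };
    }
    case "fill": {
      const ref = str("ref");
      // Empty text is allowed here so the agent can clear an input.
      const text = typeof raw.text === "string" ? raw.text : undefined;
      return ref && text !== undefined
        ? { kind: "fill", ref, text }
        : { error: "fill needs ref and text" };
    }
    case "screenshot": {
      const label = str("label");
      return label ? { kind: "screenshot", label } : { error: "screenshot needs label" };
    }
    default:
      return { error: `unknown action kind: ${String(raw.kind)}` };
  }
}
```

Returning an error object instead of throwing keeps the rejection reason available to hand back to the model, which is one way to avoid a vague "blocked" report later.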

The runner passes clientContext into the Eve session: PR metadata, build ID, workflow URL, platform, application ID, device name, and screenshot directory. The agent uses that context directly. It does not need the runner to assemble a long prompt.
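As a rough illustration of that context object, the sketch below collects the fields named above into one typed value. The `QaClientContext` interface, the `buildClientContext` helper, and every `QA_*` variable name are hypothetical. The real runner may read different variables and shape the object differently.

```typescript
type Platform = "ios" | "android";

// Hypothetical shape covering the fields the article lists for clientContext.
interface QaClientContext {
  pr: { number: number; title: string; body: string };
  buildId: string;
  workflowUrl: string;
  platform: Platform;
  applicationId: string;
  deviceName: string;
  screenshotDir: string;
}

// Build the context from a plain key/value map (for example a copy of the CI environment).
// All QA_* names are placeholders, not the variables used in the actual workflow.
function buildClientContext(env: Record<string, string | undefined>): QaClientContext {
  const required = (name: string): string => {
    const value = env[name];
    if (!value) throw new Error(`Missing required variable: ${name}`);
    return value;
  };

  const platform = required("QA_PLATFORM");
  if (platform !== "ios" && platform !== "android") {
    throw new Error(`Unsupported platform: ${platform}`);
  }

  const prNumber = Number(required("QA_PR_NUMBER"));
  if (!Number.isInteger(prNumber) || prNumber <= 0) {
    throw new Error("QA_PR_NUMBER must be a positive integer");
  }

  return {
    pr: {
      number: prNumber,
      title: required("QA_PR_TITLE"),
      body: env["QA_PR_BODY"] ?? "", // a PR description can legitimately be empty
    },
    buildId: required("QA_BUILD_ID"),
    workflowUrl: required("QA_WORKFLOW_URL"),
    platform,
    applicationId: required("QA_APPLICATION_ID"),
    deviceName: required("QA_DEVICE_NAME"),
    screenshotDir: required("QA_SCREENSHOT_DIR"),
  };
}
```

Failing fast on a missing value keeps configuration problems out of the agent's report, where they would otherwise look like a device or tooling issue.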

Then Eve gives the runner structured output. That became the most important simplification.

Structured output instead of another tool

The earlier version of this migration had a write_report tool. It worked, but it also duplicated responsibilities.

The current version asks Eve for structured data through outputSchema. The agent returns the pieces CI needs:

overall status
summary
checked items
issues
next steps
screenshot labels
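The list above maps naturally onto a small data contract. The following is a sketch of that contract as a plain TypeScript type with a runtime guard, not the schema from the PR. The field names are guesses, and of the three status values only "blocked" is mentioned in this article, so "passed" and "failed" are assumptions. How the schema is actually handed to Eve through outputSchema is not shown here.

```typescript
// Assumed status values; only "blocked" is named in the article.
const STATUSES = ["passed", "failed", "blocked"] as const;
type QaStatus = (typeof STATUSES)[number];

// Hypothetical field names for the pieces CI needs.
interface QaReport {
  status: QaStatus;
  summary: string;
  checkedItems: string[];
  issues: string[];
  nextSteps: string[];
  screenshotLabels: string[];
}

const isStringArray = (value: unknown): value is string[] =>
  Array.isArray(value) && value.every((item) => typeof item === "string");

// Runtime guard: true only when the value has every field with the expected type.
function isQaReport(value: unknown): value is QaReport {
  if (typeof value !== "object" || value === null) return false;
  const r = value as Record<string, unknown>;
  return (
    STATUSES.some((status) => status === r.status) &&
    typeof r.summary === "string" &&
    isStringArray(r.checkedItems) &&
    isStringArray(r.issues) &&
    isStringArray(r.nextSteps) &&
    isStringArray(r.screenshotLabels)
  );
}
```

A guard like this lets the caller treat anything that does not match as "no structured output" rather than trusting a partially filled object.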

The runner takes that object and owns the rest. It writes report.json, section.md, and status.txt. It scans the screenshot directory. It uploads screenshots to Vercel Blob when the token is present. It formats the GitHub comment data.

That boundary feels better. The agent should decide what happened. The CI runner should decide how that result becomes artifacts.
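Here is a small sketch of the runner side of that boundary: a pure function that turns the agent's result into the contents of the three files named above. The file names come from this article, while the `ReportData` type, the `renderArtifacts` helper, and the Markdown layout are assumptions. Writing the strings to disk and uploading screenshots are left out.

```typescript
// Hypothetical subset of the structured output the runner needs for rendering.
interface ReportData {
  status: string;
  summary: string;
  checkedItems: string[];
  issues: string[];
  nextSteps: string[];
}

// Map each artifact file name to its content. Nothing is written to disk here,
// so the formatting can be tested without a filesystem.
function renderArtifacts(report: ReportData, platform: string): Record<string, string> {
  const bullets = (items: string[]): string =>
    items.length > 0 ? items.map((item) => `- ${item}`).join("\n") : "- none";

  const section = [
    `### ${platform}: ${report.status}`,
    "",
    report.summary,
    "",
    "Checked:",
    bullets(report.checkedItems),
    "",
    "Issues:",
    bullets(report.issues),
    "",
    "Next steps:",
    bullets(report.nextSteps),
  ].join("\n");

  return {
    "report.json": JSON.stringify(report, null, 2) + "\n",
    "section.md": section + "\n",
    "status.txt": report.status + "\n",
  };
}
```

Because the function only depends on its arguments, the same report always produces the same comment text, which helps keep the GitHub output stable between runs.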

It also made the code smaller. Excluding lock files and local setup helpers in check.mjs, the migration ended up deleting around 130 lines (so far). That is not a huge number, but it matters because the deleted code was not business logic. It was agent loop scaffolding, tool plumbing, and prompt wiring. As some would say: it’s something!

What still stays outside Eve

EAS still fingerprints the project, reuses or builds native artifacts, boots Android and iOS targets, installs and launches the app, runs the QA command, and posts the PR comment. The index.ts runner still owns the Eve lifecycle in CI. It builds the Eve app, starts eve start, waits for the server, sends the session message, handles missing structured output, writes fallback reports, uploads Blob artifacts, and shuts the process down.

That sounds like a lot, but it is the right kind of glue. It is CI lifecycle code, not agent behavior.
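The lifecycle part of that glue can be sketched in a few lines. The example below is not the runner from the PR and does not call Eve. Starting the server, probing it, and sending the session message are injected as functions, and all names (`ServerHandle`, `waitUntilReady`, `withServer`, the option fields) are hypothetical. It only shows the ordering: start, wait, run, fall back when there is no structured output, always shut down.

```typescript
interface ServerHandle {
  stop(): Promise<void>;
}

interface LifecycleOptions<T> {
  start: () => Promise<ServerHandle>; // e.g. spawn the local server process
  probe: () => Promise<boolean>; // e.g. check that the server answers
  task: () => Promise<T | undefined>; // e.g. send the session message
  fallback: (reason: string) => T; // build a fallback report
  attempts: number;
  delayMs: number;
}

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Poll the probe until it succeeds or the attempts run out.
async function waitUntilReady(
  probe: () => Promise<boolean>,
  attempts: number,
  delayMs: number,
): Promise<void> {
  for (let attempt = 1; attempt <= attempts; attempt++) {
    const ready = await probe().catch(() => false); // a throwing probe counts as "not yet"
    if (ready) return;
    if (attempt < attempts) await sleep(delayMs);
  }
  throw new Error(`Server not ready after ${attempts} attempts`);
}

// Run one task against a freshly started server and always stop it afterwards.
async function withServer<T>(options: LifecycleOptions<T>): Promise<T> {
  const server = await options.start();
  try {
    await waitUntilReady(options.probe, options.attempts, options.delayMs);
    const output = await options.task();
    return output ?? options.fallback("agent returned no structured output");
  } catch (error) {
    return options.fallback(error instanceof Error ? error.message : String(error));
  } finally {
    await server.stop();
  }
}
```

Routing both "no output" and thrown errors through the same fallback means the CI job always ends with a report, even when the agent never produced one.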

The agent_device tool also runs in Eve's app runtime, not inside the Eve sandbox. That is intentional. agent-device needs access to the EAS job host, simulator or emulator state, platform tooling, and its own daemon/session files. Running it in a separate sandbox would make the setup harder, not simpler, unless we connected to some remote simulators.

We also removed one source of drift: the workflow no longer installs agent-device globally. The repo already pins it as a development dependency, so the QA scripts resolve the local binary from node_modules/.bin.
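Resolving the pinned binary can be as simple as building the path from the repository root. This is a minimal sketch, assuming a POSIX-style CI host and a hypothetical `localBinPath` helper. The actual scripts may resolve the binary differently.

```typescript
// Build the path to a locally installed CLI under node_modules/.bin.
// Assumes forward-slash paths, as on macOS and Linux CI hosts.
function localBinPath(repoRoot: string, binName: string): string {
  const root = repoRoot.replace(/\/+$/, ""); // drop trailing slashes
  return `${root}/node_modules/.bin/${binName}`;
}

// Example: the version pinned in package.json, rather than a global install.
const agentDeviceBin = localBinPath("/path/to/repo", "agent-device");
```

Using the local path means the binary version is whatever the lock file says, so the CI run and a local run exercise the same tool.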

So, is this really better?

For this kind of agent, I think yes, but in a specific way.

This rewrite does not prove the whole Eve durability story. The agent is short-lived. It does not pause for a human, resume days later, or rely on Vercel Sandbox persistence. In this setup, EAS starts Eve for one CI job and tears it down when the report is done. It also works outside of Vercel infrastructure.

What it does prove is smaller and more practical: Eve gives the agent a clearer shape. The behavior is in Markdown. The model selection is isolated. The only custom capability is a typed agent_device tool. The output contract is explicit. The runner is still code, but it is now mostly lifecycle and artifact handling.

That is already enough to make the agent easier to review and maintain.

The failure modes are easier to see too. Mobile QA agents can test the wrong foreground app, confuse automation infrastructure with the app under test, trust accessibility text when a screenshot is needed, or report blocked for a vague tool-loop reason. Those rules now live in instructions.md, not buried inside a TypeScript array.

I would not call this a final pattern yet. Eve is still in preview, and this is only one workflow. But the migration gave me a better local shape for an agent that runs inside existing CI. It reduced code, kept the output deterministic enough for GitHub comments, and made the important behavior easier to inspect.

It also gave me an idea for a new kind of agent I could deploy on Vercel Sandbox, but more on that soon. You can find the PR with the migration from AI SDK to Eve here.

After all, why shouldn’t I keep it?