1 min read

6/26/2026

How Expensify Uses Agent-Device for Mobile Bug Evidence and Profiling

Authors

Kacper Mikołajczak

Software Developer

Callstack

Bartłomiej Obudziński

React Native Developer

Callstack

AI coding agents are already strong at general development work. On the web, they can inspect pages through browser tooling. Mobile raises the bar: to be useful in the full loop, agents need to run the app, inspect the screen, and collect proof across simulators, emulators, and real devices.

Today, developers still handle much of that on-device work manually. Expensify's app is built by hundreds of contributors, so we have been exploring how agent-device can move those repeated mobile workflows into the agent loop and free engineers to spend more time on product work.

What agents lacked

Point a coding agent at a mobile codebase and it could already do a lot: read the code, follow a navigation regression to its source, and draft a plausible fix. What it could not do was see whether the fix worked on a real screen.

It was possible to reach the device through native tools, but each platform has its own interface, and those tools do not give the agent what it needs most: a clean, structured read of what is on screen.

The reasoning half of the loop was handled. The mobile half, including simulators, builds, taps, and recordings, was still mostly in developers' hands.

What agent-device adds

agent-device is the piece that picks up that mobile half. Under the hood, it uses platform backends and low-level tools such as adb and simctl. It hides that fragmentation behind one consistent API, so the agent learns a single way of working instead of a different dialect for each platform.

The bigger shift is how the agent sees the app. Instead of a screenshot, a snapshot returns the screen as an optimized, structured accessibility tree. That is semantically richer than an image and far cheaper in tokens. The agent then acts on those elements through plain commands such as press, fill, and find (by label, role, or text), rather than guessing at coordinates.

Page: Settings
Snapshot: 14 visible nodes (32 total)
Collapsed 18 Android helper nodes
@e1 [scroll-area] "com.android.settings:id/settings" [scrollable]
@e2 [group] "Search Settings"
@e4 [scroll-area] "Wallpaper & style, Display & touch ..." [scrollable]
  [content below scroll-area hidden]
@e5 [group]
@e6 [list]
  [content above list hidden]
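
Output in this shape is easy to consume programmatically. As a minimal sketch, assuming each node line looks exactly like the sample above (an `@eN` reference, a bracketed role, an optional quoted label, and optional bracketed flags), the Python below turns a snapshot into records. The `Node` type and `parse_snapshot` function are hypothetical names used for illustration; they are not part of agent-device, and the real format may carry more than this sample shows.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional

# Pattern inferred from the sample snapshot only (an assumption, not a spec).
NODE_RE = re.compile(
    r'^(?P<ref>@e\d+)\s+\[(?P<role>[^\]]+)\]'
    r'(?:\s+"(?P<label>[^"]*)")?'
    r'(?P<flags>(?:\s+\[[^\]]+\])*)\s*$'
)


@dataclass
class Node:
    ref: str                      # e.g. "@e1"
    role: str                     # e.g. "scroll-area"
    label: Optional[str] = None   # quoted text, if the node has any
    flags: List[str] = field(default_factory=list)  # e.g. ["scrollable"]


def parse_snapshot(text: str) -> List[Node]:
    """Extract node lines; headers and '[content ... hidden]' notes are skipped."""
    nodes = []
    for line in text.splitlines():
        match = NODE_RE.match(line.strip())
        if match is None:
            continue
        flags = re.findall(r"\[([^\]]+)\]", match.group("flags"))
        nodes.append(Node(match.group("ref"), match.group("role"),
                          match.group("label"), flags))
    return nodes
```

Because every node carries a stable reference and a role, an agent can pick a target by meaning (for example, the first node whose flags include `scrollable`) instead of by pixel position.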

That is the core idea: the agent gets hands and eyes. It can read the screen and act on it instead of only reasoning about code it never gets to run.

agent-device also lets agents profile performance, debug native crashes, inspect logs, inspect network requests, connect to React DevTools for React component optimization, run Maestro tests, batch commands, or connect to a remote simulator. Check the project’s docs for more.

How we use agent-device at Expensify to speed up bug-fix evidence

This bottleneck is familiar to mobile engineering teams: a bug report comes in with repro steps and affected platforms. A developer reproduces it, writes a fix, and then needs to ship proof: screen recordings on every affected platform showing the bug is gone, attached to the PR.

At Expensify, collecting this evidence is mandatory. Across hundreds of issues and PRs, multiplied by the number of affected platforms, even an optimistic five minutes per platform turns evidence collection into a costly part of the fix.

Where agent-device-evidence kicks in

To automate this part of the development loop, we built agent-device-evidence. It is an internal skill that helps collect evidence and iterate on fixes.

You hand it a PR or issue link; it fetches the body, parses the declared repro or test steps, and reads the issue's platform checkboxes to decide where to run. Then it works in two phases. The warm-up phase drives the steps autonomously from a cold start, one at a time. It then distills the successful actions into an .ad script, leaving out retries and dead ends. The recording phase replays the script to collect evidence for each flow and platform that a developer would otherwise handle manually.
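
The platform-selection step can be sketched in a few lines. This is an illustration, not the skill's actual code: it assumes the issue body uses standard GitHub task-list syntax (`- [x]` for a ticked box), and `checked_platforms` is a hypothetical helper name.

```python
import re
from typing import List

# GitHub task-list item: "- [x] Label" (ticked) or "- [ ] Label" (unticked).
CHECKBOX_RE = re.compile(r"^\s*[-*]\s+\[(?P<mark>[ xX])\]\s+(?P<label>.+?)\s*$")


def checked_platforms(issue_body: str) -> List[str]:
    """Return the labels of ticked task-list items, in document order."""
    platforms = []
    for line in issue_body.splitlines():
        match = CHECKBOX_RE.match(line)
        if match and match.group("mark").lower() == "x":
            platforms.append(match.group("label"))
    return platforms
```

Given a body containing `- [x] iOS: App` and `- [ ] Android: App` (made-up labels), the function returns only `["iOS: App"]`, which is the list of targets the recording phase would need to cover.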

To show how the pipeline fits together, here is how the agent works on a real report: #89526. The warm-up phase starts by validating that the bug exists:

[Figure: agent-device goes through the app to verify that the bug exists]

[Figure: agent-device-evidence output details include a recording that shows the bug]

Iterating over the fix with pre-defined repro steps

Reproducing the bug once is only the start. The real time goes into the fixing cycle, and that is where the recorded steps earn their keep: instead of re-walking the bug by hand after every change, we replay the saved script and let it tell us whether we are getting closer.

The .ad script is the artifact. Here is the real one from this bug:

# @desc   Issue #89526 repro: open workspace Categories > More > Settings,
#         then tap a default spend category.
press "label=\"Inbox\""
press "label=\"Workspaces\""
press "role=\"link\" label=\"Workspace name: ${WORKSPACE_NAME}, Default, Owner: ${OWNER_NAME}, Workspace type: Collect\""
press "text=\"Categories\" || role=\"cell\" label=\"Categories\""
# @record-start
press "text=\"More\" || role=\"button\" label=\"More\""
press "text=\"Settings\" || role=\"button\" label=\"Settings\""
press "text=\"Commuter, Car\" || role=\"cell\" label=\"Commuter, Car\""
is exists "text=\"Hmm... it's not here\" || text=\"Oops, this page cannot be found\""
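
The script stays reusable across accounts because account-specific values are written as `${WORKSPACE_NAME}` and `${OWNER_NAME}` placeholders. How agent-device resolves them is not covered here. Purely to illustrate the idea, Python's standard `string.Template` happens to use the same `${NAME}` syntax; `render_step` is a hypothetical helper and the values in the usage note are invented.

```python
from string import Template
from typing import Mapping


def render_step(step: str, variables: Mapping[str, str]) -> str:
    """Fill ${NAME} placeholders in one script line.

    Template.substitute raises KeyError when a placeholder has no value,
    so a missing variable fails before anything is replayed on a device.
    """
    return Template(step).substitute(variables)
```

For example, rendering `press "label=\"${WORKSPACE_NAME}\""` with `{"WORKSPACE_NAME": "Acme"}` yields `press "label=\"Acme\""`, while omitting the variable raises an error instead of pressing a literal `${WORKSPACE_NAME}` label.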

From there, the cycle is quick. The agent replays, watches, and reports. The developer reads the result and decides what to try next. This is cooperation rather than handoff: the developer brings the judgment, and the agent brings the hands.

Closing the loop: after evidence

With the fix in place, one more run of agent-device-evidence produces the "after" videos. It reuses the prepared .ad script and replays it on the platforms you have set up. For this issue, we captured the iOS before/after pair, and the same flow runs on the other configured targets without re-authoring. The output is a run folder with MP4s, stills, and a manifest ready to attach to the PR.

How we used agent-device to measure Sentry performance spans

Performance work on mobile has a specific, frustrating rhythm. You suspect something is slow, run it manually a few times, average the numbers by hand, switch branches, and repeat. That process can take longer than the fix itself, and one bad run can distort the result.

We ran agent-device against the Expensify app, a large, open-source React Native codebase with real Sentry instrumentation. That gave us a concrete workflow to improve rather than a synthetic benchmark.

Before this, measuring a Sentry span meant running the interaction a few times manually, writing down the numbers, averaging them by hand, and repeating it on another branch to compare. With agent-device, the agent handles the measurement loop as part of normal investigation and optimization work.

The setup

Sentry spans are how the Expensify team tracks how long interactions take. Each time a tracked interaction completes, the app emits a project-specific log line:

[Sentry][<SpanName>] Ending span (<N>ms)

That format is what the agent watches. When a span fires, the duration lands in the console in a consistent shape. The agent grabs it, collects it across runs, and produces a summary.
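
Extracting durations from that log shape takes one regular expression. The sketch below assumes only the format shown above; `collect_spans` is a hypothetical helper, and any span names you feed it in an example are placeholders rather than real Expensify spans.

```python
import re
from collections import defaultdict
from typing import Dict, List

# Matches: [Sentry][<SpanName>] Ending span (<N>ms)
SPAN_RE = re.compile(
    r"\[Sentry\]\[(?P<name>[^\]]+)\] Ending span \((?P<ms>\d+(?:\.\d+)?)ms\)"
)


def collect_spans(log_text: str) -> Dict[str, List[float]]:
    """Group span durations (in ms) by span name, in the order they were logged."""
    durations = defaultdict(list)
    for match in SPAN_RE.finditer(log_text):
        durations[match.group("name")].append(float(match.group("ms")))
    return dict(durations)
```

Because the pattern is searched anywhere in the text, timestamps or log-level prefixes in front of `[Sentry]` do not get in the way.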

Before running any measurements, we prepared .ad flow files for the most important Sentry-instrumented interactions. The agent generates these files when instructed, and you can modify them to fit your needs. Each flow is tagged with the span name, so the agent can find the right one without guessing. That is what keeps the agent on track: it does not explore the app; it follows the prepared file.

The skill

We built a skill with specific rules for how to measure telemetry spans. The agent runs one warm-up replay, then a configured number of measured replays, then outputs a structured summary.

In our session, the first of five measured runs came in at 638ms, a clear cold-cache outlier despite the warm-up run. The remaining four were tight at 85–88ms, with a median of 87ms. That kind of distribution is exactly why you need multiple runs. A single measurement would have told us nothing, or worse, something wrong.
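
To make the point about distributions concrete, here is a small sketch of the summary step. It is not the skill's implementation, and `summarize` is a hypothetical name. The sample values in the usage note are illustrative: they were chosen to match the shape reported above (one 638ms outlier, four runs between 85 and 88ms, median 87ms), not copied from the actual run.

```python
from statistics import mean, median
from typing import Dict, Sequence


def summarize(durations_ms: Sequence[float]) -> Dict[str, float]:
    """Summarize measured replays; the median resists a single outlier."""
    if not durations_ms:
        raise ValueError("need at least one measured run")
    return {
        "runs": len(durations_ms),
        "min": min(durations_ms),
        "median": median(durations_ms),
        "mean": mean(durations_ms),
        "max": max(durations_ms),
    }
```

With the illustrative input `[638, 85, 87, 87, 88]`, the median stays at 87 while the mean is pulled up to 197. That gap is why the summary should report the median and the full range rather than an average.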

Comparing branches is still manual: check out each revision, run the measurement, and compare the two summaries. That is fine. The hard part was never the comparison. It was getting a trustworthy number from each branch in the first place.
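
Once each branch has its own list of measured runs, the comparison itself is simple arithmetic. A minimal sketch, with `compare_medians` as a hypothetical helper and made-up numbers in the usage note:

```python
from statistics import median
from typing import Dict, Sequence


def compare_medians(baseline_ms: Sequence[float],
                    candidate_ms: Sequence[float]) -> Dict[str, float]:
    """Compare two branches by median duration; negative delta means faster."""
    base = median(baseline_ms)
    cand = median(candidate_ms)
    delta = cand - base
    return {
        "baseline_median": base,
        "candidate_median": cand,
        "delta_ms": delta,
        "delta_pct": 100.0 * delta / base,
    }
```

For hypothetical runs of `[90, 100, 110]` on the base branch and `[70, 80, 90]` on the candidate, the medians are 100ms and 80ms, a change of -20ms, or -20%.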

The result

Four consistent runs landed at 85–88ms. The first came in at 638ms, a cold-cache outlier that slipped past the warm-up. We kept it in.

That is what this workflow gives you: enough runs to see the shape of the data, not just a single number that might lie. For any future change touching this interaction, we now have a baseline to compare against. The span does not just "feel fast." The median run is 87ms, and we can prove it.

Profiling with a single prompt

Telemetry spans tell you how long something took. They do not tell you why.

Performance debugging on a development app is rarely clean. You open a simulator, navigate manually, start the profiler at exactly the right moment, read the flame graph, and try to remember what you tapped and in what order. Then you do it again when you want to verify. Each pass depends on you executing the same sequence the same way, which you almost never do.

That is the part agent-device changes. We can turn that manual flow into a prompt like this, run against the iOS app in development mode:

navigate to the Inbox, tap into a chat, and capture what the React profiler sees. No guessing; every finding backed by actual render data.

agent-device has a react-devtools integration that lets the agent start and stop the React profiler programmatically, mid-session. The agent executes the full journey autonomously: opens the app, triggers profiling at the right moment, walks through both screens, and returns a structured summary.

What the agent actually returned

This is the part worth being concrete about.

The output was not a generic list of "common React Native performance issues." It was specific to the Expensify Inbox and chat screens, on that device, during that session. Components were named. Render counts were cited. The agent flagged what was expensive and why, not what could be expensive in theory.

That is the real gain. You save time on the first run, and you can run the same prompt after a fix to get a direct comparison. Run it. Change something. Run it again. The numbers either move or they do not.

Should you try it?

Bug evidence, span measurement, and render profiling all follow the same pattern: give your agent hands and eyes for one chore your team already repeats. Install agent-device, point it at that chore, and wrap the workflow in a skill. The docs and official skills give you the starting point. The first time you paste an issue URL instead of booting a simulator, you will know whether it earns a place in your workflow.
