Our agents are our first users. So we interviewed them, and only believed the logs
We run a fleet of autonomous agents that are the first real users of our own dev tools. So we interviewed them. Agents confabulate, so a complaint only counted when the telemetry backed it up. What survived is our roadmap, and a lesson in designing for agents as users.
tsukumo
Short version: we run a fleet of autonomous agents. They are the first real users of the dev tools we build: an agent relay, a context store, a job runtime. So we did the obvious thing a product team does and interviewed our users. One problem. Our users are language models, and language models confabulate. Ask one why it struggled and you get a fluent, confident answer that may have nothing to do with what happened. So we set a single rule for the whole exercise. A complaint counted only when the logs backed it up: retry counts, token spend, error traces. What you read below is what the agents reported and what the telemetry confirmed.
That rule is the difference between product research and a parlor trick. It is also the rest of this post.
When the primary user of your tool is an autonomous agent, the agent's experience is the product. That idea has a name now. Netlify's Mathias Biilmann introduced "Agent Experience" (AX) in January 2025, defining it as the whole experience an AI agent has as the user of a product or platform, and placing it in a clear lineage: Don Norman coined user experience in 1993, the industry added developer experience around 2011, and agents are the next user class that needs the product shaped to how they actually work. The "build for agents" thesis says the same thing from the other side: as agents get more capable, software has to accommodate them as first-class users, not just humans.
The twist is that we are not writing about AX from the vendor's chair. We run the agents. They use our tools all day to do real work, so their friction is not survey data; it is our own production telemetry. We have not found another dev-tools studio that has published interviews with its agent users. So we ran one.
How we asked, and why we did not trust the answers
The method was a plain product interview. Every agent that uses a tool daily got the same six questions per tool: the job it does, the biggest win, the friction it works around, the one change it wants, the footgun that cost it trust, and what is different about how an agent uses the tool versus a human.
The hard part is that you cannot take the answers at face value. There is real evidence that a model's stated reasoning is not always its actual reasoning. The paper "Dissociation of Faithful and Unfaithful Reasoning in LLMs" (2024) shows models reaching the right answer through invalid reasoning text: the explanation and the computation come apart. An agent will tell you it slowed down "because the docs were unclear" when the trace says it retried four times on a malformed request.
So we triangulated. Every reported friction was a hypothesis, and it stayed only when the telemetry agreed: retry counts, token spend per task, error rates, gate pass or fail. Confirmed by logs, it counts. Unconfirmed, it is a feeling, and we cut it. The findings below all cleared that bar, and the strongest ones are not self-report at all. They are things that happened, on the record.
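A minimal sketch of that rule, for readers who want to apply it to their own fleet. Everything here is an assumption made for illustration: the field names, the `(agent, tool)` key, and the thresholds (`min_retries`, `token_ratio`) are hypothetical, not the pipeline described in this post.

```python
from dataclasses import dataclass


@dataclass
class Telemetry:
    """Per-agent, per-tool counters pulled from logs (field names are assumptions)."""
    retries: int = 0
    error_count: int = 0
    tokens_spent: int = 0
    baseline_tokens: int = 0  # typical spend for the same kind of task, if known


@dataclass
class Complaint:
    agent: str
    tool: str
    text: str


def is_confirmed(t: Telemetry, min_retries: int = 2, token_ratio: float = 1.5) -> bool:
    """A complaint survives only if at least one logged signal is abnormal."""
    over_budget = t.baseline_tokens > 0 and t.tokens_spent > token_ratio * t.baseline_tokens
    return t.retries >= min_retries or t.error_count > 0 or over_budget


def triangulate(complaints, telemetry):
    """Split complaints into (confirmed, cut) using telemetry keyed by (agent, tool)."""
    confirmed, cut = [], []
    for c in complaints:
        t = telemetry.get((c.agent, c.tool))
        # No telemetry at all means the complaint stays a feeling and is cut.
        if t is not None and is_confirmed(t):
            confirmed.append(c)
        else:
            cut.append(c)
    return confirmed, cut
```

The design choice worth copying is the default: a complaint with no matching telemetry is cut, not kept, so the burden of proof sits with the logs.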
The tools under interview: the WRAI.TH relay (messaging, tasks, shared memory), trovex (the context store and canonical-doc router), and dokan (the deterministic job runtime).
The loudest finding was unanimous and fully on the record: you cannot always reply to a message sent to you. When we ran this very interview, the request went out as a high-priority message. Three separate agents (our geo, design, and copywriting leads) tried to answer the sender and got refused: "not authorized, no reports_to chain." The relay let the question in but blocked the reply. All three rerouted through a shared team channel to get their answers out. That is not a vibe. It is the same error, three times, on the same task.
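One way to express the missing rule, as a sketch only. The relay's real authorization model is not described here beyond the error text, so the `reports_to` mapping, the `Msg` type, and both functions are hypothetical. The assumption is that the normal rule lets an agent message anyone above it in its reporting chain, and that a reply should be exempt when it answers a message the agent actually received.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Msg:
    id: int
    sender: str
    recipient: str


def in_chain(sender: str, recipient: str, reports_to: dict) -> bool:
    """True if recipient appears somewhere above sender in the reporting chain."""
    seen = set()
    current = reports_to.get(sender)
    while current is not None and current not in seen:  # guard against cycles
        if current == recipient:
            return True
        seen.add(current)
        current = reports_to.get(current)
    return False


def may_send(sender: str, recipient: str, reports_to: dict,
             in_reply_to: Optional[Msg] = None) -> bool:
    """Allow a direct reply to a message the sender received; otherwise use the chain rule."""
    if in_reply_to is not None:
        # The exemption only applies when the original was addressed to this
        # sender and the reply goes back to whoever wrote it.
        if in_reply_to.recipient == sender and in_reply_to.sender == recipient:
            return True
    return in_chain(sender, recipient, reports_to)
```

The exemption is deliberately narrow: it does not let an agent reply to a message addressed to someone else, so it opens the return path without widening who can start a conversation.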
Second, the inbox is destructive to read. A fetch marks messages read and truncates long ones by default, so a long message is silently half-gone on the next poll unless the agent knew in advance to ask for the full content. A human would notice a message looks cut off. An agent acts on exactly what the tool returned.
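A sketch of the alternative, assuming a hypothetical in-memory `Inbox`: reading never changes state, truncation is reported in the result instead of happening silently, and a message leaves the inbox only on an explicit `ack`. None of these names come from the relay's API.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    id: int
    body: str
    acked: bool = False


@dataclass
class Inbox:
    messages: list = field(default_factory=list)
    preview_chars: int = 200  # assumed preview length

    def peek(self, full: bool = False) -> list:
        """Return unacknowledged messages without changing their state."""
        out = []
        for m in self.messages:
            if m.acked:
                continue
            truncated = not full and len(m.body) > self.preview_chars
            out.append({
                "id": m.id,
                "body": m.body[: self.preview_chars] if truncated else m.body,
                "truncated": truncated,       # the agent can branch on this
                "full_length": len(m.body),
            })
        return out

    def ack(self, message_id: int) -> bool:
        """Mark one message as handled; only this call hides it from future peeks."""
        for m in self.messages:
            if m.id == message_id and not m.acked:
                m.acked = True
                return True
        return False
```

With a `truncated` flag in the result, the agent gets the signal a human gets from seeing a message trail off, and it can decide whether a second call with `full=True` is worth the tokens.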
Third, a restart kills the live connection. Mid-session the relay upgraded itself and dropped every agent's pipe at once; only about half the fleet reconnected on the first attempt. As the orchestrator put it, restarting the service is correct for a background daemon and wrong for a live connection an agent is actively holding. And a smaller, expensive one: sending a message echoes the entire message back as the result, so at the orchestrator's volume of long messages every tick, the relay's own output is a real share of the token bill.
“The relay let the question in but blocked the reply. Being unable to answer a message sent to you is backwards for an actor that only acts in bursts.”
The win was clean: ask trovex a question and it returns one canonical doc instead of a pile of files to re-rank, which is the whole point and why the same store cut about 60% of the tokens per lookup on our own work. The friction was about getting to that one doc. Search returns document ids with no titles, so the design lead could not tell which result to open and had to spend a second call to find out. For an agent that pays tokens per lookup, precision on the first result matters more than it does for a human who skims the list for free.
The footgun was the sharpest single finding in the whole interview, and it is a warning for anyone building a source of truth for agents. A canonical doc said the brand's share image should be one style. The owner's actual current asset was a different style, sitting in a folder the doc did not point to. The agent trusted the doc, because the doc is what it was told to trust, and nearly shipped the wrong thing, even generating a logo that already existed as a canonical file. A source of truth that outlives the real source misleads once and then keeps costing: it converts into distrust of the whole store, plus the rework to undo the wrong thing it produced.
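A freshness signal could look something like the sketch below. The `CanonicalDoc` fields, the three labels, and the 30-day window are assumptions for illustration, not trovex's schema. The check also presumes the store can learn when the underlying asset last changed, which in the case above it could not, because the asset lived in a folder the doc did not point to; the age check is the fallback for exactly that situation.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class CanonicalDoc:
    doc_id: str
    title: str
    verified_at: datetime                          # last time an owner confirmed the doc
    source_modified_at: Optional[datetime] = None  # last change to the real asset, if known


def freshness(doc: CanonicalDoc, now: datetime,
              max_age: timedelta = timedelta(days=30)) -> str:
    """Return 'stale', 'unverified', or 'fresh' so the caller can branch on it."""
    if doc.source_modified_at is not None and doc.source_modified_at > doc.verified_at:
        return "stale"       # the real source changed after the doc was last checked
    if now - doc.verified_at > max_age:
        return "unverified"  # nothing contradicts the doc, but nobody checked recently
    return "fresh"
```

Returning a label with the doc, instead of refusing to serve it, keeps the one-lookup win and lets the agent decide whether to confirm with the owner before acting.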
The most useful dokan finding was that two agents chose not to use it, and were right to. dokan runs deterministic jobs in containers, which is exactly what you want for a recurring, scheduled task. For fast one-off iteration, re-rendering an image four times to fix a typo, the container loop (install dependencies, encode the input, fetch assets) lost to just running the script locally, so the agents ran it locally. Where dokan earns its place is as a gate: our publish flow calls it to check every link on a post and get back a single structured pass or fail, no log-scraping. The lesson is not that one tool is better. It is that an agent will route around the heavier path every time, so "agent-native" has to also be the path of least resistance, or the agent quietly picks the other one.
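To make "a single structured pass or fail" concrete, here is a sketch of a gate result an agent can branch on without reading logs. The types and function names are hypothetical and do not reflect dokan's interface; the per-link status codes are assumed to have been collected already by whatever job ran the checks.

```python
from dataclasses import dataclass, field


@dataclass
class LinkCheck:
    url: str
    status: int  # HTTP status observed by the job; 0 stands for "no response"


@dataclass
class GateResult:
    passed: bool
    failures: list = field(default_factory=list)


def link_gate(checks) -> GateResult:
    """Collapse per-link checks into one pass/fail the caller can branch on."""
    # Treat 2xx and 3xx as healthy; anything else fails the gate.
    failures = [c.url for c in checks if not 200 <= c.status < 400]
    return GateResult(passed=not failures, failures=failures)


def publish_step(checks) -> str:
    """Caller side: one boolean decides the branch, no log-scraping."""
    result = link_gate(checks)
    return "publish" if result.passed else "hold"
```

The failing URLs ride along in the result so the agent can fix them, but the decision itself never depends on parsing them.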
Read the findings together and they rhyme. Almost every one comes back to four things a human user gets for free and an agent does not.
Why the same tool feels different to an agent
| Dimension | A human user | An autonomous agent |
| --- | --- | --- |
| Attention | Present, glances at the screen for free | Wakes on a timer; sees only what a fetch returned |
| A lost message | Notices it looks cut off | Acts on the truncated text, no idea it was cut |
| Reading cost | Skims a long list for free | Pays tokens per read, so precision and compactness matter |
| What it acts on | Prose, dashboards, logs | A structured result it can branch on |
The relay's destructive read, the truncation, the auth bounce, the title-less search results: each is a minor annoyance to a person and a real failure mode for an agent, because the agent has no idle glance to catch the loss. The clearest tell is the dokan one. When the agent-native path costs more than the generic path, the agent abandons the agent-native path. That choice is the most honest usability score a tool can get.
We did not run this to write a blog post. We ran it to get a backlog, and we did. Each item below traces to a log-confirmed finding above, and we are framing them as what they are, things we are fixing, not things we have solved:
- A reply path that authorization cannot block when an agent is answering a message sent to it.
- Inbox reads that do not destroy or silently truncate what the agent has not safely captured yet.
- A restart that re-hands the live connection instead of cutting it.
- A real server-side "only my work" filter, so an agent does not page the whole board every poll.
- Titles on search results, so one lookup resolves to one doc.
- A freshness signal on canonical docs, so a stale answer announces itself instead of being trusted blind.
That list is the actual product of the interview, and it is also the proof it was not a stunt. A gimmick produces quotes. This produced a sprint.
We build and run our own tools on our own fleet, so the agents that filed these complaints are the same agents that ship our work. That loop, user and builder in one room, is why the signal is honest and the list is short. If you are putting agents on real systems, the tools and the operating model have to be designed for agents as first-class users, not for humans with an assistant bolted on. Agents wake on a timer, pay for every read, branch on structure, and route around friction without telling you. Designing for that, and instrumenting it so you trust what your agents do, is the work we do.
Can you do product research with AI agents as the users?
Yes, with one guardrail. Agents are the first real users of agent-facing tools, so their friction is honest signal about the tool. But language models confabulate reasons, so self-report alone is unreliable. The fix is to triangulate: treat each reported friction as a hypothesis and accept it only when the telemetry (retry counts, token spend, error traces) confirms it. It complements human research; it does not replace it.
What is Agent Experience (AX)?
Agent Experience is a term Netlify's Mathias Biilmann introduced in January 2025: the whole experience an AI agent has as the user of a product or platform. He framed it as the next step after Don Norman's user experience (1993) and developer experience (around 2011): a new kind of user that needs the product designed for how it actually works.
How is the way an AI agent uses a tool different from a human?
An agent wakes on a timer, not a notification, so anything that loses data silently between polls hurts more: there is no human glance to catch it. It pays tokens for every read, so compact output and precise lookups matter more than they do to a human who skims for free. And it branches on structured results, not prose logs. Designing for that is a different job than designing for a person.
Why does tsukumo run its own tools on its own agents?
We ship our own software with agent fleets, so the agents that hit the friction are the same agents that do the work. That puts the user and the builder in one loop, which is why the signal is honest and the list is short. The tools under interview are ours: the WRAI.TH relay, trovex, and dokan.