Most AI observability tools trace one model call at a time: a prompt, a response, a span. But an agent failure isn't one call; it's a run that looped, stalled, or burned tokens across dozens of steps and sub-agents. Fleet observability watches the whole agent run, not the API span. That's the seam yoru and wrai.th cover.
The observability category grew up around LLM apps, so its unit is the model call: one prompt, one response, one span. That unit is a poor fit for agents. When a coding agent goes wrong, it rarely fails on a single call. The failure is a run that loops on the same file, stalls waiting on a tool, or quietly burns tokens across dozens of steps and a fan-out of sub-agents. Read that run as a pile of separate spans and the actual failure is the one thing you can't see: the shape of the trajectory.
Fleet observability moves the unit up to the agent run. The question changes from “which call was slow” to “which agent diverged, where in the run, and what did it cost” — and sub-agents read as one picture, not unlinked traces. It's the seam the call-level tools weren't built for.
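
To make the change of unit concrete, here is a minimal sketch of rolling call-level records up into one run, sub-agents included. The `Call` record and its fields (`run_id`, `parent_run_id`, `tokens`) are hypothetical names chosen for illustration; they are not the schema of yoru, wrai.th, or any tracer named on this page. The sketch also assumes parent links never form a cycle.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Call:
    """One model call, as a call-level tracer might record it (hypothetical schema)."""
    run_id: str                   # the agent or sub-agent that made the call
    parent_run_id: Optional[str]  # None for the top-level agent
    tokens: int


def root_of(run_id: str, parents: dict[str, Optional[str]]) -> str:
    """Follow parent links up to the top-level run (assumes no cycles)."""
    while parents.get(run_id) is not None:
        run_id = parents[run_id]
    return run_id


def tokens_per_fleet_run(calls: list[Call]) -> dict[str, int]:
    """Sum tokens per top-level run, folding every sub-agent into its root."""
    parents = {c.run_id: c.parent_run_id for c in calls}
    totals: dict[str, int] = defaultdict(int)
    for c in calls:
        totals[root_of(c.run_id, parents)] += c.tokens
    return dict(totals)


calls = [
    Call("task-1", None, 1200),
    Call("sub-a", "task-1", 900),
    Call("sub-b", "sub-a", 2500),  # a sub-agent of a sub-agent
    Call("task-2", None, 400),
]
print(tokens_per_fleet_run(calls))  # {'task-1': 4600, 'task-2': 400}
```

Viewed per call, the largest single record here is 2,500 tokens from `sub-b`. Viewed per run, `task-1` cost 4,600 tokens across three agents, which is the figure the run-level question needs.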
| Dimension | LLM call tracers | Fleet / trajectory-native |
|---|---|---|
| Unit of analysis | One model call — a prompt/response span | One agent run: a task, its multi-step trajectory, and its sub-agents |
| Question it answers best | Which prompt was slow, expensive, or low-quality | Which agent stalled, looped, or diverged — and where in the run |
| Multi-agent / sub-agent fan-out | Flattened into separate, unlinked traces | First-class: the fleet and its sub-agents read as one picture |
| Where token waste shows up | Per call, so waste that builds up over a run never registers as waste | Across the whole run, where the waste actually accumulates |
| Built for | LLM apps and RAG pipelines | Fleets of coding agents working in production |
| Examples | Langfuse, LangSmith, Braintrust, Arize Phoenix | yoru (observability) + wrai.th (fleet control) |
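
The "looped" case in the table is a signal that only exists at run level, because no single call in a loop looks wrong on its own. A rough sketch, assuming each step in a run has been reduced to a hypothetical `(tool, target)` pair and that `max_repeats` is a threshold you would tune yourself; it illustrates the idea and is not a description of how yoru flags events:

```python
from collections import Counter

# (tool, target), e.g. ("edit", "src/app.py"). Hypothetical shape for illustration.
Step = tuple[str, str]


def looping_steps(trajectory: list[Step], max_repeats: int = 3) -> list[Step]:
    """Return the steps that recur more than max_repeats times within one run."""
    counts = Counter(trajectory)
    return [step for step, n in counts.items() if n > max_repeats]


run = [
    ("read", "src/app.py"),
    ("edit", "src/app.py"),
    ("test", "tests/"),
    ("edit", "src/app.py"),
    ("test", "tests/"),
    ("edit", "src/app.py"),
    ("test", "tests/"),
    ("edit", "src/app.py"),
]
print(looping_steps(run))  # [('edit', 'src/app.py')]
```

A plain repeat count is the simplest possible heuristic. It ignores ordering and legitimate retries, so treat it as a starting point for thinking about trajectory shape, not as a detector.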
If you run one LLM app or a RAG pipeline, span-level tracing is mature, well supported, and exactly what you want — Langfuse, LangSmith, Braintrust, and Arize Phoenix are good at it, and we'd point you there. The call is the right unit when the thing you ship is a call. Fleet observability earns its place the moment you run several agents and sub-agents across multi-step tasks and need to see the run, not the request. Many teams end up wanting both.
yoru is the observability layer: an open-source audit trail for autonomous coding agents, with a server, a dashboard, and a CLI that streams every tool call, file edit, and red-flag event from the run. wrai.th is the fleet-control layer that gives the run its shape — the orchestration the observability reads against. Both are open-source and built by tsukumo, because we run our own agent fleets in production and needed to see them.