We benchmarked the token cost of rereading docs on our own repo
We ran 20 real queries against our own 145-doc repo with a local embedder and published the script. Resolving a question to one canonical doc instead of rereading the top candidate files saved about 60% of tokens per lookup. Here is the number, and the one command to reproduce it.
tsukumo
Short version: we got tired of quoting a representative number, so we wrote a benchmark and ran it on our own repo. The finding: when an agent resolves a question to one canonical doc instead of rereading the top candidate files to work out which is current, it spends about 60% fewer tokens per lookup. Twenty real queries, 145 docs, a local embedder, one command. The script is in the repo. The proof is that you can run it yourself.
That post explains why the waste exists. This one measures it.
We took the obvious next step. We pointed a benchmark at our own repo, counted the tokens an agent burns triaging candidates, counted the tokens it would spend reading one canonical pointer instead, and looked at the gap.
~60% fewer tokens per doc lookup
Conservative headline we put our name on, measured on the trovex repo. Source: trovex benchmarks/token-savings
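That 60% is the conservative end of what we measured in aggregate, not the average and not a per-query minimum. The averages are higher, and individual queries land on both sides of it. We lead with the conservative figure because it is the one we are willing to defend without a footnote.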
We built the benchmark around the same code path that powers our live /savings dashboard, so the number you reproduce is the number the product reports.
Here is a slice of the per-query rows, including the one that saved the least:
Per-query tokens: triaging candidates vs resolving to one canonical doc
| Query | Would read | With trovex | Saved | Reduction |
| --- | --- | --- | --- | --- |
| quickstart getting started guide | 6,609 | 1,317 | 5,228 | 79.1% |
| how is the token savings number calculated | 3,813 | 1,074 | 2,673 | 70.1% |
| configure the embedding model | 4,199 | 1,577 | 2,552 | ~61% |
| how do I install and set up trovex | 6,823 | 3,651 | 3,102 | 45.5% |
Look at the last row. The install query saved 45.5%, well under the headline. The candidate docs there overlap heavily and the canonical answer is itself long, so resolving to one doc helps less. That is the honest shape of the data. A question with one obvious home saves a little. A question buried in a thicket of half-stale duplicates saves a lot. The 60% is what falls out across a mix of both.
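The Reduction column can be rechecked from the other columns. The sketch below is not the benchmark script. It assumes reduction is Saved divided by Would read, and it copies the four rows shown above. It also prints how far Saved falls short of the plain difference between the two read columns. That gap is 64 to 70 tokens on every row, which may be the pointer read, though that interpretation is an assumption rather than something stated here.

```python
# Rows copied from the per-query figures above:
# (query, would_read, with_trovex, saved), all in tokens.
ROWS = [
    ("quickstart getting started guide", 6609, 1317, 5228),
    ("how is the token savings number calculated", 3813, 1074, 2673),
    ("configure the embedding model", 4199, 1577, 2552),
    ("how do I install and set up trovex", 6823, 3651, 3102),
]


def reduction_pct(would_read: int, saved: int) -> float:
    """Assumed formula: share of would-have-read tokens that were avoided."""
    return 100.0 * saved / would_read


def unexplained_gap(would_read: int, with_trovex: int, saved: int) -> int:
    """Tokens by which Saved is smaller than would_read - with_trovex."""
    return would_read - with_trovex - saved


for query, would_read, with_trovex, saved in ROWS:
    pct = reduction_pct(would_read, saved)
    gap = unexplained_gap(would_read, with_trovex, saved)
    print(f"{pct:5.1f}%  gap={gap:3d}  {query}")

# The row with the smallest reduction, i.e. the weakest published query.
weakest = min(ROWS, key=lambda row: reduction_pct(row[1], row[3]))
print("weakest:", weakest[0])
```

Run as written, the weakest row is the install query, which matches the paragraph above.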
Yes, you can reproduce it. That is the deliverable, more than the number.
bash
python benchmarks/token-savings/run.py # this repo, local embedder
python benchmarks/token-savings/run.py --repo /path # any repo of .md docs
python benchmarks/token-savings/run.py --json out.json # machine-readable
One command on this repo, --repo for any folder of markdown, --json for machine-readable output
The first run downloads the roughly 90MB embedding model once. After that there is no API key, no cloud call, no state left on disk. The run is deterministic, so two runs on the same docs give the same answer. Point it at your own folder of markdown and you get your real reduction on your real docs, in about a minute.
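One way to check the determinism claim on your own docs is to run the script twice with --json and compare the two files. The helper below is a sketch. It assumes the JSON output carries no timestamps or other run-specific fields, which the text above does not confirm, and the file names run_a.json and run_b.json are placeholders.

```python
import json

# Produce the two files first, for example:
#   python benchmarks/token-savings/run.py --json run_a.json
#   python benchmarks/token-savings/run.py --json run_b.json


def same_results(path_a: str, path_b: str) -> bool:
    """Return True when both JSON files parse to equal values.

    Comparing parsed values rather than raw bytes ignores whitespace
    and object key order, which do not change the result.
    """
    with open(path_a, encoding="utf-8") as file_a:
        data_a = json.load(file_a)
    with open(path_b, encoding="utf-8") as file_b:
        data_b = json.load(file_b)
    return data_a == data_b
```

If the comparison fails on unchanged docs, look for a run-specific field in the output before concluding the numbers moved.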
If your number comes back small, your repo does not have a rereading problem and you should keep your setup. If it comes back large, you were paying that tax on every session, every agent, every teammate.
Yes, you can measure savings on your own agent's reads, and that is the number to trust over ours. We compute savings two ways. The per-query estimate above models the avoided work: tokens of the top three candidates minus the one canonical doc. The second way reads what your agent actually did.
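The per-query estimate can be sketched in a few lines. This illustrates the model as described, not the code behind Searcher.savings_estimate. The function name, the pointer_tokens parameter, and the example counts are all hypothetical.

```python
from typing import Dict, Sequence


def per_query_estimate(
    ranked_candidate_tokens: Sequence[int],
    canonical_tokens: int,
    top_k: int = 3,
    pointer_tokens: int = 0,
) -> Dict[str, float]:
    """Model the avoided work for one query.

    ranked_candidate_tokens: token counts of candidate docs, best match first.
    canonical_tokens: token count of the one doc the query resolves to.
    pointer_tokens: hypothetical cost of reading the pointer line.
    """
    # What the agent would have read while triaging: the top_k candidates.
    would_read = sum(ranked_candidate_tokens[:top_k])
    # What it reads instead: the pointer plus the canonical doc.
    actual = canonical_tokens + pointer_tokens
    # Clamp at zero so a long canonical doc never reports negative savings.
    saved = max(would_read - actual, 0)
    reduction = saved / would_read if would_read else 0.0
    return {
        "would_read": would_read,
        "actual": actual,
        "saved": saved,
        "reduction": reduction,
    }


# Made-up counts for illustration only; these are not benchmark results.
example = per_query_estimate([1200, 900, 700, 650], canonical_tokens=900, pointer_tokens=50)
print(example)
```

The fixed top_k of three is the assumption this estimate rests on. An agent that would have read fewer candidates saves less than the model reports.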
When trovex routes an agent's doc reads, a hook logs each .md read to ~/.claude/trovex-baseline.jsonl. Then trovex measure compares two windows of that log, a baseline from before you routed through trovex and a current window after, and reports the reduction in tokens spent per .md read between them.
bash
trovex measure # last 7 days vs the prior 7
trovex measure --baseline-days 14 --current-days 14
Before/after on your agent's own logged reads, not a synthetic benchmark
This is the honest one. Unlike the per-query estimate, which assumes you would have read three candidates, it assumes nothing about your baseline behavior. It reads your real reads, before and after, and reports the actual change. If you run only one number on your own setup, run this one.
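The two-window comparison can be sketched as follows. This is not the trovex measure implementation, and the log schema is not documented here, so the ts (ISO 8601 timestamp) and tokens fields are hypothetical, as are the function names. The sketch assumes the metric is mean tokens per logged .md read in each window.

```python
import json
from datetime import datetime, timedelta, timezone
from typing import Iterable, List, Optional, Tuple

Record = Tuple[datetime, int]


def parse_log(lines: Iterable[str]) -> List[Record]:
    """Parse JSONL lines with hypothetical 'ts' and 'tokens' fields."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        row = json.loads(line)
        records.append((datetime.fromisoformat(row["ts"]), int(row["tokens"])))
    return records


def tokens_per_read(records: List[Record], start: datetime, end: datetime) -> Optional[float]:
    """Mean tokens per read with start <= ts < end, or None if the window is empty."""
    counts = [tokens for ts, tokens in records if start <= ts < end]
    return sum(counts) / len(counts) if counts else None


def window_reduction(
    records: List[Record],
    now: datetime,
    baseline_days: int = 7,
    current_days: int = 7,
) -> Optional[float]:
    """Fractional drop in tokens per read from the baseline window to the current one."""
    current_start = now - timedelta(days=current_days)
    baseline_start = current_start - timedelta(days=baseline_days)
    baseline = tokens_per_read(records, baseline_start, current_start)
    current = tokens_per_read(records, current_start, now)
    if baseline is None or current is None or baseline == 0:
        return None  # not enough data to compare
    return (baseline - current) / baseline


# Made-up log lines for illustration only.
sample = [
    '{"ts": "2024-05-02T10:00:00+00:00", "tokens": 4000}',
    '{"ts": "2024-05-05T10:00:00+00:00", "tokens": 5000}',
    '{"ts": "2024-05-10T10:00:00+00:00", "tokens": 1500}',
    '{"ts": "2024-05-13T10:00:00+00:00", "tokens": 2100}',
]
now = datetime(2024, 5, 15, tzinfo=timezone.utc)
result = window_reduction(parse_log(sample), now)
print(result)
```

Returning None for an empty window is a deliberate choice in this sketch: a window with no logged reads says nothing about savings, so it should not be reported as zero.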
Why does resolving to one canonical doc save tokens?
Because the expensive part was never the answer. It was the guessing.
Without a canonical layer, the agent reads three files to find the one that is current and discards two. With one, it reads a pointer line and the single doc that answers. The discarded reads vanish, and so does the quieter cost of a polluted context window nudging the model toward a stale answer. This is the canonical-doc layer the agent-memory taxonomies keep missing, and the benchmark is just the arithmetic underneath it.
“The waste was never the answer. It was the agent rereading two files to work out which third file to trust.”
— the finding
One detail matters for whether you can trust the headline. The savings ratio is governed by token volume, not by which embedder ranks the candidates. A different embedder shuffles the ranking, but the count of tokens read versus tokens saved barely moves. So the 60% is stable across embedders even though we ran bge-small. That property is documented in the script itself.
No, you are not guaranteed to hit 60% on your repo, and any benchmark that promised otherwise would be lying to you.
We measured 60% on the trovex repo. A repo with little overlapping markdown will save less, because there was less rereading to remove. A repo thick with duplicates and stale runbooks will save more. The benchmark is not a guarantee you will hit our number. It is a tool to find yours.
“The /savings surface reports would-have-read versus actual tokens on doc lookups using Searcher.savings_estimate, the same code path the benchmark script runs offline.”
We did not want to be one more vendor quoting a tidy percentage with no way to check it. So we shipped the script before we shipped the claim. The number we put our name on, 60%, is the conservative floor of what we measured on our own repo, run through the same code as our dashboard, with a command you can run on your own docs tonight.
The discipline is simple. Measure on first-party data, publish the weak query alongside the strong one, hand you the tool to falsify the result. That is how we work. If you are rolling agents out across a fleet and the token bill is becoming a line item, that is the layer we fit and the number we help you keep honest.
We map where your agents reread instead of resolve, run the benchmark on your repo, and leave you with a figure you can reproduce without us.
About 60% fewer tokens per doc lookup, measured on our own repo. Across 20 real queries against 145 docs we saved 59,668 of 89,585 tokens: a pooled 66.6%, a median of 71.1% per query, a mean of 67.1%. We publish 60% because it is the conservative end of what we measured. Per-query savings ranged from 45.5% to 79.1%. Your repo will land somewhere of its own.
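The pooled, mean, and median figures differ because they weight queries differently. The sketch below assumes the usual definitions: pooled is total saved over total would-read, while mean and median are taken over per-query ratios. It uses only the four rows published earlier, so its aggregates will not match the 20-query figures. The one full-run check it can make is the pooled percentage, from the two totals quoted above.

```python
from statistics import mean, median

# (would_read, saved) in tokens, for the four published rows only.
ROWS = [(6609, 5228), (3813, 2673), (4199, 2552), (6823, 3102)]


def pooled(rows):
    """Total saved over total would-read: long lookups weigh more."""
    return sum(saved for _, saved in rows) / sum(would for would, _ in rows)


def per_query_ratios(rows):
    """One ratio per query: every lookup weighs the same."""
    return [saved / would for would, saved in rows]


ratios = per_query_ratios(ROWS)
print(f"pooled (4 rows): {pooled(ROWS):.1%}")
print(f"mean   (4 rows): {mean(ratios):.1%}")
print(f"median (4 rows): {median(ratios):.1%}")

# Pooled figure for the full run, from the totals quoted in the text.
full_run_pooled = 59668 / 89585
print(f"pooled (20 queries): {full_run_pooled:.1%}")
```

On these four rows the pooled figure sits below the mean because the largest lookup, the install query, is also the one that saved the least.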
Is the benchmark reproducible?
Yes, that is the whole point. The script ships in the trovex repo at benchmarks/token-savings/run.py and runs the same code path as the live /savings dashboard. Run it on this repo, or point it at any folder of markdown with --repo. It uses a local embedder, needs no API key, is deterministic, and leaves no state behind. Do not trust the number. Run it.
What embedder does the benchmark use?
BAAI/bge-small-en-v1.5, run locally through fastembed. The first run downloads a roughly 90MB model once, then everything stays on your machine. No cloud, no API key. The savings ratio is governed by token volume, not by which embedder ranks the candidates, so the headline holds across embedders even though the exact ranking shifts.
Does this hold on my repo?
We do not know, and we will not pretend to. The 60% is a measurement on the trovex repo, not a universal guarantee. A repo with little overlapping markdown will save less. A repo thick with half-stale duplicates will save more. The benchmark exists so you can get your own number in about a minute instead of taking ours on faith.
Is this per-lookup or whole-bill savings?
Per-lookup, on doc reads. It measures the markdown-rereading slice of an agent's work, not the code it reads or the reasoning it runs. On a doc-heavy repo with a lot of agent traffic that slice is large. On a tiny repo it is small. The benchmark reports exactly that slice, so you can see whether it is worth fixing on your codebase.