Five silent failures in a production invoice pipeline
The bugs that hurt a production pipeline don't crash. They return 200, paint the dashboard green, and quietly stop doing their job. Here are five we hit running an agentic accounting platform, and how we caught them.
tsukumo
Short version: the bug that hurts a production pipeline almost never crashes. It returns 200, the dashboard stays green, and the system quietly stops doing its job. We run an agentic accounting platform for a Swiss fiduciary, where the pipeline turns documents into posted invoices with money attached. The loud failures there were the easy part. They paged us, we fixed them, done. The five below are the ones that cost real time, because every one of them was shaped like success.
A crash is honest. It throws, your alerting catches a non-200 or a failed job, someone gets paged, and the blast radius is one run. You lose minutes.
A silent failure lies. The job reports success while doing nothing, so monitoring built around exceptions never fires. The blast radius is every run between the day it broke and the day a human happens to notice the output is wrong. In an invoice pipeline that human is often the client, and the gap is measured in weeks.
That asymmetry is the whole point. If you only invest in catching loud failures, you have optimized for the cheap ones. The expensive failures are the quiet ones, and they need a different kind of instrumentation: you have to watch for the absence of expected work, not the presence of errors.
We cached the "pending invoices to process" lookup in Redis to keep the hot path fast. At some point the cache was populated with an empty result during a window when the upstream was briefly unavailable, and it kept serving that empty result well past its intended life.
The pipeline did exactly what it was told. Zero pending invoices means zero work, so it ran clean and green, every cycle, for weeks. Nothing errored because nothing was wrong from the code's point of view. The fix was cheap once we saw it. The detection was the hard part.
The lesson: a cached value that implies "nothing to do" is a claim, not a fact. Reconcile it against an independent count on a schedule.
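A minimal sketch of that reconciliation, not our production code. The three callables are hypothetical hooks: one reads what the cache claims, one queries the system of record directly without going through the cache, and one raises an alert. The `tolerance` parameter is also an assumption, there to absorb small races between the two reads.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ReconcileResult:
    cached: int
    source: int
    drifted: bool


def reconcile_pending(
    read_cached_count: Callable[[], int],
    read_source_count: Callable[[], int],
    alert: Callable[[str], None],
    tolerance: int = 0,
) -> ReconcileResult:
    """Compare the cached pending count with an independent count.

    All hooks are hypothetical: read_source_count must bypass the cache,
    otherwise the check only compares the cache with itself.
    """
    cached = read_cached_count()
    source = read_source_count()
    drifted = abs(cached - source) > tolerance
    if drifted:
        alert(f"pending-invoice drift: cache says {cached}, source says {source}")
    return ReconcileResult(cached, source, drifted)


if __name__ == "__main__":
    # Illustrative numbers only: the cache claims nothing is pending,
    # while the source of truth disagrees.
    reconcile_pending(lambda: 0, lambda: 12, alert=print)
```

Run it on its own schedule, separate from the pipeline it checks, so a stalled pipeline cannot also stall the check.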
A document fetch returned a 404 for a batch of records. The handler was written to treat a 404 as "not there yet, skip and move on," which is reasonable for a genuinely missing record. Here the 404 came from a different cause, and around 40 invoices got silently parked, skipped on every subsequent run because the code had decided they did not exist.
No error, no retry, no alert. Just a slowly growing set of invoices the pipeline had quietly agreed to ignore.
The lesson: "skip" and "fail" are not the same status, and collapsing them is how you lose records. A skip should be a counted, time-boxed state with an upper bound. Anything skipped more than N times is an error, not a skip.
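One way to make a skip a counted state with an upper bound, sketched under assumptions: `InvoiceRecord`, `record_not_found`, and the bound `MAX_SKIPS = 3` are all hypothetical names and values, and the counter would need to be persisted between runs in a real system.

```python
from dataclasses import dataclass
from enum import Enum

MAX_SKIPS = 3  # hypothetical bound; pick one that fits your run cadence


class Outcome(Enum):
    SKIPPED = "skipped"
    FAILED = "failed"


@dataclass
class InvoiceRecord:
    invoice_id: str
    skip_count: int = 0  # must survive between runs (stored with the record)


def record_not_found(record: InvoiceRecord, max_skips: int = MAX_SKIPS) -> Outcome:
    """Handle a 404 for this record: count the skip, escalate past the bound."""
    record.skip_count += 1
    if record.skip_count > max_skips:
        # Skipped more than N times: this is no longer "not there yet".
        return Outcome.FAILED
    return Outcome.SKIPPED


if __name__ == "__main__":
    rec = InvoiceRecord("INV-001")
    for _ in range(MAX_SKIPS + 1):
        print(record_not_found(rec))
```

A time bound works the same way: store when the record was first skipped and escalate once it is older than the window you are willing to wait.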
One queue used an "in review" status to mean "a human needs to look at this." Over time, a code path started writing "in review" when it caught an internal error it didn't know how to handle. The error got swallowed, the record landed in a human queue that nobody was watching closely, and the dashboard counted it as normal pending work.
So a genuine fault was disguised as routine human-in-the-loop. The number on the dashboard was real. Its meaning was not.
The lesson: never overload a "needs attention" status to also mean "something broke." A masked error in a normal-looking bucket is invisible by construction. Give faults their own terminal, loud status.
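A sketch of keeping the two meanings apart. The status names, the `NeedsHumanReview` exception, and `process_invoice` are hypothetical. The point is only that "a person must decide" is raised deliberately, while anything unexpected lands in its own status and gets logged with a traceback.

```python
import logging
from enum import Enum
from typing import Callable

logger = logging.getLogger("invoice_pipeline")


class Status(Enum):
    POSTED = "posted"
    IN_REVIEW = "in_review"  # a human decision is needed
    FAULTED = "faulted"      # the code broke; never shown as pending work


class NeedsHumanReview(Exception):
    """Raised on purpose when a person has to look at the invoice."""


def process_invoice(invoice_id: str, handler: Callable[[str], None]) -> Status:
    try:
        handler(invoice_id)
    except NeedsHumanReview:
        return Status.IN_REVIEW
    except Exception:
        # Unknown error: log the traceback and use the dedicated status
        # instead of parking the record in the human queue.
        logger.exception("invoice %s faulted", invoice_id)
        return Status.FAULTED
    return Status.POSTED
```

With this split, the "in review" count on the dashboard means only what it says, and any non-zero `FAULTED` count can be an alert condition on its own.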
The platform posts to Business Central's cloud API. Under load it returns occasional 409s, which are transient. Our code, early on, treated a non-2xx as a final failure and marked the invoice accordingly, so a retryable hiccup turned into a stuck record that needed manual rescue.
This one is the inverse of failure 2. There we treated a real failure as a benign skip. Here we treated a benign, retryable failure as a real one. Both come from the same root cause: a handler that maps a status code to an outcome without understanding what the status code means in context.
The lesson: classify upstream errors by whether they are retryable, not by whether they are non-2xx. A 409 from an ERP under load and a 400 from a malformed payload are not the same event and must not share a code path.
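A minimal sketch of classifying by retryability, with assumptions labeled. Treating 409 as retryable reflects the transient behavior described above; elsewhere a 409 can be a genuine conflict. The other codes in `RETRYABLE_STATUSES` are an assumption about what is commonly transient, and `send` is a hypothetical callable that performs the request and returns the status code.

```python
import time
from enum import Enum
from typing import Callable


class Disposition(Enum):
    SUCCESS = "success"
    RETRY = "retry"
    PERMANENT = "permanent"
    EXHAUSTED = "retries_exhausted"


# 409 is the transient case from this section. The rest are an assumption
# about what is usually safe to retry; check your upstream's documentation.
RETRYABLE_STATUSES = frozenset({409, 429, 502, 503, 504})


def classify(status_code: int) -> Disposition:
    if 200 <= status_code < 300:
        return Disposition.SUCCESS
    if status_code in RETRYABLE_STATUSES:
        return Disposition.RETRY
    return Disposition.PERMANENT


def post_with_retry(
    send: Callable[[], int],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Disposition:
    """Retry only retryable outcomes, with exponential backoff between tries."""
    for attempt in range(max_attempts):
        disposition = classify(send())
        if disposition is not Disposition.RETRY:
            return disposition
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)
    # Still retryable after the last attempt: report it distinctly so it
    # is not confused with a permanent failure.
    return Disposition.EXHAUSTED
```

Retrying a post is only safe if a repeated request cannot double-post the invoice. That guarantee has to come from the request design and is outside this sketch.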
The most boring one, and the most common. A scheduled job ran, found nothing to process because of an upstream filter that had quietly changed, and reported success. Success on zero work looks identical to success on a full batch if the only thing you assert is "did the job finish without throwing."
A job that processes zero items is sometimes correct and sometimes a five-alarm fire. The code cannot tell the difference. Only a baseline can.
The lesson: assert on expected throughput, not just completion. "Finished without error" is a weak signal. "Processed a number of invoices within the range this pipeline normally sees" is a strong one.
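A throughput assertion can be very small. In this sketch `Baseline` and `throughput_alert` are hypothetical names, and the range is something you would set from what the pipeline normally processes. The numbers in the usage lines are illustrative, not ours.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Baseline:
    low: int   # fewest items a normal run processes
    high: int  # most items a normal run processes


def throughput_alert(processed: int, baseline: Baseline) -> Optional[str]:
    """Return an alert message if the run is outside the normal range."""
    if processed < baseline.low:
        return f"processed {processed}, expected at least {baseline.low}"
    if processed > baseline.high:
        return f"processed {processed}, expected at most {baseline.high}"
    return None


if __name__ == "__main__":
    normal = Baseline(low=20, high=200)  # illustrative range
    print(throughput_alert(0, normal))   # an empty batch produces a message
    print(throughput_alert(75, normal))  # a run inside the range produces None
```

The job still reports "finished", but the check runs afterward and turns a zero-item success into an alert whenever zero is outside the range.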
Across all five, the same three habits would have caught the failure in hours instead of weeks. None of them are exotic. They are the unglamorous part of running things in production.
| Habit | What it catches |
|-------|-----------------|
| Reconcile against ground truth | Stale caches, skipped records, lying statuses. Compare what the pipeline thinks it did against an independent source of truth on a schedule. |
| Alert on absence, not just errors | Empty batches, frozen queues. "We expected work and saw none" is an alert condition. |
| Give faults a loud, terminal status | Masked errors. A fault must never share a bucket with normal pending or human-review work. |
The unifying idea is that completion is not correctness. A pipeline that finishes without throwing has told you almost nothing. You want it to tell you what it did, compare that to what it should have done, and shout when the two disagree.
What this means if you're putting AI agents in the loop
Silent failures get worse the moment an agent sits in the pipeline, because an agent will confidently narrate a clean run over a system that did nothing. The model has no independent way to know the cache was stale or the batch was empty. It reports on the data it was handed, and if that data says "zero pending, all good," that is what you will read in a tidy paragraph.
This is why we tell every team the same thing: the hard part of running AI in production is not the model. It is the observability and reconciliation around it, so that when something quietly stops working, a human finds out from a system rather than from a client. We learned these five the honest way, by shipping them and then catching them. You can have the lessons without the weeks.
If your team is putting agents or automation in front of real money or real records, that reconciliation layer is the work that keeps it trustworthy. That's the kind of thing we help teams build.
What is a silent failure?
A silent failure is a fault that produces no error and no alert. The job returns success, the dashboard stays green, and the work simply doesn't happen. Because nothing throws, monitoring built around exceptions never sees it. You find out when a human notices the output is wrong, often weeks later.
Why don't silent failures trigger alerts?
Most alerting watches for errors: non-200 responses, exceptions, failed jobs. A silent failure is shaped like success. A cache returns a valid (but stale) value, a handler catches a 404 and moves on, a status field reads "in review." To catch these you have to alert on the absence of expected work, not the presence of errors.
How do you detect a stale cache serving bad data?
Cache the result, but reconcile against ground truth on a schedule. If a cached value implies zero work to do, treat that as a claim to verify, not a fact to trust. A simple counter ("invoices processed in the last 24h") compared against an independent source query will surface a cache serving a stale empty result long before a human does.
Should AI pipelines fail loud or fail safe?
Both, in order. Fail safe so a transient fault doesn't corrupt data or double-post, then fail loud so a human knows it happened. The dangerous middle ground is failing quiet: catching the error, doing nothing, and reporting success. With AI agents in the loop this gets worse, because an agent will confidently narrate a clean run over a pipeline that did nothing.