Why do infrastructure failures in AI systems go unnoticed?

Because most of them fail green. A mount that writes to the wrong place still returns success to the application. A container nearing its memory limit looks healthy until it doesn't. Monitoring built around application errors never sees these, so the system reports fine while quietly losing data or degrading, until a human notices the symptom downstream.

What is the rprivate vs rshared mount problem?

It's a container bind-mount propagation setting. If a mount is private when it needed to be shared, writes can land inside the container's own filesystem layer instead of on the shared volume, so files appear to be written successfully but never reach the real share. The app sees success; the data is gone on the next restart. A classic fail-green.

How do you stop a container from ballooning and starving the host?

Set explicit memory limits on every container, and alert on approach, well before the kill. An AI workload that loads models or buffers data can grow far beyond its steady state. Without a cap, one container can consume the host's memory and take its neighbors down with it. With a cap and an alert, it fails loudly and alone.

Is infrastructure really an AI problem?

For production AI, yes. Model-serving, vector stores, and document pipelines are heavier and stranger workloads than a typical web app, and they expose infra edge cases most teams never hit. The model gets the attention, but the system's reliability is decided by the boring layer underneath it.

Infra failures in a dockerized AI stack (real war stories)

tsukumo


Short version: the thing that takes down a production AI system is almost never the model or the application code. It is the boring infrastructure underneath, and the worst of it fails green: it reports success while quietly losing data or starving the host. We run a dockerized AI stack for a fiduciary, and the failures that cost us real time were not exotic. They were mounts, memory limits, autoscalers, and proxy routes, each one happy to look healthy while doing the wrong thing.

Why do infra failures hide?

Because they fail green, and green is invisible to monitoring built around errors.

An application crash throws and pages someone. An infra fault often does not. A mount writes to the wrong place and returns success. A container creeps toward its memory ceiling and reports healthy. A proxy serves a dead route with a clean status line. None of it trips error-based alerting, so the dashboard stays calm while the system degrades underneath. The lesson is the same one that runs through every production war story: completion is not correctness, and you have to watch the infra layer as deliberately as the app.

Here are four that bit us.

1. The mount that wrote to the void

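A container bind-mount was set to private propagation when it needed to be shared (the rprivate versus rshared setting). With private propagation, mounts created on one side of the bind after the container starts are not propagated to the other side, so a path can look right while pointing somewhere else. Writes that were supposed to land on a shared volume instead landed inside the container's own ephemeral layer.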

The application saw every write succeed. The files were real, briefly. On the next container restart they were gone, because they had never reached the actual share. A perfect fail-green: success reported, data discarded.

The lesson: a successful write call is not a durable write. For anything that must persist, verify the bytes are where you think they are. The return code only tells you the call finished.
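One way to check where the bytes actually land is to ask the kernel which mount backs the target path. The sketch below is illustrative only, not the check we ran: it parses text in the Linux `/proc/self/mountinfo` format and reports the filesystem type and propagation mode of the mount behind a path. The sample mount table, the paths, and the helper names (`parse_mountinfo`, `find_mount`) are assumptions made for the example.

```python
from typing import NamedTuple, Optional


class MountEntry(NamedTuple):
    mount_point: str
    fstype: str
    propagation: str  # "shared", "slave", "unbindable", or "private"


def parse_mountinfo(text: str) -> list:
    """Parse text in /proc/self/mountinfo format into MountEntry rows."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        try:
            sep = fields.index("-", 6)  # optional fields end at a lone "-"
        except ValueError:
            continue  # not a well-formed mountinfo line
        if sep + 1 >= len(fields):
            continue
        optional = fields[6:sep]
        # Simplification: a mount that is both shared and slave reports "shared".
        if any(f.startswith("shared:") for f in optional):
            propagation = "shared"
        elif any(f.startswith("master:") for f in optional):
            propagation = "slave"
        elif "unbindable" in optional:
            propagation = "unbindable"
        else:
            propagation = "private"  # no optional propagation field at all
        entries.append(MountEntry(fields[4], fields[sep + 1], propagation))
    return entries


def find_mount(entries, path: str) -> Optional[MountEntry]:
    """Return the mount backing an absolute path (longest matching mount point)."""
    best = None
    for entry in entries:
        prefix = entry.mount_point.rstrip("/") + "/"
        if path == entry.mount_point or path.startswith(prefix):
            # ">=" so a later mount stacked on the same point wins.
            if best is None or len(entry.mount_point) >= len(best.mount_point):
                best = entry
    return best


# Hypothetical mount table: an overlay root plus an NFS share at /data.
SAMPLE = """\
22 1 0:21 / / rw,relatime - overlay overlay rw
40 22 0:45 / /data rw,relatime shared:12 - nfs4 10.0.0.5:/export rw
"""

entries = parse_mountinfo(SAMPLE)
print(find_mount(entries, "/data/out/report.bin"))  # backed by the nfs4 share
print(find_mount(entries, "/srv/out/report.bin"))   # falls through to the overlay root
```

In a real container you would feed it `open("/proc/self/mountinfo").read()` and a resolved absolute path, then fail the job if the write target turns out to sit on the overlay root instead of the share. The sketch ignores the octal escapes mountinfo uses for spaces in paths.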

2. The container that ate the host

One container, with no memory limit, ballooned to roughly 184 GB and starved the host. AI workloads load models and buffer data, so they can grow far past their steady state. With no cap, that growth did not stop at the container's own boundary. It consumed the host's memory and dragged its neighbors down with it.

The lesson: every container gets an explicit memory limit, and you alert on approach, well before the kill. A capped container fails loudly and alone. An uncapped one fails silently and takes the block with it.
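"Alert on approach" can be as simple as comparing current usage to the limit from inside the container. The following is a minimal sketch, assuming a cgroup v2 host where the container sees its own `memory.current` and `memory.max` under `/sys/fs/cgroup`. The 80% threshold and the function names are illustrative choices, not values from our stack.

```python
from pathlib import Path


def memory_status(current_text: str, max_text: str, warn_ratio: float = 0.8) -> str:
    """Classify memory usage from the raw contents of the two cgroup v2 files.

    Returns "uncapped" when no limit is set, "warn" when usage is at or above
    warn_ratio of the limit, and "ok" otherwise.
    """
    limit = max_text.strip()
    if limit == "max":  # cgroup v2 writes the literal "max" when there is no cap
        return "uncapped"
    limit_bytes = int(limit)
    if limit_bytes <= 0:
        return "warn"  # a zero limit leaves no headroom at all
    used_bytes = int(current_text.strip())
    return "warn" if used_bytes / limit_bytes >= warn_ratio else "ok"


def read_container_memory(cgroup_dir: str = "/sys/fs/cgroup"):
    """Read (memory.current, memory.max); the default path is an assumption."""
    base = Path(cgroup_dir)
    return (base / "memory.current").read_text(), (base / "memory.max").read_text()


# Example with literal file contents: 900 MiB used against a 1 GiB cap.
print(memory_status(str(900 * 2**20), str(2**30)))  # warn
print(memory_status("123456", "max"))               # uncapped
```

Treating "uncapped" as its own alert state is the point: a container with no limit should page someone just as surely as one that is close to its limit.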


3. The autoscaler that tripped a breaker

Under load, the autoscaler's behavior tripped a circuit breaker, and the protection meant to keep the system up became the reason part of it was down. This is the cruel class of infra bug: the safety mechanism firing at the wrong moment, so the thing you added for resilience is what causes the outage.

The lesson: test your protective machinery under the load that triggers it, never only in the happy path. A circuit breaker or autoscaler you have never watched trip is an untested code path holding your uptime.
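To make "watch it trip" concrete, here is a toy example. The `CircuitBreaker` class is a hypothetical, deliberately simple breaker, not the one in our stack, and the thresholds are made up. What matters is the shape of the test: inject the clock, force the failures, and assert on every state the breaker passes through, including recovery.

```python
class CircuitBreaker:
    """Toy breaker: opens after max_failures, allows a trial after reset_after."""

    def __init__(self, max_failures: int, reset_after: float, clock):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock  # injected so tests control time
        self.failures = 0
        self.opened_at = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        # After the cool-down, let a trial request through (half-open).
        return self.clock() - self.opened_at >= self.reset_after

    def record(self, ok: bool) -> None:
        if ok:
            self.failures = 0
            self.opened_at = None
        else:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()


def trip_and_recover() -> list:
    """Drive the breaker through trip, cool-down, trial, and recovery."""
    now = [0.0]
    breaker = CircuitBreaker(max_failures=3, reset_after=30.0, clock=lambda: now[0])
    seen = []
    for _ in range(3):
        breaker.record(ok=False)
    seen.append(breaker.allow())  # just tripped
    now[0] = 29.0
    seen.append(breaker.allow())  # still cooling down
    now[0] = 30.0
    seen.append(breaker.allow())  # trial request allowed
    breaker.record(ok=True)
    seen.append(breaker.allow())  # closed again
    return seen


print(trip_and_recover())  # [False, False, True, True]
```

A unit test like this only covers the breaker's own logic. The interaction with an autoscaler still needs a load test that actually pushes the system past the trip point.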

4. The proxy serving dead routes

Traefik endpoints went dead while still appearing to be wired. Requests hit routes that resolved to nothing, and from the outside the proxy looked configured and healthy. Nothing errored loudly; the routes were simply hollow.

The lesson: health-check the actual path end to end, never only whether the proxy is running. A reverse proxy that is up but routing to nowhere is indistinguishable from a working one until someone follows the request all the way through.
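An end-to-end check means requesting the route through the proxy and confirming that the response could only have come from the real backend. The sketch below assumes the backend exposes an endpoint whose body contains a known marker string. The URL, the marker, and the function names are hypothetical.

```python
import urllib.error
import urllib.request


def http_fetch(url: str, timeout: float = 5.0):
    """Return (status, body) for a GET, including non-2xx responses."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8", "replace")
    except urllib.error.HTTPError as err:
        # urlopen raises on 4xx/5xx; the status and body are still useful.
        return err.code, err.read().decode("utf-8", "replace")


def route_is_live(url: str, marker: str, fetch=http_fetch) -> bool:
    """True only if the route answers 200 AND the body carries the marker."""
    try:
        status, body = fetch(url)
    except OSError:
        return False  # connection refused, DNS failure, timeout
    return status == 200 and marker in body


# Stubbed fetchers stand in for a live proxy in this example.
print(route_is_live("http://proxy.internal/api/health", '"status": "ok"',
                    fetch=lambda url: (200, '{"status": "ok"}')))    # True
print(route_is_live("http://proxy.internal/api/health", '"status": "ok"',
                    fetch=lambda url: (404, "404 page not found")))  # False
```

Checking for the marker is what separates this from a liveness ping: a proxy answering with its own default page passes a status-only check, but not this one.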

The pattern across all four

Four failures, and the gap between how they looked and what they were

Failure	Looked like	Actually was
Private mount	Successful writes	Data written to the void
Uncapped container	A healthy service	A host-killer in slow motion
Autoscaler / breaker	Resilience machinery	The cause of the outage
Dead proxy routes	A configured proxy	Requests resolving to nothing

Every one of them failed green, and every one would have been caught early by the same habit: verify the real-world effect, not the status code. Did the bytes land. Is memory actually bounded. Does the breaker behave when it trips. Does a request reach the service at the far end. None of that is exotic. It is the unglamorous discipline of treating infrastructure as part of the system you observe, not the part you assume.

What this means for your team

If you are running AI in production, the model will get the attention and the infrastructure will cause the outages. Model-serving, vector stores, and document pipelines are heavier and stranger than a typical web app, and they surface infra edge cases most teams never hit. Budget real time for the boring layer, instrument it for fail-green, and assume that "the service is up" tells you almost nothing on its own.

We learned these the slow way, by running the stack and catching them in production. If your team is standing up serious AI infrastructure and wants to skip a few of these, that's the work we do.
