Agency | 17 June 2026 | 4 min read
The infra failures nobody warns you about in a dockerized AI stack
The thing that takes down an AI system in production usually isn't the model or the app. It's the boring infrastructure underneath, and it fails green. Four real failures from a dockerized AI stack: a mount writing to the void, a runaway container, a tripped autoscaler, dead proxy routes.

Short version: the thing that takes down a production AI system is almost never the model or the application code. It is the boring infrastructure underneath, and the worst of it fails green: it reports success while quietly losing data or starving the host. We run a dockerized AI stack for a fiduciary, and the failures that cost us real time were not exotic. They were mounts, memory limits, autoscalers, and proxy routes, each one happy to look healthy while doing the wrong thing.
Why do infra failures hide?
Because they fail green, and green is invisible to monitoring built around errors.
An application crash throws and pages someone. An infra fault often does not. A mount writes to the wrong place and returns success. A container creeps toward its memory ceiling and reports healthy. A proxy serves a dead route with a clean status line. None of it trips error-based alerting, so the dashboard stays calm while the system degrades underneath. The lesson is the same one that runs through every production war story: completion is not correctness, and you have to watch the infra layer as deliberately as the app.
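To make "completion is not correctness" concrete, here is a minimal Python sketch of checks that test the outcome rather than the return code. It is an illustration under assumptions, not the stack's actual monitoring: the function names, the `.volume-sentinel` file, the 0.85 threshold, and the expected-marker convention are all hypothetical.

```python
from pathlib import Path

# Hypothetical marker file that the host side places inside the real volume.
SENTINEL_NAME = ".volume-sentinel"


def mount_is_real(mount_dir: str, sentinel_name: str = SENTINEL_NAME) -> bool:
    """True only if the sentinel placed in the real volume is visible here.

    If the mount silently failed, writes still succeed, but they land in
    the container's own filesystem, where the sentinel does not exist.
    """
    return (Path(mount_dir) / sentinel_name).is_file()


def memory_headroom_ok(usage_bytes: int, limit_bytes: int,
                       max_fraction: float = 0.85) -> bool:
    """True while usage stays under max_fraction of the memory limit.

    A container can report healthy right up to its ceiling, so this
    compares usage against the limit instead of trusting the health flag.
    The 0.85 default is an assumed threshold, not a recommendation.
    """
    if limit_bytes <= 0:
        raise ValueError("limit_bytes must be positive")
    return usage_bytes / limit_bytes < max_fraction


def route_is_alive(status: int, body: str, expected_marker: str) -> bool:
    """True only if the response is 2xx AND contains the expected marker.

    A dead route can still return a clean status line, so the check also
    requires content that only the real backend would produce.
    """
    return 200 <= status < 300 and expected_marker in body
```

Each function returns a plain boolean, so a failed check can be raised through whatever alerting already exists. These faults never produce an error on their own.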