MQTT at fleet scale: what a few thousand IoT devices and one bad Postgres password taught us

tl;dr. A single misconfigured Postgres credential turned an ordinary EMQX cluster into a 50,000-connection-attempt-per-minute self-DDoS. The fix took ten minutes once we found it. The post-mortem took two days. Here is what we changed about authentication, observability, and runbook design after that night, and what every team running an MQTT fleet of more than a few thousand devices should do before they have their version of this incident.

Your IoT fleet does not fail gracefully. It fails the way an angry crowd fails: all at once, all reconnecting, all hitting the same backend at the same instant, asking the same question, getting the same wrong answer. If you have not had this kind of incident yet, it is because your fleet is still small, or because you have been lucky.

This is the story of the night we ran out of luck at Pstryk, and the four things we changed afterwards. None of them are exotic. All of them are the kind of thing you should have in place before the incident, not after.

The incident in ninety seconds

Late evening. A routine deployment to a staging environment somehow leaked a configuration change into production. The change rotated a Postgres credential that EMQX, our MQTT broker, used as its authentication backend for one of the SSL listeners. The new password was correct. The configuration that pointed at it was not.

Within thirty seconds, every IoT device in our fleet that connected to that listener got an authentication failure on its next reconnect cycle. The MQTT client SDKs on those devices did what MQTT client SDKs do: they retried. Quickly. With a tight default backoff that, at fleet scale, was not a backoff at all.

The cluster started seeing roughly 50,000 connection attempts per minute against a Postgres instance that, on a normal day, served maybe 800 auth queries per minute. That is roughly sixty times its usual auth load. Postgres CPU saturated. Auth queries timed out. Unrelated queries from other services started timing out as collateral damage. On-call alerts went off in three different channels in the same minute, none of which immediately pointed at the actual root cause.

The fix, once we located the actual cause, was to disable authentication on that specific listener temporarily, let the fleet reconnect, restore the correct credential, and re-enable authn. Total downtime as the user saw it was about twelve minutes. Total time spent understanding why was the rest of the night and most of the next day.

Why MQTT fails this way and HTTP does not

If a similar credential rotation had happened on an HTTP API, the failure mode would have been quieter. HTTP clients are typically driven by human users or by scheduled jobs. They retry slowly, or not at all. They do not maintain persistent connections that have to be re-established the moment they drop.

MQTT is different in three ways that matter under failure.

First, MQTT clients are designed to maintain a long-lived persistent connection. That is the whole point of the protocol. When the connection drops, the client's job is to re-establish it as quickly as possible, because in the meantime it cannot deliver telemetry or receive commands.

Second, fleet-wide connection drops correlate. If the cause of the drop is on the broker side, every device on that listener experiences it at almost the same moment. There is no natural smoothing across the fleet, the way there would be across a population of human users with browsers.

Third, the default reconnect logic in most MQTT client SDKs is too aggressive for fleet scale. The libraries were written for individual devices, not for hordes of them. The default exponential backoff, if it exists at all, often starts at one second and caps at sixty, which is far too tight for a fleet of several thousand.
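To put a number on "too tight", here is a small sketch. It is not taken from any particular SDK. It assumes a hypothetical fleet of 5,000 devices, a first attempt at the moment of the drop, attempts that fail instantly, and a delay that doubles from one second up to a sixty-second cap with no jitter. The function name and the fleet size are illustrative only.

```python
def attempts_in_window(initial_s: float, cap_s: float, window_s: float) -> int:
    """Count the connection attempts one device makes within window_s seconds.

    Assumptions: the first attempt fires at t=0, every attempt fails
    immediately, and the delay doubles after each failure up to cap_s.
    """
    t, delay, attempts = 0.0, initial_s, 0
    while t < window_s:
        attempts += 1                     # attempt fires at time t and fails
        t += delay                        # wait before the next attempt
        delay = min(delay * 2, cap_s)     # exponential backoff with a cap
    return attempts


FLEET_SIZE = 5_000  # hypothetical fleet size, not a figure from the incident

per_device = attempts_in_window(initial_s=1, cap_s=60, window_s=60)
fleet_first_minute = per_device * FLEET_SIZE
print(per_device, fleet_first_minute)
```

Under those assumptions each device makes six attempts in the first minute, so the fleet makes 30,000. Because every device starts counting from the same instant, those attempts arrive in synchronized waves rather than as a smooth stream.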

Combine those three properties and the MQTT layer behaves, under any broker-side failure, like a self-organizing distributed denial of service against whichever component caused the failure. In our case that was Postgres. In yours it might be your DNS, your TLS terminator, or your auth service.

What we changed about authentication

The first lesson is that an MQTT broker should not authenticate against a system that can experience load-correlated failures with the broker itself. Postgres is fine when it works. When it is the same Postgres that serves your application traffic, and the broker is suddenly hammering it with retries, you have a single point of failure dressed up as two services.

We moved authentication to a JWT-based scheme, with the tokens issued by a small dedicated service and verified by the broker against a public key. Token verification has no backend dependency at the moment of authn beyond a key lookup that can be cached for hours. A failed deployment of the token issuer no longer takes down the broker, because the broker does not contact the issuer at all during normal operations.
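The "cached for hours" part is what removes the shared fate, and it can be sketched in a few lines. This is a generic illustration, not EMQX's implementation and not our production code. The class name, the `fetch` callable, the TTL, and the choice to keep serving the last known key when a refresh fails are all assumptions made for the example. Signature verification itself is left out.

```python
import time
from typing import Callable, Optional


class CachedVerificationKey:
    """Holds a verification key and refreshes it at most once per TTL."""

    def __init__(self, fetch: Callable[[], bytes], ttl_s: float = 6 * 3600,
                 retry_s: float = 60.0, clock: Callable[[], float] = time.monotonic):
        self._fetch = fetch        # hypothetical loader, e.g. reads the key from the issuer
        self._ttl_s = ttl_s        # how long a fetched key is trusted
        self._retry_s = retry_s    # how long to wait after a failed refresh
        self._clock = clock
        self._key: Optional[bytes] = None
        self._next_refresh = 0.0

    def get(self) -> bytes:
        now = self._clock()
        if self._key is None or now >= self._next_refresh:
            try:
                self._key = self._fetch()
                self._next_refresh = now + self._ttl_s
            except Exception:
                if self._key is None:
                    raise  # nothing cached yet, so there is nothing to fall back on
                # Assumed policy: keep the stale key and retry later, so a
                # broken issuer is not hit on every authentication.
                self._next_refresh = now + self._retry_s
        return self._key
```

The design choice worth noticing is the retry interval. Without it, a failing refresh would be retried on every single authentication, which recreates the original storm against a different backend.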

There are other valid choices. mTLS with certificates issued at provisioning time gives you even stronger isolation, at the cost of more operational complexity around certificate rotation and revocation. File-based password databases work for very small fleets and are cheaper than they look, but they do not scale beyond a few hundred devices because pushing updates is painful. The point is not the specific mechanism. The point is that your authentication path should not share fate with services that the broker itself can stress under failure.

What we changed about reconnect logic

The second lesson is that the default reconnect logic in your client SDK is almost certainly wrong for your fleet, and you have to fix it on the device side before you trust any broker-side mitigation.

Our updated logic looks like this. The first reconnect attempt happens after a random delay between five and fifteen seconds. The second happens after a random delay between thirty seconds and two minutes. After that, the backoff caps at five to ten minutes with continuous jitter. The randomness is the part that matters, because without jitter you get reconnect storms even with long delays. Every device in the fleet timing out at the same second, then sleeping for the same sixty seconds, then retrying at the same second again, is a failure mode that looks suspiciously like the original problem.
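The schedule above can be sketched as a small delay function. This is a minimal illustration of the idea rather than our device firmware. The names are made up for the example, and it assumes a uniform random draw within each window.

```python
import random

# Delay windows in seconds, following the schedule described above.
DELAY_WINDOWS = [
    (5, 15),      # first reconnect attempt
    (30, 120),    # second reconnect attempt
]
STEADY_WINDOW = (300, 600)  # every later attempt: five to ten minutes


def reconnect_delay(attempt: int, rng=random) -> float:
    """Return how long to sleep before reconnect attempt number `attempt` (1-based).

    The delay is drawn uniformly from the window for that attempt, so two
    devices that dropped at the same instant are unlikely to retry together.
    """
    if attempt < 1:
        raise ValueError("attempt is 1-based")
    if attempt <= len(DELAY_WINDOWS):
        low, high = DELAY_WINDOWS[attempt - 1]
    else:
        low, high = STEADY_WINDOW
    return rng.uniform(low, high)
```

A device loop would call `reconnect_delay(n)` before its n-th retry and reset `n` to 1 after a successful connection. Passing a seeded `random.Random` instance as `rng` makes the behaviour reproducible in tests.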

We also added a circuit breaker on the device side. If the client experiences three consecutive auth failures, it goes into a longer cool-down, an hour or more, and reports the failure through the secondary out-of-band channel that every IoT fleet should have for exactly this reason. A device that cannot connect to its primary broker should not spend the next four hours hammering on the door. It should sit down, wait, and tell someone.
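A device-side breaker of this kind can be sketched as follows. It is an illustration under stated assumptions, not our firmware: the class and method names are hypothetical, the cool-down is fixed at one hour, and the out-of-band report is left to the caller because that channel differs from fleet to fleet.

```python
from dataclasses import dataclass


@dataclass
class AuthCircuitBreaker:
    failure_threshold: int = 3      # consecutive auth failures before tripping
    cooldown_s: float = 3600.0      # assumed cool-down of one hour
    consecutive_failures: int = 0
    open_until: float = 0.0         # timestamp until which attempts are blocked

    def allow_attempt(self, now: float) -> bool:
        """True if the device may try to connect at time `now`."""
        return now >= self.open_until

    def record_success(self) -> None:
        """A successful connection clears the failure streak."""
        self.consecutive_failures = 0

    def record_auth_failure(self, now: float) -> bool:
        """Record one auth failure.

        Returns True when this failure trips the breaker, which is the
        caller's cue to report through the out-of-band channel.
        """
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            self.open_until = now + self.cooldown_s
            self.consecutive_failures = 0
            return True
        return False
```

On a real device `now` would come from a monotonic clock such as `time.monotonic()`, so that wall-clock adjustments cannot shorten or extend the cool-down.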

What we changed about observability

The third lesson is that the observability that catches an incident like this in thirty seconds is not the same as the observability that runs your dashboards.

Three metrics matter, and you should alert on all three.

Connection attempt rate per listener is the canary. A normal listener has a stable connection rate that wobbles slowly with the daily fleet pattern. A connection rate that doubles in a minute, regardless of the cause, is always something you want to know about.

Auth failure rate as a percentage of connection attempts is the differentiator. You will always have some auth failures from misconfigured devices and stale credentials in the long tail of the fleet. The number to watch is the percentage. A percentage that goes from one or two percent to fifty percent in under a minute is not a device problem. It is a broker or backend problem.

Broker CPU per shard, plus auth backend latency, are the lagging indicators that confirm the cause. By the time these move you already know there is a problem, but you need them to triage which subsystem is the actual fault.

We tied all three into a single page, and we tied that page into the on-call alerting with thresholds tuned to our specific fleet shape. The alert that would have caught this incident in thirty seconds was the auth failure rate alert, and we did not have it before.
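The two leading alerts reduce to a comparison between consecutive one-minute samples per listener. The sketch below is illustrative only. The type, the function name, the alert labels, and the default thresholds are assumptions for the example, and real thresholds need tuning to your own fleet shape.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ListenerSample:
    connect_attempts: int   # connection attempts in a one-minute window
    auth_failures: int      # auth failures in the same window


def evaluate(prev: ListenerSample, curr: ListenerSample,
             rate_factor: float = 2.0,
             failure_ratio_threshold: float = 0.5) -> List[str]:
    """Return the names of the alerts that fire for the current window."""
    alerts = []
    # Canary: the connection rate at least doubled from one minute to the next.
    if prev.connect_attempts > 0 and curr.connect_attempts >= rate_factor * prev.connect_attempts:
        alerts.append("connect_rate_spike")
    # Differentiator: share of attempts that failed authentication.
    if curr.connect_attempts > 0:
        ratio = curr.auth_failures / curr.connect_attempts
        if ratio >= failure_ratio_threshold:
            alerts.append("auth_failure_ratio")
    return alerts


normal = ListenerSample(connect_attempts=800, auth_failures=12)       # hypothetical quiet minute
storm = ListenerSample(connect_attempts=50_000, auth_failures=49_000) # hypothetical incident minute
print(evaluate(normal, storm))
```

With these hypothetical samples both alerts fire in the first bad minute. The lagging indicators, broker CPU and auth backend latency, are deliberately left out of the sketch because they confirm the cause rather than detect it.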

What we changed about runbooks

The fourth lesson is that the runbook for this incident did not exist that night, and we wrote it the next morning. It is short.

If broker auth is failing fleet-wide, disable authentication on the affected listener immediately, even though it feels wrong, because the alternative is hours of service degradation. Confirm fleet reconnection. Investigate the root cause with the broker still serving traffic. Re-enable authentication only after the root cause is fixed and you have confirmed in staging that the authentication path is healthy.

The instinct of a careful operator is to leave authentication on, because authentication is a security boundary. The instinct of a tired operator at 11pm is to do whatever stops the alerts. The runbook exists so you do not have to choose at midnight, because the runbook tells you that bypassing authn temporarily on a controlled listener for a controlled fleet is the right move, and it tells you under what conditions to undo it.

Write that runbook before you need it. Practice it once a quarter. The team that has practiced the bypass-authn drill is not the team that hesitates at midnight.

What this means if you are running an MQTT fleet

If your fleet is under a thousand devices, you can probably get away with most of the defaults for another year. If your fleet is between one thousand and ten thousand devices, you are in the zone where this kind of incident becomes likely, and the four changes above are roughly the floor of what you should have in place. If your fleet is above ten thousand, none of this is news to you, and you have probably already lived through your own version of this story.

The deeper lesson, the one I keep coming back to, is that distributed systems do not fail the way single systems fail. They fail in correlated ways that amplify the smallest mistake into a fleet-wide event. Authentication, reconnect logic, observability, and runbooks are the four levers that turn correlated failure into bounded failure, and you build them while the fleet is still calm, because you do not get to build them during the incident.

If your team runs an IoT fleet at scale and any of the above sounds painfully familiar, this is exactly the kind of work I help operators with through the consulting page. The first conversation is usually about the runbook you do not yet have.

Mateusz Kozak Fractional CTO / Warsaw

CTO at Pstryk. I help climate, energy, and AI startups ship hard technical products. If this piece resonated and you're building in adjacent territory, that's exactly the conversation I want to be having.
