One recent morning I opened speed.packetlog.org in a browser to check something and got a 502 Bad Gateway. That is how I found out I had been slowly strangling my own Nginx.

First guesses, mostly wrong

My first thought was memory. The VPS is small, and I had added a couple of things since the last retrospective, so I assumed something was leaking. free -m showed nothing unusual. Not memory.

Next I thought the speedtest-go process had crashed. Its systemd unit has Restart=always, so that would have been unusual, but still worth checking. systemctl status speedtest showed the process running, uptime three weeks. Fine.

Which left Nginx.

The actual problem

I opened /var/log/nginx/error.log, which I had not read in weeks. The recent entries were all this:

[alert] 512 worker_connections are not enough, reusing connections

Every time that line appears, Nginx has run out of free slots in its connection pool and is reclaiming an idle keepalive socket to make room for a new one. From the client's perspective, a request hangs and then fails. From Nginx's perspective, the pool is saturated.
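To see how long this had been going on, a quick tally of the alerts by day is enough. A small sketch, assuming the stock error_log timestamp format ("YYYY/MM/DD HH:MM:SS"):

```shell
# Tally the worker_connections alerts by day, to see when the
# saturation actually started. The first whitespace-separated field
# of each error_log line is the date.
alert_days() {
  grep 'worker_connections are not enough' "$1" \
    | awk '{print $1}' \
    | sort | uniq -c
}
```

Point it at the log with alert_days /var/log/nginx/error.log.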

What I got wrong last October

In my October Nginx notes I called 512 worker_connections “more than enough” and said I would never come close to filling it. Narrow framing. For a personal blog serving cached HTML, 512 is fine. The self-hosted speed test is a different story. Long-lived upload and download streams, ping loops, keepalives that take a while to drain — the pool does not turn over the way it does for a cached homepage.

Looking back through the access log I cannot point at a single cause. Most likely some combination of slow upstream reads holding connections open longer than usual and a few clients reconnecting before old sockets cleared. Whatever the exact path, 512 turned out to be a ceiling I could reach without trying.
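For a live view of the same pressure, grouping the server's open sockets by TCP state shows the shape of the problem: a pile of ESTAB plus lingering FIN-WAIT and TIME-WAIT sockets is what a saturated pool looks like from the outside. A sketch, assuming HTTPS on port 443:

```shell
# Group open sockets by TCP state (first column of ss output,
# header skipped), most common state first.
count_states() {
  awk 'NR > 1 {print $1}' | sort | uniq -c | sort -rn
}

# Snapshot of sockets held open on the HTTPS port, if ss is available.
if command -v ss >/dev/null; then
  ss -tan 'sport = :443' | count_states
fi
```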

Monitoring that missed it

The part that bothered me more than the outage itself: I run external uptime monitoring specifically so I do not find problems by accident. The monitor had not fired. Looking at its configuration, it was pointed at https://packetlog.org/, the blog homepage. That is a small HTML file Nginx serves from cache, barely touching any upstream. A healthy homepage and a broken speed test do not contradict each other, and the monitor was never going to notice.

I was watching the wrong thing. A basic methodology mistake.

The fix

Two changes in /etc/nginx/nginx.conf:

events {
    worker_connections 4096;
}

worker_rlimit_nofile 8192;

The second directive matters because each connection consumes at least one file descriptor — two when proxying to an upstream — and the systemd default limit of 1024 would cap Nginx well before 4096 connections. worker_rlimit_nofile raises that ceiling for Nginx worker processes specifically.
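It is worth verifying that the new ceiling actually reached the workers, from the kernel's point of view rather than the config file's. A small helper:

```shell
# Print the open-file limit a running process is actually subject to,
# straight from /proc.
check_fd_limit() {
  grep 'Max open files' "/proc/$1/limits"
}
```

Run it against a worker with check_fd_limit "$(pgrep -f 'nginx: worker' | head -n 1)"; the pgrep pattern assumes the default process title.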

nginx -t && systemctl reload nginx. No restart needed; existing connections survive a reload. I watched the error log for a few minutes. Nothing new.
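For watching how full the pool actually gets rather than waiting for the alert, Nginx's stub_status module (included in the distro packages) is enough. A minimal, local-only endpoint might look like this, dropped inside an existing server block:

```nginx
# Hypothetical status endpoint; name and placement are up to you.
location = /nginx_status {
    stub_status;
    allow 127.0.0.1;
    deny all;
}
```

After a reload, curl -s http://127.0.0.1/nginx_status reports active, accepted, and waiting connection counts.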

What is different now

I added a dedicated monitor for speed.packetlog.org that checks for the string LibreSpeed in the response body, not just a 200 status. A monitor that only checks whether the host answers would have called this morning's 502 healthy, and even a clean 200 can sit in front of a page that no longer works, so content matching is the sturdier test.
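The check itself amounts to very little, sketched here with curl; the URL and the marker string match my setup and would need adjusting for anyone else's:

```shell
# Succeed only if the page is fetchable AND the body contains the
# marker string; a bare status check would pass a broken-but-200 page.
check_speedtest() {
  curl -fsS --max-time 10 "$1" | grep -q 'LibreSpeed'
}
```

Something like check_speedtest https://speed.packetlog.org/ || notify-somehow is the whole monitor.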

I also added a line to the change log I started keeping after the six-month retrospective, a terse entry about the worker_connections bump and why. That note will outlive my memory of why I touched the config.

On defaults

The server has been quiet since. I am not certain 4096 is the right number; I picked it because it is round, not because I measured. If the alert comes back I will revisit. For now this is one of those situations where “more than enough” turns out to be a statement about assumptions rather than capacity. I had built the assumption six months ago and never stopped to check whether it still held.


Small update, a day later: the speedtest-go systemd unit also needed LimitNOFILE=4096 to avoid its own file descriptor cap on upload streams. Added it after spotting a second, rarer set of errors in journalctl -u speedtest.
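For reference, the override lives in a drop-in next to the unit rather than in the unit file itself, so package updates cannot clobber it; the path and unit name match my setup:

```ini
# /etc/systemd/system/speedtest.service.d/limits.conf
# Apply with: systemctl daemon-reload && systemctl restart speedtest
[Service]
LimitNOFILE=4096
```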