09:00 AM
Marketing sends the blast.
09:15 AM
The load balancer gives up.
The application servers were alive (CPU at 40%), but Nginx was dropping connections. Here is the autopsy of the failure and the configuration changes that fixed it.
## The Symptom
We were managing a FinTech application scaling for a Series B launch event. Traffic surged by roughly 300% in 10 minutes. While the Ruby on Rails application servers were handling the throughput, the Nginx reverse proxy sitting in front began throwing sporadic 502 Bad Gateway errors.
The immediate assumption was resource exhaustion. We scaled the droplets vertically. No change. The 502s persisted even as CPU and RAM utilization dropped.
## The Investigation
I stopped looking at the graphs and started tailing the Nginx error logs. Amidst the noise, one line repeated every time a 502 occurred: `upstream sent too big header while reading response header from upstream`.
The culprit wasn't load; it was buffer size. The application uses large JWTs (JSON Web Tokens) and sets several cookies for tracking user sessions. Under normal load, these headers fit within Nginx’s default buffers.
However, during the high-load event, the application was also appending additional debugging headers and larger session payloads. Nginx's `proxy_buffer_size` (4k or 8k by default, matching the platform's memory page size) holds the first part of the upstream response, which includes all of the response headers. When the Rails app's headers overflowed that buffer, Nginx aborted the request and returned the 502 instead of passing the response through.
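A quick back-of-envelope check makes the failure mode concrete: add up the bytes your response headers occupy and compare against the default buffer. The values below are illustrative (the 3000-character token is a stand-in for a large JWT, and the header names are made up, not taken from the incident):

```shell
# Would a response's headers overflow nginx's default proxy_buffer_size
# (4k/8k)? Simulate a large JWT session cookie plus tracking and debug
# headers, then count the bytes.
jwt=$(head -c 3000 /dev/zero | tr '\0' 'x')
headers=$(printf 'Set-Cookie: session=%s\r\nSet-Cookie: tracking=%s\r\nX-Debug-Trace: %s\r\n' \
    "$jwt" "$jwt" "$jwt")
size=$(printf '%s' "$headers" | wc -c)
echo "header bytes: $size"
if [ "$size" -gt 8192 ]; then
    echo "exceeds the 8k default buffer"
fi
```

In a real investigation you would measure rather than simulate: `curl -sD - -o /dev/null http://<upstream-host>/` against the app server directly (bypassing Nginx), piped through `wc -c`, shows what the Rails app actually sends.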
## The Solution
We needed to explicitly tell Nginx to allocate more memory for reading upstream response headers. These directives are valid in the `http`, `server`, and `location` contexts; we set them in the `location` block of `nginx.conf` that proxies to the Rails upstream:
```nginx
location / {
    proxy_pass http://backend_upstream;

    # Increase buffer sizes so large upstream response headers fit
    proxy_buffer_size 128k;
    proxy_buffers 4 256k;
    proxy_busy_buffers_size 256k;

    # Standard proxy settings
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
}
```
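One thing worth knowing before reloading: Nginx validates the relationship between these three directives at startup. `proxy_busy_buffers_size` must be at least the larger of `proxy_buffer_size` and one `proxy_buffers` buffer, and smaller than the total `proxy_buffers` space minus one buffer. A quick arithmetic sanity check for the values above (sizes in KB):

```shell
# nginx's startup constraint on the buffer directives above:
#   busy >= max(proxy_buffer_size, one buffer)
#   busy <  (total proxy_buffers space - one buffer)
buffer_size=128               # proxy_buffer_size 128k
num=4; each=256               # proxy_buffers 4 256k
busy=256                      # proxy_busy_buffers_size 256k
min=$(( each > buffer_size ? each : buffer_size ))
limit=$(( num * each - each ))
echo "busy=${busy}k, allowed: >= ${min}k and < ${limit}k"
[ "$busy" -ge "$min" ] && [ "$busy" -lt "$limit" ] && echo "valid"
```

Get this wrong and `nginx -t` fails, which is a much better place to find out than a reload during an incident.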
## Secondary Optimization: Postgres Pooling
Once the 502s resolved, we hit the next bottleneck: database connections. The Rails app, now effectively serving requests again, immediately exhausted the PostgreSQL `max_connections` limit.
Instead of increasing `max_connections` (which eats RAM per connection), we deployed PgBouncer as a lightweight connection pooler.
- Before: 100 active connections (Direct to Postgres)
- After: 1000 client connections multiplexed into 20 active Postgres connections
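A minimal `pgbouncer.ini` consistent with that layout might look like the following sketch (the database name, paths, and auth details are placeholders, not our production values):

```ini
[databases]
; "appdb" is a placeholder database name
appdb = host=127.0.0.1 port=5432 dbname=appdb

[pgbouncer]
listen_addr = 127.0.0.1
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
; transaction pooling: a server connection is borrowed per transaction,
; which is what lets ~1000 clients share 20 Postgres backends
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 20
```

The app then connects to port 6432 instead of 5432. One caveat with transaction pooling: server-side prepared statements don't survive across pooled connections, so ActiveRecord needs `prepared_statements: false` in `database.yml`.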
## The Outcome
After the `service nginx reload` and the PgBouncer deployment, the error rate dropped to zero within 30 seconds. The system handled the remaining 4 hours of the launch event with a p95 latency of 120ms.
Lesson: Default configurations are designed for "Hello World," not production scale. Validate your buffer sizes and connection limits before the traffic hits.