Field Note #004
Nginx Postgres Infrastructure

Mitigating 502 Bad Gateway errors during high-load traffic spikes

Dec 12, 2024 Diagnostic Log

09:00 AM Marketing sends the blast.
09:15 AM The load balancer gives up.

The application servers were alive (CPU at 40%), but Nginx was dropping connections. Here is the autopsy of the failure and the configuration changes that fixed it.

The Symptom

We were managing a FinTech application scaling for a Series B launch event. Traffic surged by roughly 300% in 10 minutes. While the Ruby on Rails application servers were handling the throughput, the Nginx reverse proxy sitting in front began throwing sporadic 502 Bad Gateway errors.

The immediate assumption was resource exhaustion. We scaled the droplets vertically. No change. The 502s persisted even as CPU and RAM utilization dropped.

The Investigation

I stopped looking at the graphs and started tailing the Nginx error logs. Amidst the noise, one line repeated every time a 502 occurred:

[error] 1923#0: *451 upstream sent too big header while reading response header from upstream
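Lines like the one above can be pulled out of the log noise and bucketed by minute, which makes it easy to line the spike up against the 502 graph. A minimal sketch, assuming the standard Nginx error-log timestamp format (`2024/12/12 09:15:01 [error] ...`); the function reads log lines on stdin so it composes with `tail`, `cat`, or anything else:

```shell
# Count buffer-exhaustion errors per timestamp-minute, to correlate the
# error-log spike with the 502 graph. Reads error-log lines on stdin, e.g.:
#   big_header_rate < /var/log/nginx/error.log
big_header_rate() {
  grep "upstream sent too big header" \
    | cut -d: -f1-2 \
    | sort \
    | uniq -c
}
```

The `cut -d: -f1-2` keeps everything up to the minute (`2024/12/12 09:15`), so `uniq -c` yields one count per minute.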

The culprit wasn't load; it was buffer size. The application uses large JWTs (JSON Web Tokens) and sets several cookies for tracking user sessions. Under normal load, these headers fit within Nginx’s default buffers.

However, during the high-load event, the application was also appending additional debugging headers and larger session payloads. Nginx’s default `proxy_buffer_size` (one memory page, 4k or 8k depending on the platform) could no longer hold the response headers from the Rails app, so Nginx severed the connection and returned a 502.
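A quick way to check how close your own responses sit to that default is to count the raw header bytes. The headers below are illustrative stand-ins, not the application's real tokens; in practice you would capture real ones with `curl -sS -D - -o /dev/null <url>`:

```shell
# Compare a response's header size against Nginx's default proxy_buffer_size.
# These sample headers are placeholders, not real session payloads.
headers='HTTP/1.1 200 OK
Content-Type: application/json
Set-Cookie: _session_id=placeholder; Path=/; HttpOnly
Authorization: Bearer eyJhbGciOiJIUzI1NiJ9.sample-payload.sample-signature'

bytes=$(printf '%s\n' "$headers" | wc -c)
limit=$((8 * 1024))   # default proxy_buffer_size: one memory page (8k assumed)

if [ "$bytes" -gt "$limit" ]; then
  echo "headers exceed default buffer: $bytes > $limit bytes"
else
  echo "headers fit: $bytes of $limit bytes"
fi
```

If real captured headers land anywhere near the limit, a traffic spike that fattens cookies or debug headers will push them over it.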

The Solution

We needed to explicitly tell Nginx to allocate more memory for reading upstream headers. The `proxy_buffer*` directives are valid at the `http`, `server`, or `location` level; we set them in the site's `location` block.

/etc/nginx/sites-available/default
location / {
    proxy_pass http://backend_upstream;

    # Increase buffer size for headers
    proxy_buffer_size          128k;
    proxy_buffers              4 256k;
    proxy_busy_buffers_size    256k;

    # Standard proxy settings
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
}

Secondary Optimization: Postgres Pooling

Once the 502s were resolved, we hit the next bottleneck: database connections. The Rails app, now serving requests again, immediately exhausted PostgreSQL's `max_connections` limit.

Instead of increasing `max_connections` (which eats RAM per connection), we deployed PgBouncer as a lightweight connection pooler.

  • Before: 100 active connections (Direct to Postgres)
  • After: 1000 client connections multiplexed into 20 active Postgres connections
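A minimal PgBouncer configuration along these lines would look roughly like the sketch below. The database name, ports, and auth file path are illustrative assumptions; only the 1000-client / 20-connection numbers come from the deployment described above.

```ini
; /etc/pgbouncer/pgbouncer.ini (sketch; names and paths are placeholders)
[databases]
app_production = host=127.0.0.1 port=5432 dbname=app_production

[pgbouncer]
listen_addr = 127.0.0.1
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction   ; multiplex many clients over few server connections
max_client_conn = 1000    ; clients PgBouncer will accept
default_pool_size = 20    ; actual Postgres connections per database/user pair
```

One caveat worth noting: `transaction` pooling breaks session-level features, so Rails' prepared statements must be disabled (`prepared_statements: false` in `database.yml`) for this mode to work.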

The Outcome

After a `service nginx reload` and the PgBouncer deployment, the error rate dropped to 0% within 30 seconds. The system handled the remaining 4 hours of the launch event with a p95 latency of 120ms.

Lesson: Default configurations are designed for "Hello World," not production scale. Validate your buffer sizes and connection limits before the traffic hits.

Is your infrastructure brittle?

I help teams diagnose hidden bottlenecks and configuration limits before they become outages.

Request System Audit
End of Log