May 22, 2026

What actually breaks when you scale a Postgres-backed API

backendpostgresdatabase
Postgres scaling cover graphic for erkshitiz.com.np

We doubled our traffic in about six weeks after a marketing push landed better than anyone expected. The API held up fine for the first few days, then started throwing timeouts in bursts, always during the morning peak. Nothing in the app code had changed. The database had just quietly become the bottleneck, in three different ways, one after another.

The first failure was the loudest: requests started failing with sorry, too many clients already. Postgres has a hard max_connections limit, and every app server instance was opening its own pool. With five instances each holding a pool of twenty connections, we were sitting at a hundred connections against a max_connections of 100, before a single admin tool or migration job connected. Any small spike in latency meant requests piled up faster than they drained, and the pool just ran out of room.

The real fix wasn’t raising max_connections. Postgres connections are expensive, each one is a full backend process, and cranking the limit up just delays the same wall while eating more memory. We put pgbouncer in front of the database in transaction pooling mode instead, so the app servers could each keep a generous pool of logical connections while pgbouncer multiplexed them down to a much smaller number of real Postgres backends. That one change alone took us from constant connection errors to zero, with room to add more app servers without touching the database config again.

Once the connection errors stopped, a different endpoint got slow instead, one that listed orders by customer. It had always been “fine,” which in hindsight meant nobody had looked at it since the table was small. I ran it through EXPLAIN ANALYZE:

EXPLAIN ANALYZE
SELECT * FROM orders WHERE customer_id = 4821 ORDER BY created_at DESC;

Seq Scan on orders  (cost=0.00..48213.00 rows=12 width=214) (actual time=312.4..890.1 rows=12 loops=1)
  Filter: (customer_id = 4821)
  Rows Removed by Filter: 3998211
Planning Time: 0.112 ms
Execution Time: 890.4 ms

A sequential scan over four million rows to return twelve of them. The table had grown past the point where a full scan was survivable, and there was no index on customer_id to lean on. The fix was a single statement:

CREATE INDEX CONCURRENTLY idx_orders_customer_id_created_at
  ON orders (customer_id, created_at DESC);

The CONCURRENTLY keyword mattered as much as the index itself. A plain CREATE INDEX takes a lock that blocks writes to the table for however long the build takes, which on a four-million-row table under live traffic would have caused its own outage. With the index in place, that same query dropped from 890ms to under 2ms.

The third one was subtler and took longer to find, an N+1 query pattern buried in the order details page. The handler fetched a list of orders, then looped over them to fetch each order’s line items separately:

orders, _ := db.Query("SELECT * FROM orders WHERE customer_id = $1", custID)
for _, o := range orders {
    items, _ := db.Query("SELECT * FROM line_items WHERE order_id = $1", o.ID)
    o.Items = items
}

For a customer with thirty orders, that’s thirty-one round trips to the database instead of two. It worked fine in testing with a handful of orders and fell apart for the customers who had hundreds. The fix was to batch the second query with ANY:

orders, _ := db.Query("SELECT * FROM orders WHERE customer_id = $1", custID)
ids := orderIDs(orders)
items, _ := db.Query("SELECT * FROM line_items WHERE order_id = ANY($1)", ids)
attachItems(orders, items)

Two queries no matter how many orders the customer has, and the difference showed up immediately on the slowest accounts.

None of these three were exotic. Every one of them is a well-known failure mode with a well-known fix, and none of them showed up in code review because the code was correct, it just didn’t scale. The lesson I took from that week is that “works in testing” and “works under load” are different claims, and the only way to know which one you actually have is to look at what the database is doing under real traffic before it forces you to look.