June 19, 2026

Rate limiting a public API: token bucket vs sliding window

When we opened up our API to third-party clients, the first question wasn’t “how do we rate limit this” but “which kind of rate limiting actually matches how people use it.” Most clients call the API in short bursts: a batch job kicks off, fires twenty requests in a second, then goes quiet for a minute. A naive fixed cap per second would have rejected half of that traffic for no good reason, so we ended up properly comparing two approaches: token bucket and sliding window.

Token bucket works like its name suggests. There is a bucket with a maximum capacity, and it refills at a fixed rate, say ten tokens per second, up to a cap of fifty. Every request costs one token. If the bucket is empty, the request gets rejected or queued. The part that matters is that unused tokens accumulate while a client is idle, up to the cap, so a client that hasn’t called the API in a while can burst up to fifty requests instantly before it’s throttled back down to the steady refill rate of ten per second.
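To make the refill arithmetic concrete, here is a minimal sketch of that bucket using the same numbers (ten tokens per second, capacity fifty). The names TokenBucket, NewTokenBucket, at, and burst are made up for this illustration, the clock is passed in explicitly so the demo is deterministic, and the type is not safe for concurrent use.

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// TokenBucket is an illustrative, single-goroutine token bucket.
// It is not safe for concurrent use.
type TokenBucket struct {
	rate     float64   // tokens added per second
	capacity float64   // maximum tokens held (the burst size)
	tokens   float64   // tokens currently available
	last     time.Time // last time tokens was topped up
}

// NewTokenBucket returns a bucket that starts full at time now.
func NewTokenBucket(rate, capacity float64, now time.Time) *TokenBucket {
	return &TokenBucket{rate: rate, capacity: capacity, tokens: capacity, last: now}
}

// Allow credits the tokens earned since the last top-up, capped at capacity,
// then spends one token if at least one is available.
func (b *TokenBucket) Allow(now time.Time) bool {
	if elapsed := now.Sub(b.last).Seconds(); elapsed > 0 {
		b.tokens = math.Min(b.capacity, b.tokens+elapsed*b.rate)
		b.last = now
	}
	if b.tokens < 1 {
		return false
	}
	b.tokens--
	return true
}

// at returns a fixed instant sec seconds after the Unix epoch, so the demo
// does not depend on the wall clock.
func at(sec int64) time.Time { return time.Unix(sec, 0) }

// burst fires n requests at the same instant and counts how many pass.
func burst(b *TokenBucket, now time.Time, n int) int {
	allowed := 0
	for i := 0; i < n; i++ {
		if b.Allow(now) {
			allowed++
		}
	}
	return allowed
}

func main() {
	b := NewTokenBucket(10, 50, at(0))
	fmt.Println(burst(b, at(0), 60))   // full bucket: 50 of 60 pass
	fmt.Println(burst(b, at(1), 60))   // one second of refill: 10 pass
	fmt.Println(burst(b, at(600), 60)) // a long idle period still caps at 50
}
```

The third call is the point of the design: ten minutes of idleness would have earned 6,000 tokens, but the cap keeps the burst at fifty.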

The Go team maintains almost exactly this in golang.org/x/time/rate (an extension module you add with go get, not part of the standard library), so most of the time you don’t need to hand-roll it:

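import (
	"net/http"

	"golang.org/x/time/rate"
)

// A single limiter shared by every caller: 10 tokens/sec, burst of 50.
// Declared with var because := is not allowed at package level.
var limiter = rate.NewLimiter(rate.Limit(10), 50)

func handler(w http.ResponseWriter, r *http.Request) {
	// Allow spends one token if one is available and never blocks.
	if !limiter.Allow() {
		http.Error(w, "rate limit exceeded", http.StatusTooManyRequests)
		return
	}
	// handle the request
}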
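The handler above uses a single limiter for every caller, so one noisy client would use up everyone’s tokens. A public API usually wants one limiter per client. The following is a rough sketch of that, not the setup we actually run: Registry, Allower, fixedBudget, ok, and status are hypothetical names, and identifying clients by an X-API-Key header is an assumption. The Allower interface is written so that *rate.Limiter satisfies it, and the demo uses a stand-in limiter to stay stdlib-only.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync"
)

// Allower is the one method the middleware needs. *rate.Limiter from
// golang.org/x/time/rate has this method.
type Allower interface{ Allow() bool }

// Registry hands out one limiter per client key.
type Registry struct {
	mu         sync.Mutex
	limiters   map[string]Allower
	newLimiter func() Allower
}

func NewRegistry(newLimiter func() Allower) *Registry {
	return &Registry{limiters: make(map[string]Allower), newLimiter: newLimiter}
}

// For returns the limiter for key, creating it on first use.
func (r *Registry) For(key string) Allower {
	r.mu.Lock()
	defer r.mu.Unlock()
	l, found := r.limiters[key]
	if !found {
		l = r.newLimiter()
		r.limiters[key] = l
	}
	return l
}

// Limit wraps next and answers 429 when the calling client is over its limit.
// Clients are identified by the X-API-Key header (an assumption); requests
// without the header all share the limiter for the empty key.
func (r *Registry) Limit(next http.HandlerFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, req *http.Request) {
		if !r.For(req.Header.Get("X-API-Key")).Allow() {
			http.Error(w, "rate limit exceeded", http.StatusTooManyRequests)
			return
		}
		next(w, req)
	}
}

// fixedBudget is a stand-in limiter for the demo: it allows left more calls.
type fixedBudget struct {
	mu   sync.Mutex
	left int
}

func (f *fixedBudget) Allow() bool {
	f.mu.Lock()
	defer f.mu.Unlock()
	if f.left == 0 {
		return false
	}
	f.left--
	return true
}

func ok(w http.ResponseWriter, _ *http.Request) { fmt.Fprintln(w, "ok") }

// status sends one request as the given client and returns the status code.
func status(h http.Handler, key string) int {
	req := httptest.NewRequest(http.MethodGet, "/", nil)
	req.Header.Set("X-API-Key", key)
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	reg := NewRegistry(func() Allower { return &fixedBudget{left: 2} })
	h := reg.Limit(ok)
	for _, key := range []string{"a", "a", "a", "b"} {
		fmt.Println(key, status(h, key)) // a 200, a 200, a 429, b 200
	}
}
```

With the real package, the factory would be something like func() Allower { return rate.NewLimiter(10, 50) }. The map in this sketch grows without bound, so a long-running service would also need to evict limiters for clients that have gone quiet.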

Sliding window is stricter. Instead of a bucket that can build up unused capacity, it counts how many requests happened in the last N seconds, on a rolling basis, and rejects anything past the cap. There’s no accumulated slack. With a cap of forty requests per thirty seconds, a client that has made forty requests in the last thirty seconds and tries a forty-first gets rejected, full stop, even if it was completely idle before that window started. What you get in return is a more predictable load on your backend, at the cost of punishing legitimate bursty clients that token bucket would have let through.
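A minimal in-memory sketch of that rolling count, using the forty-per-thirty-seconds example, might look like the following. SlidingWindow, NewSlidingWindow, and at are names invented for the illustration. It keeps a log of timestamps for one client, records only accepted requests (a design choice of this sketch), and is not safe for concurrent use.

```go
package main

import (
	"fmt"
	"time"
)

// SlidingWindow is an illustrative sliding-window log for a single client.
// It is not safe for concurrent use.
type SlidingWindow struct {
	limit  int
	window time.Duration
	hits   []time.Time // timestamps of accepted requests, oldest first
}

func NewSlidingWindow(limit int, window time.Duration) *SlidingWindow {
	return &SlidingWindow{limit: limit, window: window}
}

// Allow drops timestamps that have aged out of the window, then accepts the
// request only if fewer than limit accepted requests remain inside it.
func (s *SlidingWindow) Allow(now time.Time) bool {
	cutoff := now.Add(-s.window)
	expired := 0
	for expired < len(s.hits) && !s.hits[expired].After(cutoff) {
		expired++
	}
	s.hits = s.hits[expired:]

	if len(s.hits) >= s.limit {
		return false // rejected requests are not recorded in this sketch
	}
	s.hits = append(s.hits, now)
	return true
}

// at returns a fixed instant sec seconds after the Unix epoch, so the demo
// does not depend on the wall clock.
func at(sec int64) time.Time { return time.Unix(sec, 0) }

func main() {
	s := NewSlidingWindow(40, 30*time.Second)

	allowed := 0
	for i := 0; i < 41; i++ {
		if s.Allow(at(0)) {
			allowed++
		}
	}
	fmt.Println(allowed)         // 40: the forty-first is rejected
	fmt.Println(s.Allow(at(29))) // false: all forty are still in the window
	fmt.Println(s.Allow(at(30))) // true: the hits at t=0 have aged out
}
```

Unlike the bucket, nothing here depends on how long the client was idle beforehand; only the timestamps inside the window matter.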

Which one to use comes down to what you’re protecting. If the resource behind the endpoint can absorb short bursts fine (a database with headroom, a cache-backed read), token bucket gives clients a better experience without meaningfully hurting you. If the endpoint is expensive per call, something that hits a rate-limited upstream vendor API or does real compute, a sliding window keeps your worst-case load bounded and easier to reason about.

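Neither of these works cleanly once you have more than one API instance, though: each instance tracks its own bucket or window, so a client can get roughly N times the limit by spreading requests across N instances. The usual fix is to move the counter into Redis. For a rough token bucket, a Lua script that atomically checks and decrements a key with a TTL gets you close enough for most APIs. For a proper sliding window, a sorted set keyed by client ID works well: on every request, trim anything older than the window, add an entry scored by the current timestamp, and reject if the resulting count is over the cap. The member has to be unique per request (a timestamp plus a request ID, say), because two requests with the same timestamp would otherwise collapse into a single entry and be undercounted. Because the entry is added before counting, rejected requests count toward the window too. Run the commands in a MULTI/EXEC transaction or a Lua script so concurrent requests can’t interleave, and set an expiry so keys for idle clients get cleaned up.

MULTI
ZREMRANGEBYSCORE ratelimit:client123 -inf <now - window>
ZADD ratelimit:client123 <now> <now>:<request-id>
ZCARD ratelimit:client123
EXPIRE ratelimit:client123 <window in seconds>
EXEC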

That gives you an accurate rolling count across every instance, at the cost of a round trip to Redis per request, which is usually a fine trade for a rate limiter sitting in front of expensive work.

If I had to pick a default for a public API without knowing much else about it, I’d reach for token bucket. Real clients are bursty by nature: retries and batch jobs cluster requests together, and a bucket that tolerates that is friendlier without giving up meaningful protection, as long as you size the burst capacity sensibly. Save sliding window for the specific endpoints where a smooth, hard cap actually matters more than a good client experience.