April 17, 2026

Debugging a goroutine leak in production

go, debugging, backend

One of our Go services started climbing in memory a few hours after every deploy. Nothing crashed, nothing errored, it just got slower and slower until we restarted it and the cycle began again. Classic slow leak, and the kind that is easy to miss in staging because nothing there runs long enough to show it.

The first useful signal came from pprof, which is already wired into most of our services behind an internal-only route:

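package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/ handlers on http.DefaultServeMux
)

func main() {
	// Serve pprof on localhost only, so profiles are not reachable from outside.
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()
	// ... rest of the service
}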

A quick look at the goroutine count over time confirmed that this was not a memory leak in the traditional sense but a goroutine leak. Every request handled by one particular endpoint was leaving a goroutine behind, and those goroutines were never reclaimed: the runtime only frees a goroutine once it returns, and these were still alive, just permanently blocked.

The culprit was a fan-out pattern that read from a channel without any way to stop:

func fetchAll(ids []string) []Result {
	results := make(chan Result)

	for _, id := range ids {
		go func(id string) {
			results <- fetchOne(id)
		}(id)
	}

	out := make([]Result, 0, len(ids))
	for i := 0; i < len(ids); i++ {
		out = append(out, <-results)
	}
	return out
}

This looks fine until one of the fetchOne calls hangs, which happened whenever a downstream service stalled instead of returning an error. The receiving loop waits for exactly len(ids) values, so if even one goroutine never sends, the loop blocks on its last receive and fetchAll never returns. The siblings that already sent are fine and exit, but the hung goroutine stays stuck inside fetchOne and the caller stays stuck on the receive, and neither has any way out.
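The behaviour is easy to reproduce in isolation. The sketch below is not our production code: Result and fetchOne are stand-ins, and the id "slow" is a made-up trigger for a call that never returns. It runs the leaky fetchAll in the background and counts how many goroutines are still alive afterwards.

```go
package main

import (
	"runtime"
	"time"
)

// Result and fetchOne are hypothetical stand-ins for the real types.
type Result struct{ ID string }

// fetchOne blocks forever for the id "slow", imitating a downstream call
// that never returns.
func fetchOne(id string) Result {
	if id == "slow" {
		select {} // never returns
	}
	return Result{ID: id}
}

// fetchAll is the leaky version: an unbuffered channel and no way to stop.
func fetchAll(ids []string) []Result {
	results := make(chan Result)

	for _, id := range ids {
		go func(id string) {
			results <- fetchOne(id)
		}(id)
	}

	out := make([]Result, 0, len(ids))
	for i := 0; i < len(ids); i++ {
		out = append(out, <-results) // blocks forever if any fetchOne hangs
	}
	return out
}

// strandedAfter starts fetchAll in the background the given number of times
// and reports how many extra goroutines are still alive shortly afterwards.
func strandedAfter(calls int) int {
	before := runtime.NumGoroutine()
	for i := 0; i < calls; i++ {
		go fetchAll([]string{"ok", "slow"})
	}
	time.Sleep(100 * time.Millisecond) // let the healthy goroutines finish
	return runtime.NumGoroutine() - before
}
```

Each call with one hung id strands two goroutines: the one inside fetchOne and the caller waiting on the receive. That matches a goroutine count that grows with traffic rather than with time.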

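The fix was to give every goroutine a way to bail out: a context with a timeout, passed into fetchOne so the slow call itself can be abandoned, and a select instead of a blind channel send. This only works if fetchOne itself takes the context and gives up when it is cancelled. A select evaluates the value being sent before it picks a case, so calling fetchOne directly inside the send case would still wait on a hung call.

// fetchAll fans out one fetchOne call per id and returns whatever arrived
// before the 3-second deadline.
// Assumes fetchOne accepts a context and returns promptly once it is cancelled.
func fetchAll(ctx context.Context, ids []string) []Result {
	ctx, cancel := context.WithTimeout(ctx, 3*time.Second)
	defer cancel()

	// Buffered so a send can complete even after fetchAll has returned.
	results := make(chan Result, len(ids))

	for _, id := range ids {
		go func(id string) {
			// Run the fetch first; it must honor ctx on its own.
			r := fetchOne(ctx, id)
			select {
			case results <- r:
			case <-ctx.Done():
			}
		}(id)
	}

	out := make([]Result, 0, len(ids))
	for i := 0; i < len(ids); i++ {
		select {
		case r := <-results:
			out = append(out, r)
		case <-ctx.Done():
			return out // deadline hit: return the partial results
		}
	}
	return out
}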

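Buffering the channel also matters here: with a capacity of len(ids), every send completes immediately, so a goroutine that finishes after fetchAll has already returned on ctx.Done() is not left waiting on a receiver that is gone. With an unbuffered channel and a bare send, that late goroutine would block forever, which is the same kind of leak we started with.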
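The effect of the buffer can be seen with a small experiment. This is a sketch with a hypothetical helper, strandedSenders, that starts n goroutines doing a bare send on a channel nobody reads from, then counts how many of them are still alive.

```go
package main

import (
	"runtime"
	"time"
)

// strandedSenders starts n goroutines that each do a bare send on a channel
// with the given capacity. Nothing ever receives. It returns how many of
// those goroutines are still alive shortly afterwards.
func strandedSenders(n, capacity int) int {
	before := runtime.NumGoroutine()
	ch := make(chan int, capacity)

	for i := 0; i < n; i++ {
		go func(v int) {
			ch <- v // blocks forever once the buffer (if any) is full
		}(i)
	}

	time.Sleep(100 * time.Millisecond) // let unblocked senders exit
	return runtime.NumGoroutine() - before
}
```

With capacity 0, all n senders stay blocked. With capacity n, every send fits in the buffer and the goroutines exit, even though the values are never read. The buffered values are simply dropped when the channel becomes unreachable.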

The lesson that stuck with me: any time a goroutine can outlive the function that spawned it, it needs its own exit condition. A channel with no buffer and no cancellation path is an easy way to end up with goroutines that live until the process restarts.