Worker pools are one of those patterns that look trivial in a blog post and become surprisingly difficult in production. The version I keep coming back to is built around three rules: a single owner, explicit cancellation, and a bounded queue. This post walks through the version I actually paste into projects.

The naive version

Most worker pool examples on the internet look like this:

jobs := make(chan Job, 100)
// Spin up a fixed number of workers that drain the channel.
for i := 0; i < 8; i++ {
    go func() {
        for j := range jobs {
            process(j)
        }
    }()
}

This works for the happy path and falls apart everywhere else. There is no way to wait for the workers to finish, no way to stop them early, and an unrecovered panic in process takes the whole program down with it. If the producer stops sending without closing jobs, the workers block in range forever, which is a classic goroutine leak.

The version I use

Here’s the shape I prefer:

import (
    "context"
    "log"
    "sync"
)

// Pool runs submitted jobs with a fixed concurrency limit.
type Pool struct {
    sem chan struct{} // one slot per concurrently running job
    wg  sync.WaitGroup
}

func NewPool(workers int) *Pool {
    return &Pool{sem: make(chan struct{}, workers)}
}

// Submit runs fn in its own goroutine once a slot is free. It blocks
// until a slot opens up or ctx is canceled, whichever happens first.
func (p *Pool) Submit(ctx context.Context, fn func(context.Context)) error {
    select {
    case p.sem <- struct{}{}:
    case <-ctx.Done():
        return ctx.Err()
    }
    p.wg.Add(1)
    go func() {
        defer p.wg.Done()
        defer func() { <-p.sem }() // release the slot even if fn panics
        defer func() {
            if r := recover(); r != nil {
                log.Printf("worker panic: %v", r)
            }
        }()
        fn(ctx)
    }()
    return nil
}

// Wait blocks until every job submitted so far has returned.
func (p *Pool) Wait() { p.wg.Wait() }

A few things worth pointing out.

The semaphore is the queue. I used to use a buffered channel of Job values, but that mixes two concerns: how many workers run concurrently, and how many jobs can be queued. Using a semaphore-style channel makes the concurrency limit explicit and the queue depth zero — Submit blocks until a worker slot is free or the context is canceled.

Submit takes a context. This means callers can give up if the pool is saturated. That’s almost always what you want. A Submit that blocks forever is a leak waiting to happen, and “wait for ten seconds then give up” is hard to retrofit later.

The worker function takes a context too. It’s the same context the caller passed to Submit, which means cancellation propagates from caller → pool → work. This is the part most naive implementations miss.
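To make that concrete, here is roughly what a caller might look like. processBatch, handleJob, and the 30-second budget are made up for illustration; the point is that Submit blocks only until the deadline, and every job sees the same deadline:

func processBatch(ctx context.Context, p *Pool, jobs []Job) error {
    // Overall budget for the batch. Submit stops blocking and returns
    // ctx.Err() once this passes; running jobs see the same deadline.
    ctx, cancel := context.WithTimeout(ctx, 30*time.Second)
    defer cancel()

    for _, job := range jobs {
        job := job // capture for the closure (needed before Go 1.22)
        if err := p.Submit(ctx, func(ctx context.Context) {
            if err := handleJob(ctx, job); err != nil {
                log.Printf("job failed: %v", err)
            }
        }); err != nil {
            return err // gave up waiting for a worker slot
        }
    }
    p.Wait() // waits for everything this pool has in flight
    return nil
}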

Recover is on by default. A panic in one job should not take the whole process down, and because the channel receive is also deferred, it doesn't leak the semaphore slot either. Log it, drop the job, move on.

Pattern: fan-out, fan-in with errors

When you have a list of items to process and want to collect either results or the first error, this is what I write:

func processAll(ctx context.Context, items []Item) error {
    g, ctx := errgroup.WithContext(ctx)
    g.SetLimit(8) // at most 8 goroutines in flight at once
    for _, it := range items {
        it := it // capture the loop variable (needed before Go 1.22)
        g.Go(func() error {
            return processOne(ctx, it)
        })
    }
    // Wait returns the first non-nil error once all goroutines finish.
    return g.Wait()
}

errgroup.WithContext cancels the derived context as soon as any goroutine returns a non-nil error, which means the other workers will see ctx.Done() close and can bail out cleanly. SetLimit is a newer addition to golang.org/x/sync/errgroup, and it saves you from having to wire up the semaphore yourself.
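For that bail-out to happen, processOne has to actually check the context. A minimal sketch, where doWork stands in for whatever the item really needs:

func processOne(ctx context.Context, it Item) error {
    // Cheap early exit: if another item already failed, the derived
    // context is canceled and there is no point starting this one.
    select {
    case <-ctx.Done():
        return ctx.Err()
    default:
    }
    // Pass ctx down so long-running calls can be interrupted mid-flight.
    return doWork(ctx, it)
}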

The deadlock you’ll hit eventually

A subtle deadlock with worker pools: a worker submits a new job to the same pool. If all workers are busy and the queue is full, the worker blocks forever waiting for a slot that will never free up because every slot is held by a worker waiting for a slot.
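In terms of the Pool above, the shape looks roughly like this (subWork is a placeholder, error handling elided):

p.Submit(ctx, func(ctx context.Context) {
    // Every slot is held by a job doing exactly this: waiting for a
    // slot that only frees when some job finishes, and none ever does.
    p.Submit(ctx, func(ctx context.Context) {
        subWork(ctx)
    })
})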

There are two clean fixes:

  • Use two pools — one for top-level work, one for sub-work.
  • Set a deadline on the inner Submit so it gives up rather than deadlocking.

I prefer the first. Once you start needing recursive submission, the workload is usually structured enough that two pools is the more honest answer.
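A rough sketch of the two-pool shape, with made-up names and error handling elided:

topLevel := NewPool(8)
subWork := NewPool(32)

// Top-level jobs run in one pool...
topLevel.Submit(ctx, func(ctx context.Context) {
    // ...and anything they spawn goes to the other, so a parent can
    // never block waiting on a slot that its own pool is holding.
    subWork.Submit(ctx, func(ctx context.Context) {
        processChild(ctx)
    })
})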

What about channels?

The pure CSP-style approach — pass jobs over a channel, range over it, close it to signal shutdown — still has its place. It’s clearer for streaming pipelines where the work is shaped like a conveyor belt. For request/response style work where each call has its own lifetime and context, the semaphore-plus-goroutine version above is simpler and composes better with context.Context.
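For reference, the conveyor-belt shape looks something like this, where produce stands in for whatever feeds the pipeline and is responsible for closing the channel when it runs out of input:

jobs := make(chan Job)
var wg sync.WaitGroup
for i := 0; i < 8; i++ {
    wg.Add(1)
    go func() {
        defer wg.Done()
        for j := range jobs { // exits when jobs is closed and drained
            process(j)
        }
    }()
}
produce(jobs) // sends every job, then close(jobs) to signal shutdown
wg.Wait()     // all workers have exited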

Takeaways

  • Always tie work to a context.Context. No exceptions.
  • Make the concurrency limit a first-class field, not an emergent property of channel buffer sizes.
  • Never assume jobs won’t panic. They will.
  • errgroup is the right answer for “do all of these in parallel and stop on first error.”