When somebody hands me a Postgres database and says “it’s slow,” I work through the same checklist almost every time. None of these are exotic. They’re the boring, high-leverage things that solve maybe 80 percent of the cases I see before anyone has to think hard.

1. EXPLAIN (ANALYZE, BUFFERS) before anything else

Theories about why a query is slow are mostly worthless. Run the query with EXPLAIN (ANALYZE, BUFFERS) and read the output.

EXPLAIN (ANALYZE, BUFFERS)
SELECT u.email, COUNT(o.id)
FROM users u
LEFT JOIN orders o ON o.user_id = u.id
WHERE u.signup_at > now() - interval '30 days'
GROUP BY u.email;

What I look for, in order:

  • Sequential scans on big tables. A Seq Scan over millions of rows for a query that should hit an index is the single most common cause of slowness.
  • Rows estimated vs rows actual. A 1000x mismatch between (rows=...) estimates and (actual rows=...) is the planner being misled, usually by stale statistics or a correlated predicate it can’t model.
  • Buffers: read vs hit. hit means the block was already in shared_buffers; read means it had to be fetched from outside shared_buffers (disk, or the OS page cache). A query pulling thousands of buffers from outside the cache for data that should be hot tells you shared_buffers is too small or the data is colder than you thought.

Read the plan from the bottom up. The deepest node runs first.
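
When the estimate/actual gap is the problem, refreshed statistics are the cheapest experiment. A minimal sketch, assuming the tables from the query above; the extended-statistics object only helps when two columns are correlated in a way per-column stats can’t capture, and the user_id/status pairing here is illustrative, not from the query:

ANALYZE users;
ANALYZE orders;

-- Postgres 10+: teach the planner that two columns move together.
-- The column pairing below is a hypothetical example.
CREATE STATISTICS orders_user_status_stats (dependencies)
ON user_id, status FROM orders;
ANALYZE orders;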

2. Indexes that match the query

Postgres can use a single index for a WHERE clause, an ORDER BY, or both — but only if the index column order matches. The rule of thumb:

  • Equality columns first.
  • Then range columns.
  • Then sort columns.

For this query:

SELECT * FROM events
WHERE tenant_id = $1 AND created_at >= $2
ORDER BY created_at DESC
LIMIT 100;

The right index is (tenant_id, created_at DESC). Not (created_at, tenant_id). Not two separate single-column indexes. The planner can sometimes combine separate indexes with a BitmapAnd, but it’s almost always worse than a composite index designed for the query: a bitmap scan returns rows in physical order, not index order, so the ORDER BY ... LIMIT 100 forces a sort of everything that matched instead of stopping after 100 rows.
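
The DDL, with CONCURRENTLY so the build doesn’t block writes (note it can’t run inside a transaction block); the index name is just my convention:

CREATE INDEX CONCURRENTLY events_tenant_created_idx
ON events (tenant_id, created_at DESC);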

A trick worth knowing: INCLUDE columns (Postgres 11+) can avoid heap lookups entirely if the query only reads a few fields:

CREATE INDEX events_tenant_time_idx
ON events (tenant_id, created_at DESC)
INCLUDE (event_type, payload_size);

Now a query that only needs those two extra fields can be answered from the index alone — an “index-only scan,” visible in the plan as Index Only Scan.
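
To confirm you’re actually getting it, run the query through EXPLAIN and look for Index Only Scan with Heap Fetches: 0 in the ANALYZE output. Index-only scans depend on the visibility map, so a heavily updated table may keep touching the heap until vacuum catches up. The literal values below are placeholders:

EXPLAIN (ANALYZE, BUFFERS)
SELECT event_type, payload_size
FROM events
WHERE tenant_id = 42 AND created_at >= now() - interval '7 days';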

3. VACUUM and dead tuples

Postgres uses MVCC, which means an UPDATE doesn’t overwrite the row — it writes a new version and marks the old one dead. VACUUM reclaims that space. If autovacuum can’t keep up, tables bloat, indexes bloat, and queries slow down even though the live row count looks fine.

Quick check:

SELECT relname,
       n_live_tup,
       n_dead_tup,
       round(100.0 * n_dead_tup / NULLIF(n_live_tup + n_dead_tup, 0), 1) AS pct_dead
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;

Anything over about 20 percent dead is worth investigating. The fix is usually one of: tune autovacuum to run more aggressively on that table, batch-update less aggressively, or accept it and run VACUUM (FULL) during a maintenance window. FULL rewrites the whole table and holds an ACCESS EXCLUSIVE lock for the duration, so don’t run it casually.
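
The per-table tuning looks like this; the numbers are a starting point for a large, update-heavy table, not gospel:

ALTER TABLE events SET (
    autovacuum_vacuum_scale_factor = 0.02,  -- vacuum at ~2% dead rows instead of the 20% default
    autovacuum_vacuum_threshold = 1000
);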

4. work_mem and the spill cliff

work_mem is the amount of memory each query operation (a sort, a hash join, a hash aggregate) can use before it spills to temporary files on disk. The default is 4MB, which is far too low for any analytical workload.

Symptoms of a too-small work_mem:

  • Sort or hash operations show Disk: ... kB in EXPLAIN ANALYZE.
  • The temp_files and temp_bytes counters in pg_stat_database are growing; the query below checks them.
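
Per database, cumulative since the last stats reset:

SELECT datname,
       temp_files,
       pg_size_pretty(temp_bytes) AS temp_written
FROM pg_stat_database
ORDER BY temp_bytes DESC;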

Bumping it to 64MB or 128MB at the session or role level usually makes a big difference for reporting queries. Don’t bump it globally — work_mem is a per-operation limit, not per-connection, so a single query with three sort operations can use up to three times that, multiplied across every concurrent backend.
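
Scoping it down is one statement either way; the reporting role name here is hypothetical:

-- One-off, for the current session:
SET work_mem = '128MB';

-- Persistent, for every connection made by a dedicated reporting role:
ALTER ROLE reporting SET work_mem = '128MB';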

5. Connection count

Postgres is process-per-connection. Two thousand idle connections can cripple a server that handles two hundred fine. The fix is a connection pooler — PgBouncer in transaction-pooling mode is the usual answer.

The signal you’ll see in pg_stat_activity:

SELECT state, count(*)
FROM pg_stat_activity
GROUP BY state;

A pile of idle in transaction rows is especially bad — those sessions hold whatever locks they’ve already taken and pin the transaction horizon, which stops vacuum from removing any dead tuple newer than their snapshot.
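
To find the offenders and see how long their transactions have been open (the 60-character truncation is just for readability):

SELECT pid,
       usename,
       now() - xact_start AS xact_age,
       left(query, 60) AS last_query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
ORDER BY xact_start;

Killing one is SELECT pg_terminate_backend(pid), and setting idle_in_transaction_session_timeout (Postgres 9.6+) keeps the pattern from recurring.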

What I deliberately don’t tune first

  • shared_buffers — the default is rarely what’s making you slow, and changing it requires a restart.
  • Custom planner parameters — enable_seqscan = off and friends are band-aids; the real fix is usually statistics or indexes.
  • Partitioning — useful when the table is genuinely huge, painful otherwise. Not a first move.

The meta-lesson

Most “slow Postgres” tickets I’ve worked have a small handful of root causes: missing or wrong index, stale statistics, table bloat, undersized work_mem, or runaway connection count. Always run the actual plan first. Don’t tune anything you haven’t measured.