I’ve adopted OpenTelemetry on three different services over the last two years, with varying success. This post is the honest version of what I’ve learned: where it pays off, where the friction lives, and what I would do differently next time.
The three signals, ranked by effort vs payoff
OTel covers traces, metrics, and logs. They are not equally easy to adopt and they don’t deliver equal value.
Traces — high payoff, moderate effort. The first time you see a flame graph showing exactly where a slow request spent its 800ms, all the instrumentation work feels worth it. Auto-instrumentation libraries cover most of what you need (HTTP, database, Redis, gRPC) so the upfront cost is mostly: install SDK, configure exporter, deploy. Worth doing first.
Metrics — high payoff, low effort if you already have Prometheus. OTel metrics are a more standardized way of expressing the same things you’d write to Prometheus. If you already have Prometheus working, the gain is mostly that you stop writing labels by hand and you get the metric definitions in a portable format. If you don’t have Prometheus, doing OTel metrics first makes a lot of sense.
Logs — moderate payoff, high effort. OTel logs are the newest of the three signals and the integration story is the messiest. Most languages still recommend keeping your existing structured logger and just letting OTel correlate trace IDs into log records. I would not migrate a working logging stack to OTel logs as a first step.
Setup that actually works
A configuration that I keep coming back to:
- In-process SDK with auto-instrumentation for HTTP/DB/RPC.
- OTLP exporter sending to a local OTel Collector.
- Collector runs as a sidecar or DaemonSet and forwards to whatever backend you actually use.
On the code side, the minimum in Go looks like this:
// Export spans over OTLP/HTTP to a Collector running next to the service.
exp, err := otlptracehttp.New(ctx,
    otlptracehttp.WithEndpoint("localhost:4318"),
    otlptracehttp.WithInsecure(),
)
if err != nil {
    return err
}

tp := sdktrace.NewTracerProvider(
    sdktrace.WithBatcher(exp),
    sdktrace.WithResource(resource.NewWithAttributes(
        semconv.SchemaURL,
        semconv.ServiceName("api-gateway"),
        semconv.ServiceVersion(buildVersion),
    )),
    // Head-sample 10 percent of traces.
    sdktrace.WithSampler(sdktrace.TraceIDRatioBased(0.1)),
)
otel.SetTracerProvider(tp)
otel.SetTextMapPropagator(propagation.TraceContext{})
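With the provider registered, auto-instrumentation is mostly a matter of wrapping handlers and clients. Here is a minimal sketch using the otelhttp contrib package; the route and handler are placeholders, not part of the setup above:

// Wrap the mux so every inbound request gets a server span and picks up
// trace context from incoming headers.
mux := http.NewServeMux()
mux.HandleFunc("/orders", handleOrders) // handleOrders is a hypothetical handler
log.Fatal(http.ListenAndServe(":8080", otelhttp.NewHandler(mux, "api-gateway")))

Outbound calls get the same treatment by swapping your http.Client's transport for otelhttp.NewTransport(http.DefaultTransport), which creates client spans and injects trace context for the next hop.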
A few things this setup gets right that I often see done wrong.
Use a Collector, not direct-to-vendor. Sending OTLP to a local Collector means you can change vendors by editing one config file, instead of redeploying every service. It also gives you a place to do tail sampling, add resource attributes, and drop spans you don’t care about.
Sample at the head, refine at the tail. A 10 percent head sample (TraceIDRatioBased(0.1)) keeps cost manageable. The Collector's tail_sampling processor can then keep 100 percent of the error traces and slow traces that reach it. One caveat: tail sampling can only keep what head sampling let through, so if you truly need every error trace, export everything and do the cutting in the Collector instead. Net result either way: you see the interesting stuff and pay for a fraction of the boring stuff.
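A related detail I glossed over in the snippet above: in a multi-service system you usually want the ratio sampler wrapped in ParentBased, so downstream services honor the sampling decision the edge service already made instead of re-rolling the dice on every hop.

// Respect the caller's sampling decision; only apply the 10 percent
// ratio when this service starts a brand-new trace.
sdktrace.WithSampler(sdktrace.ParentBased(sdktrace.TraceIDRatioBased(0.1)))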
Set service.name and service.version. Almost every dashboard and trace search you’ll ever do filters on these. Forgetting service.version makes it impossible to correlate a regression with a specific release.
What I’d skip
A few things I tried and would not bother with again:
- Manual span creation everywhere. Auto-instrumentation gets you 80 percent of the value. Adding manual spans for every internal function call buries the actually interesting spans in noise. I now only add manual spans around discrete units of work that take more than ~10ms or that have business meaning; there's a sketch of what that looks like after this list.
- Sending traces from short-lived CLI jobs. The export-on-shutdown dance with a low timeout is fragile, and you lose traces on crash. Use logs and metrics for batch jobs.
- Custom resource detection beyond what the SDK gives you. The default detectors cover Kubernetes, EC2, GCP, and Azure metadata. Writing your own detector is rarely worth the maintenance.
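For reference, the kind of deliberate manual span I mean looks roughly like this. The function, tracer name, and attribute are made up for illustration, and the usual otel, trace, and attribute imports are assumed:

func processOrder(ctx context.Context, orderID string) error {
    // A unit of work with business meaning that takes long enough to be
    // worth its own span.
    ctx, span := otel.Tracer("checkout").Start(ctx, "processOrder")
    defer span.End()

    span.SetAttributes(attribute.String("order.id", orderID))
    // ... do the actual work with ctx so child spans nest under this one ...
    return nil
}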
The cost trap
Tracing is shockingly expensive at full sample rate. Vendors charge per ingested span, and a single HTTP request through five services can produce dozens of spans. A service handling 1000 requests per second at full sampling can easily cost more in observability bills than in compute.
A reasonable starting policy:
- 100 percent of error spans (tail-sample on status.code = error).
- 100 percent of spans for requests over a latency threshold (e.g. p99).
- 5–10 percent of everything else.
This gets you the diagnostic value of full sampling at a fraction of the cost.
Correlation between signals
The single most useful thing OTel gets you, if you wire it up right, is trace ID in your logs. When an error happens, you can jump from the alert, to the log line, to the trace, to the database query, in about ten seconds. Without trace correlation you’d be reconstructing it from timestamps for half an hour.
Most logging libraries support this with a small adapter. In Go with slog:
// Pull the active span's IDs out of the request context so log lines
// can be joined to the trace that produced them.
func logAttrsFromCtx(ctx context.Context) []slog.Attr {
    sc := trace.SpanContextFromContext(ctx)
    if !sc.IsValid() {
        return nil
    }
    return []slog.Attr{
        slog.String("trace_id", sc.TraceID().String()),
        slog.String("span_id", sc.SpanID().String()),
    }
}
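And a sketch of how I actually call it; the logger and the message are placeholders:

// Attach trace_id and span_id to this log line so the backend can link
// it straight to the corresponding trace.
logger.LogAttrs(ctx, slog.LevelError, "payment provider timed out",
    logAttrsFromCtx(ctx)...)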
Bottom line
OpenTelemetry is good. It is also more work than the marketing suggests. If I were starting fresh today: traces with auto-instrumentation first, then metrics through the Collector, then add trace-ID correlation to existing logs. Skip OTel logs as a migration target until you have a real reason to. Sample aggressively. Watch the bill.