[{"content":"I\u0026rsquo;ve adopted OpenTelemetry on three different services over the last two years, with varying success. This post is the honest version of what I\u0026rsquo;ve learned: where it pays off, where the friction lives, and what I would do differently next time.\nThe three signals, ranked by effort vs payoff OTel covers traces, metrics, and logs. They are not equally easy to adopt and they don\u0026rsquo;t deliver equal value.\nTraces — high payoff, moderate effort. The first time you see a flame graph showing exactly where a slow request spent its 800ms, all the instrumentation work feels worth it. Auto-instrumentation libraries cover most of what you need (HTTP, database, Redis, gRPC) so the upfront cost is mostly: install SDK, configure exporter, deploy. Worth doing first.\nMetrics — high payoff, low effort if you already have Prometheus. OTel metrics are a more standardized way of expressing the same things you\u0026rsquo;d write to Prometheus. If you already have Prometheus working, the gain is mostly that you stop writing labels by hand and you get the metric definitions in a portable format. If you don\u0026rsquo;t have Prometheus, doing OTel metrics first makes a lot of sense.\nLogs — moderate payoff, high effort. OTel logs are the newest of the three signals and the integration story is the messiest. Most languages still recommend keeping your existing structured logger and just letting OTel correlate trace IDs into log records. I would not migrate a working logging stack to OTel logs as a first step.\nSetup that actually works A configuration that I keep coming back to:\nIn-process SDK with auto-instrumentation for HTTP/DB/RPC. OTLP exporter sending to a local OTel Collector. Collector runs as a sidecar or DaemonSet and forwards to whatever backend you actually use. 
Code-side, in Go, the minimum looks like:\nexp, err := otlptracehttp.New(ctx, otlptracehttp.WithEndpoint(\u0026#34;localhost:4318\u0026#34;), otlptracehttp.WithInsecure(), ) if err != nil { return err } tp := sdktrace.NewTracerProvider( sdktrace.WithBatcher(exp), sdktrace.WithResource(resource.NewWithAttributes( semconv.SchemaURL, semconv.ServiceName(\u0026#34;api-gateway\u0026#34;), semconv.ServiceVersion(buildVersion), )), sdktrace.WithSampler(sdktrace.TraceIDRatioBased(0.1)), ) otel.SetTracerProvider(tp) otel.SetTextMapPropagator(propagation.TraceContext{}) A few things this gets right that I see done wrong a lot.\nUse a Collector, not direct-to-vendor. Sending OTLP to a local Collector means you can change vendors by editing one config file, instead of redeploying every service. It also gives you a place to do tail sampling, add resource attributes, and drop spans you don\u0026rsquo;t care about.\nSample at the head, refine at the tail. A 10 percent head sample (TraceIDRatioBased(0.1)) keeps cost manageable. The Collector\u0026rsquo;s tail_sampling processor can then keep every error trace and slow trace that reaches it. One caveat: the tail sampler only sees traces that survived the head sample, so if you need every error you have to send 100 percent from the SDK and let the Collector make the whole decision. Net result: you keep the interesting traces the Collector sees and pay for a fraction of the boring ones.\nSet service.name and service.version. Almost every dashboard and trace search you\u0026rsquo;ll ever do filters on these. Forgetting service.version makes it much harder to correlate a regression with a specific release.\nWhat I\u0026rsquo;d skip A few things I tried and would not bother with again:\nManual span creation everywhere. Auto-instrumentation gets you 80 percent of the value. Adding manual spans for every internal function call buries the actually interesting spans in noise. I now only add manual spans around discrete units of work that take more than ~10ms or that have business meaning. Sending traces from short-lived CLI jobs. The export-on-shutdown dance with a low timeout is fragile, and you lose traces on crash. Use logs and metrics for batch jobs. 
Custom resource detection beyond what the SDK gives you. The default detectors cover Kubernetes, EC2, GCP, and Azure metadata. Writing your own detector is rarely worth the maintenance. The cost trap Tracing is shockingly expensive at full sample rate. Vendors charge per ingested span, and a single HTTP request through five services can produce dozens of spans. A service handling 1000 requests per second at full sampling can easily cost more in observability bills than in compute.\nA reasonable starting policy:\n100 percent of error spans (tail-sample on status.code = error). 100 percent of spans for requests over a latency threshold (e.g. p99). 5–10 percent of everything else. This gets you the diagnostic value of full sampling at a fraction of the cost.\nCorrelation between signals The single most useful thing OTel gets you, if you wire it up right, is trace ID in your logs. When an error happens, you can jump from the alert, to the log line, to the trace, to the database query, in about ten seconds. Without trace correlation you\u0026rsquo;d be reconstructing it from timestamps for half an hour.\nMost logging libraries support this with a small adapter. In Go with slog:\nfunc logAttrsFromCtx(ctx context.Context) []slog.Attr { sc := trace.SpanContextFromContext(ctx) if !sc.IsValid() { return nil } return []slog.Attr{ slog.String(\u0026#34;trace_id\u0026#34;, sc.TraceID().String()), slog.String(\u0026#34;span_id\u0026#34;, sc.SpanID().String()), } } Bottom line OpenTelemetry is good. It is also more work than the marketing suggests. If I were starting fresh today: traces with auto-instrumentation first, then metrics through the Collector, then add trace-ID correlation to existing logs. Skip OTel logs as a migration target until you have a real reason to. Sample aggressively. 
Watch the bill.\n","permalink":"https://oypron.com/posts/observability-with-opentelemetry/","summary":"\u003cp\u003eI\u0026rsquo;ve adopted OpenTelemetry on three different services over the last two years, with varying success. This post is the honest version of what I\u0026rsquo;ve learned: where it pays off, where the friction lives, and what I would do differently next time.\u003c/p\u003e\n\u003ch2 id=\"the-three-signals-ranked-by-effort-vs-payoff\"\u003eThe three signals, ranked by effort vs payoff\u003c/h2\u003e\n\u003cp\u003eOTel covers traces, metrics, and logs. They are not equally easy to adopt and they don\u0026rsquo;t deliver equal value.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eTraces — high payoff, moderate effort.\u003c/strong\u003e The first time you see a flame graph showing exactly where a slow request spent its 800ms, all the instrumentation work feels worth it. Auto-instrumentation libraries cover most of what you need (HTTP, database, Redis, gRPC) so the upfront cost is mostly: install SDK, configure exporter, deploy. Worth doing first.\u003c/p\u003e","title":"Observability with OpenTelemetry: What's Worth It and What Isn't"},{"content":"When somebody hands me a Postgres database and says \u0026ldquo;it\u0026rsquo;s slow,\u0026rdquo; I work through the same checklist almost every time. None of these are exotic. They\u0026rsquo;re the boring, high-leverage things that solve maybe 80 percent of the cases I see before anyone has to think hard.\n1. EXPLAIN (ANALYZE, BUFFERS) before anything else Theories about why a query is slow are mostly worthless. Run the query with EXPLAIN (ANALYZE, BUFFERS) and read the output.\nEXPLAIN (ANALYZE, BUFFERS) SELECT u.email, COUNT(o.id) FROM users u LEFT JOIN orders o ON o.user_id = u.id WHERE u.signup_at \u0026gt; now() - interval \u0026#39;30 days\u0026#39; GROUP BY u.email; What I look for, in order:\nSequential scans on big tables. 
A Seq Scan over millions of rows for a query that should hit an index is the most common single cause of slowness. Rows estimated vs rows actual. A 1000x mismatch between (rows=...) estimates and (actual rows=...) is the planner being misled, usually by stale statistics or a correlated predicate it can\u0026rsquo;t model. Buffers: read vs hit. hit means the page was found in shared_buffers; read means it had to be fetched from outside it (the OS page cache or disk). A query reading thousands of buffers that should already be in memory tells you the buffer cache is too small or the data is colder than you thought. Read the plan from the inside out. The most deeply indented nodes run first and feed their rows to their parents.\n2. Indexes that match the query Postgres can use a single index for a WHERE clause, an ORDER BY, or both — but only if the index column order matches. The rule of thumb:\nEquality columns first. Then range columns. Then sort columns. For this query:\nSELECT * FROM events WHERE tenant_id = $1 AND created_at \u0026gt;= $2 ORDER BY created_at DESC LIMIT 100; The right index is (tenant_id, created_at DESC). Not (created_at, tenant_id). Not two separate single-column indexes. The planner can sometimes combine indexes with a BitmapAnd, but it\u0026rsquo;s almost always worse than a composite index designed for the query.\nA trick worth knowing: include columns can avoid table lookups entirely if the query only reads a few fields:\nCREATE INDEX events_tenant_time_idx ON events (tenant_id, created_at DESC) INCLUDE (event_type, payload_size); Now a query that only needs those two extra fields can answer from the index alone — an \u0026ldquo;index-only scan.\u0026rdquo; It shows up in the plan as Index Only Scan, and it only skips the table when the visibility map is current, which is one more reason to keep vacuum healthy.\n3. VACUUM and dead tuples Postgres uses MVCC, which means an UPDATE doesn\u0026rsquo;t overwrite the row — it writes a new version and marks the old one dead. VACUUM reclaims that space. 
If autovacuum can\u0026rsquo;t keep up, tables bloat, indexes bloat, and queries slow down even though the live row count looks fine.\nQuick check:\nSELECT relname, n_live_tup, n_dead_tup, round(100.0 * n_dead_tup / NULLIF(n_live_tup + n_dead_tup, 0), 1) AS pct_dead FROM pg_stat_user_tables ORDER BY n_dead_tup DESC LIMIT 10; Anything over about 20 percent dead is worth investigating. The fix is usually one of: tune autovacuum to run more aggressively on that table, batch-update less aggressively, or accept it and run VACUUM (FULL) during a maintenance window. FULL rewrites the whole table and takes an exclusive lock, so don\u0026rsquo;t run it casually.\n4. work_mem and the spill cliff work_mem is the amount of memory each query operation can use before it spills to temporary files on disk. Default is 4MB on most installations, which is far too low for any analytical workload.\nSymptoms of a too-small work_mem:\nSort or hash operations show Disk: ... kB in EXPLAIN ANALYZE. The temp_files and temp_bytes counters in pg_stat_database are growing. Bumping it to 64MB or 128MB at the session or role level usually makes a big difference for reporting queries. Don\u0026rsquo;t bump it globally — work_mem is per operation per query, so a connection running a query with three sort operations uses three times that.\n5. Connection count Postgres is process-per-connection. Two thousand idle connections can cripple a server that handles two hundred fine. The fix is a connection pooler — PgBouncer in transaction-pooling mode is the usual answer.\nThe signal you\u0026rsquo;ll see in pg_stat_activity:\nSELECT state, count(*) FROM pg_stat_activity GROUP BY state; A pile of idle in transaction rows is especially bad — those connections are holding row locks and preventing vacuum from cleaning up dead tuples behind them.\nWhat I deliberately don\u0026rsquo;t tune first shared_buffers — defaults are sane, and changing it requires a restart. 
Custom planner parameters — enable_seqscan = off and friends are bandaids; the real fix is usually statistics or indexes. Partitioning — useful when the table is genuinely huge, painful otherwise. Not a first move. The meta-lesson Most \u0026ldquo;slow Postgres\u0026rdquo; tickets I\u0026rsquo;ve worked have a small handful of root causes: missing or wrong index, stale statistics, table bloat, undersized work_mem, or runaway connection count. Always run the actual plan first. Don\u0026rsquo;t tune anything you haven\u0026rsquo;t measured.\n","permalink":"https://oypron.com/posts/postgres-performance-tuning/","summary":"\u003cp\u003eWhen somebody hands me a Postgres database and says \u0026ldquo;it\u0026rsquo;s slow,\u0026rdquo; I work through the same checklist almost every time. None of these are exotic. They\u0026rsquo;re the boring, high-leverage things that solve maybe 80 percent of the cases I see before anyone has to think hard.\u003c/p\u003e\n\u003ch2 id=\"1-explain-analyze-buffers-before-anything-else\"\u003e1. EXPLAIN (ANALYZE, BUFFERS) before anything else\u003c/h2\u003e\n\u003cp\u003eTheories about why a query is slow are mostly worthless. Run the query with \u003ccode\u003eEXPLAIN (ANALYZE, BUFFERS)\u003c/code\u003e and read the output.\u003c/p\u003e","title":"Postgres Performance Tuning: The First Five Things I Check"},{"content":"The first time I tried to enforce NetworkPolicy in a real cluster, I broke DNS for an entire namespace and spent the next forty minutes wondering why every pod was returning i/o timeout. This post is the guide I wish I had read first.\nThe mental model A NetworkPolicy is a label-selector-driven firewall rule. It only does two things:\nSelects pods by label (in a single namespace). Specifies allowed ingress, egress, or both for those pods. Three rules that took me too long to internalize:\nPolicies are additive within a direction. If pod X is selected by two policies, the allowed ingress is the union. 
Policies are subtractive across directions. Once any policy with policyTypes: [Ingress] selects a pod, all ingress that no policy explicitly allows is denied. Same for egress. Policies are namespace-scoped. A policy in namespace app does not affect pods in namespace db. Cross-namespace rules use namespaceSelector. The implication is important: there is no global \u0026ldquo;default-deny\u0026rdquo; switch. You build default-deny by writing a policy that selects every pod and allows nothing.\nDefault-deny: the starting point apiVersion: networking.k8s.io/v1 kind: NetworkPolicy metadata: name: default-deny-all namespace: app spec: podSelector: {} policyTypes: - Ingress - Egress podSelector: {} matches every pod in the namespace. policyTypes lists both directions but with no ingress or egress rules below — meaning \u0026ldquo;select all pods, allow nothing.\u0026rdquo; This is the foundation everything else gets layered on top of.\nThe trap: this immediately breaks DNS, because every pod\u0026rsquo;s egress to kube-dns is now denied. You will not realize this until something tries to resolve a hostname.\nAllow DNS — every time apiVersion: networking.k8s.io/v1 kind: NetworkPolicy metadata: name: allow-dns namespace: app spec: podSelector: {} policyTypes: - Egress egress: - to: - namespaceSelector: matchLabels: kubernetes.io/metadata.name: kube-system podSelector: matchLabels: k8s-app: kube-dns ports: - protocol: UDP port: 53 - protocol: TCP port: 53 A few things to notice. The to block uses both namespaceSelector and podSelector under a single list item — that\u0026rsquo;s an AND. If you put them in two separate list items, it becomes an OR, which is almost never what you want. This is the single most common NetworkPolicy bug I see in PRs.\nThe kubernetes.io/metadata.name label is automatic on every namespace from Kubernetes 1.21 onward (the feature went GA in 1.22). 
Before that you had to label namespaces yourself.\nAllow ingress from a specific service Suppose your api deployment should only accept connections from the frontend deployment in the same namespace:\napiVersion: networking.k8s.io/v1 kind: NetworkPolicy metadata: name: api-allow-frontend namespace: app spec: podSelector: matchLabels: app: api policyTypes: - Ingress ingress: - from: - podSelector: matchLabels: app: frontend ports: - protocol: TCP port: 8080 If frontend lived in another namespace, you\u0026rsquo;d add a namespaceSelector next to the podSelector (still under the same list item — same AND rule).\nEgress to external services This is where it gets awkward. NetworkPolicy egress can target IP CIDRs, but DNS still resolves to whatever IP the upstream wants today. Two patterns work in practice:\nAllow egress to a known CIDR. Works for cloud provider services with documented ranges (S3, RDS, etc.). Run an egress proxy. All outbound traffic to the internet goes through a known set of pods, and the policy only needs to allow egress to those pods. The proxy handles the dynamic-DNS problem. Don\u0026rsquo;t try to allow-list outbound DNS names directly in NetworkPolicy. Some CNI plugins (Cilium) support FQDN-based policies, but that\u0026rsquo;s a CNI-specific extension, not core Kubernetes.\nDebugging when it doesn\u0026rsquo;t work When traffic is being dropped and you don\u0026rsquo;t know why, the order I check things in:\nIs the CNI actually enforcing NetworkPolicy? Calico, Cilium, and AWS VPC CNI with policy enforcement enabled all do. Some older or simpler CNIs do not. Run kubectl describe networkpolicy on the target namespace. Look at which pods are selected and which rules apply. From inside the source pod, try nc -zv \u0026lt;target\u0026gt; \u0026lt;port\u0026gt;. A timeout strongly suggests NetworkPolicy. Connection refused suggests the service is wrong, not the policy. Temporarily add a wide-open allow-all policy and see if the problem clears. 
If it does, the issue is policy. If it doesn\u0026rsquo;t, look at Service, Endpoints, and CNI logs. The lesson NetworkPolicy is one of those features that\u0026rsquo;s easy to demo and unforgiving in production. The upside is that once you have a clean default-deny posture in every namespace, lateral movement risk drops dramatically and reviewing access becomes a matter of reading a few YAML files.\nStart with default-deny in a non-critical namespace. Add the DNS allowlist immediately. Layer in service-specific rules. Don\u0026rsquo;t try to retrofit it across an entire cluster in one PR.\n","permalink":"https://oypron.com/posts/kubernetes-network-policies/","summary":"\u003cp\u003eThe first time I tried to enforce NetworkPolicy in a real cluster, I broke DNS for an entire namespace and spent the next forty minutes wondering why every pod was returning \u003ccode\u003ei/o timeout\u003c/code\u003e. This post is the guide I wish I had read first.\u003c/p\u003e\n\u003ch2 id=\"the-mental-model\"\u003eThe mental model\u003c/h2\u003e\n\u003cp\u003eA NetworkPolicy is a label-selector-driven firewall rule. It only does two things:\u003c/p\u003e\n\u003col\u003e\n\u003cli\u003e\u003cstrong\u003eSelects pods\u003c/strong\u003e by label (in a single namespace).\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eSpecifies allowed ingress, egress, or both\u003c/strong\u003e for those pods.\u003c/li\u003e\n\u003c/ol\u003e\n\u003cp\u003eThree rules that took me too long to internalize:\u003c/p\u003e","title":"Kubernetes NetworkPolicy: A Practical Guide to Default-Deny"},{"content":"Worker pools are one of those patterns that looks trivial in a blog post and becomes surprisingly difficult in production. The version I keep coming back to is built around three rules: a single owner, explicit cancellation, and a bounded queue. 
This post walks through the version I actually paste into projects.\nThe naive version Most worker pool examples on the internet look like this:\njobs := make(chan Job, 100) for i := 0; i \u0026lt; 8; i++ { go func() { for j := range jobs { process(j) } }() } This works for the happy path and falls apart everywhere else. There\u0026rsquo;s no way to wait for the workers to finish, no way to stop them early, and no way to know if process panicked. If the producer crashes, the workers stay alive forever.\nThe version I use Here\u0026rsquo;s the shape I prefer:\ntype Pool struct { sem chan struct{} wg sync.WaitGroup } func NewPool(workers int) *Pool { return \u0026amp;Pool{sem: make(chan struct{}, workers)} } func (p *Pool) Submit(ctx context.Context, fn func(context.Context)) error { select { case p.sem \u0026lt;- struct{}{}: case \u0026lt;-ctx.Done(): return ctx.Err() } p.wg.Add(1) go func() { defer p.wg.Done() defer func() { \u0026lt;-p.sem }() defer func() { if r := recover(); r != nil { log.Printf(\u0026#34;worker panic: %v\u0026#34;, r) } }() fn(ctx) }() return nil } func (p *Pool) Wait() { p.wg.Wait() } A few things worth pointing out.\nThe semaphore is the queue. I used to use a buffered channel of Job values, but that mixes two concerns: how many workers run concurrently, and how many jobs can be queued. Using a semaphore-style channel makes the concurrency limit explicit and the queue depth zero — Submit blocks until a worker slot is free or the context is canceled.\nSubmit takes a context. This means callers can give up if the pool is saturated. That\u0026rsquo;s almost always what you want. A Submit that blocks forever is a leak waiting to happen, and \u0026ldquo;wait for ten seconds then give up\u0026rdquo; is hard to retrofit later.\nThe worker function takes a context too. It\u0026rsquo;s the same context the caller passed to Submit, which means cancellation propagates from caller → pool → work. 
This is the part most naive implementations miss.\nRecover is on by default. A panic in one job should not leak its slot or take down the whole process. Log it, drop the job, move on.\nPattern: fan-out, fan-in with errors When you have a list of items to process and want to stop at the first error, this is what I write:\nfunc processAll(ctx context.Context, items []Item) error { g, ctx := errgroup.WithContext(ctx) g.SetLimit(8) for _, it := range items { it := it g.Go(func() error { return processOne(ctx, it) }) } return g.Wait() } errgroup.WithContext cancels the derived context as soon as any goroutine returns an error, which means the other workers will see ctx.Done() close and can bail out cleanly. SetLimit is part of golang.org/x/sync/errgroup rather than the standard library, so its availability depends on your x/sync version, not your Go version. It saves you from having to wire up the semaphore yourself.\nThe deadlock you\u0026rsquo;ll hit eventually A subtle deadlock with worker pools: a worker submits a new job to the same pool. If every slot is taken, the worker blocks forever waiting for a slot that will never free up because every slot is held by a worker waiting for a slot.\nThere are two clean fixes:\nUse two pools — one for top-level work, one for sub-work. Set a deadline on the inner Submit so it gives up rather than deadlocking. I prefer the first. Once you start needing recursive submission, the workload is usually structured enough that two pools is the more honest answer.\nWhat about channels? The pure CSP-style approach — pass jobs over a channel, range over it, close it to signal shutdown — still has its place. It\u0026rsquo;s clearer for streaming pipelines where the work is shaped like a conveyor belt. For request/response style work where each call has its own lifetime and context, the semaphore-plus-goroutine version above is simpler and composes better with context.Context.\nTakeaways Always tie work to a context.Context. No exceptions. 
Make the concurrency limit a first-class field, not an emergent property of channel buffer sizes. Never assume jobs won\u0026rsquo;t panic. They will. errgroup is the right answer for \u0026ldquo;do all of these in parallel and stop on first error.\u0026rdquo; ","permalink":"https://oypron.com/posts/golang-concurrency-patterns/","summary":"\u003cp\u003eWorker pools are one of those patterns that looks trivial in a blog post and becomes surprisingly difficult in production. The version I keep coming back to is built around three rules: a single owner, explicit cancellation, and a bounded queue. This post walks through the version I actually paste into projects.\u003c/p\u003e\n\u003ch2 id=\"the-naive-version\"\u003eThe naive version\u003c/h2\u003e\n\u003cp\u003eMost worker pool examples on the internet look like this:\u003c/p\u003e\n\u003cdiv class=\"highlight\"\u003e\u003cpre tabindex=\"0\" style=\"color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;\"\u003e\u003ccode class=\"language-go\" data-lang=\"go\"\u003e\u003cspan style=\"display:flex;\"\u003e\u003cspan\u003e\u003cspan style=\"color:#a6e22e\"\u003ejobs\u003c/span\u003e \u003cspan style=\"color:#f92672\"\u003e:=\u003c/span\u003e make(\u003cspan style=\"color:#66d9ef\"\u003echan\u003c/span\u003e \u003cspan style=\"color:#a6e22e\"\u003eJob\u003c/span\u003e, \u003cspan style=\"color:#ae81ff\"\u003e100\u003c/span\u003e)\n\u003c/span\u003e\u003c/span\u003e\u003cspan style=\"display:flex;\"\u003e\u003cspan\u003e\u003cspan style=\"color:#66d9ef\"\u003efor\u003c/span\u003e \u003cspan style=\"color:#a6e22e\"\u003ei\u003c/span\u003e \u003cspan style=\"color:#f92672\"\u003e:=\u003c/span\u003e \u003cspan style=\"color:#ae81ff\"\u003e0\u003c/span\u003e; \u003cspan style=\"color:#a6e22e\"\u003ei\u003c/span\u003e \u0026lt; \u003cspan style=\"color:#ae81ff\"\u003e8\u003c/span\u003e; \u003cspan style=\"color:#a6e22e\"\u003ei\u003c/span\u003e\u003cspan 
style=\"color:#f92672\"\u003e++\u003c/span\u003e {\n\u003c/span\u003e\u003c/span\u003e\u003cspan style=\"display:flex;\"\u003e\u003cspan\u003e    \u003cspan style=\"color:#66d9ef\"\u003ego\u003c/span\u003e \u003cspan style=\"color:#66d9ef\"\u003efunc\u003c/span\u003e() {\n\u003c/span\u003e\u003c/span\u003e\u003cspan style=\"display:flex;\"\u003e\u003cspan\u003e        \u003cspan style=\"color:#66d9ef\"\u003efor\u003c/span\u003e \u003cspan style=\"color:#a6e22e\"\u003ej\u003c/span\u003e \u003cspan style=\"color:#f92672\"\u003e:=\u003c/span\u003e \u003cspan style=\"color:#66d9ef\"\u003erange\u003c/span\u003e \u003cspan style=\"color:#a6e22e\"\u003ejobs\u003c/span\u003e {\n\u003c/span\u003e\u003c/span\u003e\u003cspan style=\"display:flex;\"\u003e\u003cspan\u003e            \u003cspan style=\"color:#a6e22e\"\u003eprocess\u003c/span\u003e(\u003cspan style=\"color:#a6e22e\"\u003ej\u003c/span\u003e)\n\u003c/span\u003e\u003c/span\u003e\u003cspan style=\"display:flex;\"\u003e\u003cspan\u003e        }\n\u003c/span\u003e\u003c/span\u003e\u003cspan style=\"display:flex;\"\u003e\u003cspan\u003e    }()\n\u003c/span\u003e\u003c/span\u003e\u003cspan style=\"display:flex;\"\u003e\u003cspan\u003e}\n\u003c/span\u003e\u003c/span\u003e\u003c/code\u003e\u003c/pre\u003e\u003c/div\u003e\u003cp\u003eThis works for the happy path and falls apart everywhere else. There\u0026rsquo;s no way to wait for the workers to finish, no way to stop them early, and no way to know if \u003ccode\u003eprocess\u003c/code\u003e panicked. If the producer crashes, the workers stay alive forever.\u003c/p\u003e","title":"A Worker Pool That Actually Shuts Down: Go Concurrency Patterns Revisited"},{"content":"After almost a decade of writing code professionally, I have built up a fairly large pile of half-finished notes, throwaway gists, and Slack messages to my future self. Most of them never see daylight again. 
This blog is an attempt to fix that.\nWhy now Earlier this year I rebuilt a service that had been quietly accumulating tech debt for three years. By the time I was done I had read through thousands of lines of code I had written years ago and barely recognized any of it. There were comments like // TODO: figure this out later next to a perfectly reasonable explanation that I clearly knew at the time but no longer remembered. I wished, more than once, that I had written things down somewhere I would actually be able to find them.\nInternal documentation helps, but only up to a point. It tends to be terse, decision-driven, and assumes a lot of context. What I missed was the messy in-between layer: the small lessons that don\u0026rsquo;t deserve a full design doc but are too important to forget. That\u0026rsquo;s roughly what I want this blog to be.\nWhat I plan to write about I work mostly on backend systems — stateful services, queues, databases, the boring infrastructure that holds web applications together. Topics that I expect to come back to:\nConcurrency in Go. Most of my services are written in Go, and most of the production incidents I have ever debugged eventually traced back to a misunderstood goroutine, channel, or context. Postgres internals. Indexing strategy, query plans, vacuum behavior, and the cost of doing things \u0026ldquo;the obvious way\u0026rdquo; at scale. Kubernetes networking. Specifically the parts that nobody enjoys: NetworkPolicy, CNI debugging, DNS, and why your pod cannot reach the service three feet away from it. Observability. Metrics, traces, structured logs, and the trade-offs between them. I have opinions. Distributed systems hygiene. Idempotency, retries, deadlines, and the boring work of making services that don\u0026rsquo;t melt under load. I\u0026rsquo;ll keep posts focused. 
If a topic runs long, I\u0026rsquo;d rather split it into two pieces than dump everything into one wall of text.\nHow I\u0026rsquo;ll write A few rules I\u0026rsquo;m going to try to hold myself to:\nShow real code. Pseudocode is fine for diagrams; for everything else I\u0026rsquo;d rather paste a working snippet, even if it\u0026rsquo;s a bit longer. I learn best from code I can copy. Be honest about trade-offs. Almost every interesting engineering decision is a trade-off. I\u0026rsquo;ll try to spell out what I gave up to get the thing I wanted. Keep it short when I can. A 600-word post that someone actually finishes is more useful than a 3000-word post that gets tabbed away after the introduction. Update old posts. If I learn that something I wrote was wrong or has aged badly, I\u0026rsquo;ll go back and add a note. I\u0026rsquo;d rather have a correct blog than a pristine one. A note on stack This blog is a Hugo site using the PaperMod theme. Static HTML, no comments, no analytics beyond the basic access log. RSS is enabled if you\u0026rsquo;d rather read it that way. The whole thing fits in a Git repo, which suits me — I think of it as a kind of long-form git log for the things I\u0026rsquo;ve learned.\nStatic-site generators are a good fit for this kind of writing. The output is just files; nothing to patch, nothing to break, no database to migrate when I want to move it somewhere else.\nWhat\u0026rsquo;s next Next post is about a Go concurrency pattern I keep reaching for — a worker pool with a context-aware shutdown. After that, probably Kubernetes NetworkPolicy, because I just spent a weekend untangling a particularly frustrating one and the lesson is fresh.\nThanks for reading. 
If you stumbled in here from a search engine and the post was useful, that\u0026rsquo;s all the validation I need.\n","permalink":"https://oypron.com/posts/hello-world/","summary":"\u003cp\u003eAfter almost a decade of writing code professionally, I have built up a fairly large pile of half-finished notes, throwaway gists, and Slack messages to my future self. Most of them never see daylight again. This blog is an attempt to fix that.\u003c/p\u003e\n\u003ch2 id=\"why-now\"\u003eWhy now\u003c/h2\u003e\n\u003cp\u003eEarlier this year I rebuilt a service that had been quietly accumulating tech debt for three years. By the time I was done I had read through thousands of lines of code I had written years ago and barely recognized any of it. There were comments like \u003ccode\u003e// TODO: figure this out later\u003c/code\u003e next to a perfectly reasonable explanation that I clearly knew at the time but no longer remembered. I wished, more than once, that I had written things down somewhere I would actually be able to find them.\u003c/p\u003e","title":"Hello World — Why I'm Starting This Blog"}]