Sahil Khundiya | Backend & AI Infrastructure Engineer

The Phantom Delay

In a distributed system powered by Kafka, a single request might traverse 5 microservices across 3 different event topics. When a user experiences a 2-second lag, where is it happening?

Is it the Producer?
The Broker?
The Consumer thread pool?
The Database lock?

The Solution: OpenTelemetry & Trace Propagation

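We moved beyond standard ELK logging to distributed tracing. By injecting the trace context (a W3C traceparent header, which carries the trace_id) into the Kafka record headers, every consumer can continue the producer's trace, so logs and spans can be correlated across the entire journey.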

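// Propagating W3C trace context (traceparent) in Kafka headers via the configured OpenTelemetry propagator
// Requires: io.opentelemetry.api.GlobalOpenTelemetry, io.opentelemetry.context.Context, java.nio.charset.StandardCharsets
ProducerRecord<String, String> record = new ProducerRecord<>("topic", "message");
GlobalOpenTelemetry.getPropagators().getTextMapPropagator().inject(
    Context.current(), record.headers(),
    (headers, key, value) -> headers.add(key, value.getBytes(StandardCharsets.UTF_8)));
producer.send(record);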
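
On the consuming side, the traceparent header has to be read back and validated before the trace can be continued. The following is a minimal sketch of that step, assuming the header value follows the W3C Trace Context version 00 layout (version, 32-hex trace id, 16-hex parent span id, 2-hex flags). The TraceParent class and its fields are hypothetical names used for illustration; in a real service the OpenTelemetry propagator's extract step would normally do this work.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: parses a W3C traceparent value read from a Kafka record header. */
public final class TraceParent {
    // Version 00 layout: "00-<32 hex trace-id>-<16 hex parent-id>-<2 hex flags>"
    private static final Pattern FORMAT =
        Pattern.compile("^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$");

    public final String traceId;
    public final String parentSpanId;
    public final boolean sampled;

    private TraceParent(String traceId, String parentSpanId, boolean sampled) {
        this.traceId = traceId;
        this.parentSpanId = parentSpanId;
        this.sampled = sampled;
    }

    /**
     * Returns the parsed context, or null when the header is missing or malformed,
     * so the caller can fall back to starting a fresh trace.
     */
    public static TraceParent parse(byte[] headerValue) {
        if (headerValue == null) {
            return null;
        }
        Matcher m = FORMAT.matcher(new String(headerValue, StandardCharsets.UTF_8));
        if (!m.matches()) {
            return null;
        }
        String traceId = m.group(1);
        String spanId = m.group(2);
        // The spec treats all-zero trace ids and span ids as invalid.
        if (traceId.matches("0+") || spanId.matches("0+")) {
            return null;
        }
        // Bit 0 of the flags byte is the "sampled" flag.
        boolean sampled = (Integer.parseInt(m.group(3), 16) & 0x01) == 1;
        return new TraceParent(traceId, spanId, sampled);
    }
}
```

A consumer would pass the raw bytes of the record's traceparent header to TraceParent.parse and, on a null result, start a new root span instead of failing the message.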

Visualizing the Pipeline

With Jaeger and Grafana, we now have a "Service Map" that highlights bottlenecks in red. This let us identify a single "hot" Kafka partition, overloaded by an uneven key-hashing strategy, that was causing 90% of our latency spikes.
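
To make the "hot partition" effect concrete, here is a small sketch that counts how a set of record keys spreads across partitions and reports how far the busiest partition sits above the average. Everything in it is an illustrative assumption: the PartitionSkew class, the sample keys, and especially the stand-in partitioner, which uses String.hashCode modulo the partition count rather than the murmur2 hash that Kafka's default partitioner applies to the serialized key.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch: estimates how record keys spread across partitions. */
public final class PartitionSkew {

    /** Stand-in partitioner (assumption): non-negative hash of the key modulo the partition count. */
    public static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    /** Counts records per partition for the given sequence of record keys. */
    public static long[] countPerPartition(List<String> keys, int numPartitions) {
        long[] counts = new long[numPartitions];
        for (String key : keys) {
            counts[partitionFor(key, numPartitions)]++;
        }
        return counts;
    }

    /** Busiest partition's load divided by the mean load; 1.0 means perfectly even. */
    public static double skewRatio(long[] counts) {
        long total = Arrays.stream(counts).sum();
        if (total == 0) {
            return 1.0;
        }
        long max = Arrays.stream(counts).max().getAsLong();
        return max / ((double) total / counts.length);
    }

    public static void main(String[] args) {
        // Illustrative traffic: 90 records share one key, 10 records have distinct keys.
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < 90; i++) {
            keys.add("tenant-42");
        }
        for (int i = 0; i < 10; i++) {
            keys.add("tenant-" + i);
        }
        long[] counts = countPerPartition(keys, 4);
        System.out.println(Arrays.toString(counts) + " skew=" + skewRatio(counts));
    }
}
```

Because every record with the same key lands on the same partition, one dominant key is enough to overload a single partition and the consumer assigned to it. A skew ratio well above 1.0 suggests that the key choice or the partitioning strategy deserves a second look.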

Final Takeaway

You cannot optimize what you cannot measure. In distributed systems, observability is not a "nice to have"—it is your only defense against the complexity of asynchronous failures.

The Observability Gap: Beyond Just Logs and Metrics
