Distributed Tracing

Understanding Request Flow Across Microservices

In a microservices architecture, a single user request can flow through dozens of services. Distributed tracing records every hop, so you can see the complete path, where the time went, and where a failure occurred.

The Problem: Microservices = Distributed Chaos
==============================================

User clicks "Place Order"
┌─────────────────┐
│   API Gateway   │──────────┐
└────────┬────────┘          │
         │                   ▼
         │          ┌─────────────────┐
         │          │  Auth Service   │
         │          └────────┬────────┘
         │                   │
         ▼                   ▼
┌─────────────────┐ ┌─────────────────┐
│  Order Service  │ │  User Service   │
│                 │ │                 │
└────────┬────────┘ └────────┬────────┘
         │                   │
         │            ┌──────┴──────┐
         ▼            ▼             ▼
   ┌───────────┐ ┌─────────┐ ┌──────────────┐
   │  Payment  │ │   DB    │ │ Notification │
   │  Service  │ │         │ │   Service    │
   └───────────┘ └─────────┘ └──────────────┘

Without Tracing:
────────────────
"Order failed" → Which service? Why? How long?

With Tracing:
─────────────
"Order failed at Payment Service:
  Stripe API timeout (1.2s)
  Called from Order Service (line 142)"

Trace = Complete request journey
Span  = Single operation

┌─────────────────────────────────────────────────────────────┐
│ Trace: POST /checkout (total: 2.5s)                         │
│                                                             │
│  ┌───────────────────────────────────────────────────────┐  │
│  │ Span: api-gateway (50ms)                              │  │
│  │  ┌─────────────────────────────────────────────────┐  │  │
│  │  │ Span: auth-service (30ms)                       │  │  │
│  │  └─────────────────────────────────────────────────┘  │  │
│  │  ┌─────────────────────────────────────────────────┐  │  │
│  │  │ Span: order-service (800ms)                     │  │  │
│  │  │  ┌──────────────────┐   ┌────────────────────┐  │  │  │
│  │  │  │ payment-svc      │   │ inventory-svc      │  │  │  │
│  │  │  │ (500ms)          │   │ (200ms)            │  │  │  │
│  │  │  └──────────────────┘   └────────────────────┘  │  │  │
│  │  └─────────────────────────────────────────────────┘  │  │
│  │  ┌─────────────────────────────────────────────────┐  │  │
│  │  │ Span: notification-svc (100ms)                  │  │  │
│  │  └─────────────────────────────────────────────────┘  │  │
│  └───────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────┘

To correlate spans across services, trace context must be passed in HTTP headers.

Trace Context Headers (W3C Standard)
====================================
When Service A calls Service B:

Request from A to B:

traceparent: 00-0af7651916cd43dd8448eb211c80319c-cb3c1c71a3c1f46a-01
             │  │                                │                │
             │  │                                │                └── Flags (01 = sampled)
             │  │                                └── Parent ID: span ID of the caller (16 hex chars)
             │  └── Trace ID (32 hex chars)
             └── Version (00)

Additional (optional):
─────────────────────
tracestate: congo=t61rcWkgMzE    (W3C companion header for vendor-specific data)

Zipkin B3 headers (a separate, pre-W3C format that some systems still use):
X-B3-TraceId: 0af7651916cd43dd8448eb211c80319c
X-B3-SpanId: cb3c1c71a3c1f46a
X-B3-ParentSpanId: a3c1c71a3c1f46b2
X-B3-Sampled: 1
─────────────────────────────────────────────────────────────
Service B receives request:
1. Extract traceparent header
2. Create NEW span with ParentSpanId = incoming SpanId
3. Generate NEW SpanId for itself
4. Add its own timing
5. Pass the updated traceparent (now carrying B's span ID) to the next service (C)
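
The receive-and-forward steps above can be sketched in a few lines of standard-library Python. This is only an illustration of the header mechanics, not how any particular tracing library implements it: the names `SpanContext`, `extract`, and `inject` are hypothetical, the sketch accepts only version `00` headers, and it assumes header names have already been lowercased.

```python
import re
import secrets
from dataclasses import dataclass
from typing import Optional

# 00-<32 hex trace id>-<16 hex parent id>-<2 hex flags>, lowercase hex only
_TRACEPARENT = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


@dataclass(frozen=True)
class SpanContext:
    trace_id: str
    span_id: str
    parent_span_id: Optional[str]
    flags: str  # "01" means the caller sampled this trace


def extract(headers: dict) -> Optional[SpanContext]:
    """Steps 1-3: parse the incoming header and derive this service's span context."""
    match = _TRACEPARENT.fullmatch(headers.get("traceparent", ""))
    if match is None:
        return None  # missing or malformed header: the caller would start a new trace
    trace_id, caller_span_id, flags = match.groups()
    # All-zero IDs are invalid in the W3C format.
    if set(trace_id) == {"0"} or set(caller_span_id) == {"0"}:
        return None
    return SpanContext(
        trace_id=trace_id,              # unchanged for the whole trace
        span_id=secrets.token_hex(8),   # new 16-hex-char ID for this service's span
        parent_span_id=caller_span_id,  # the caller's span becomes the parent
        flags=flags,
    )


def inject(ctx: SpanContext) -> dict:
    """Step 5: headers to attach when calling the next service."""
    return {"traceparent": f"00-{ctx.trace_id}-{ctx.span_id}-{ctx.flags}"}
```

Service B would call `extract` on the incoming request headers, time its own work (step 4), and merge the result of `inject` into the headers of its outgoing call to Service C. In practice a tracing SDK does this for you.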

Span Structure
==============
{
  "traceId": "0af7651916cd43dd8448eb211c80319c",   ← Same for all spans in the trace
  "spanId": "cb3c1c71a3c1f46a",                    ← Unique per span
  "parentSpanId": "a3c1c71a3c1f46b2",              ← Parent's spanId
  "name": "POST /checkout",
  "timestamp": 1699999999000000,                   // start time, microseconds since epoch
  "duration": 800000,                              // microseconds
  "kind": "SERVER",                                // also CLIENT, INTERNAL, PRODUCER, CONSUMER
  "status": {
    "code": "OK"
  },
  "attributes": {
    "http.method": "POST",
    "http.url": "/api/v1/checkout",
    "http.status_code": 200,
    "user.id": "user-123",
    "order.id": "order-456"
  },
  "events": [
    {
      "name": "cache miss",
      "timestamp": 1699999999100000
    },
    {
      "name": "db query started",
      "timestamp": 1699999999200000
    }
  ]
}
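
Spans arrive at a tracing backend as flat records like the one above, and the tree is rebuilt from `parentSpanId`. The following is a minimal sketch of that reassembly, assuming spans are plain dicts with the field names shown above. The helper names `build_tree` and `render` and the sample data are made up for illustration.

```python
from collections import defaultdict


def build_tree(spans):
    """Group the spans of one trace by parent; return (roots, children)."""
    known_ids = {s["spanId"] for s in spans}
    children = defaultdict(list)
    roots = []
    for s in spans:
        parent = s.get("parentSpanId")
        if parent in known_ids:
            children[parent].append(s)
        else:
            roots.append(s)  # no parent, or the parent span was never received
    return roots, children


def render(spans):
    """Return an indented outline, one line per span, siblings ordered by start time."""
    roots, children = build_tree(spans)
    lines = []

    def walk(span, depth):
        lines.append(f"{'  ' * depth}{span['name']} ({span['duration'] // 1000}ms)")
        for child in sorted(children[span["spanId"]], key=lambda c: c["timestamp"]):
            walk(child, depth + 1)

    for root in sorted(roots, key=lambda r: r["timestamp"]):
        walk(root, 0)
    return lines


# Hypothetical spans, deliberately out of order (durations in microseconds)
example_spans = [
    {"spanId": "b", "parentSpanId": "a", "name": "order-service", "timestamp": 200, "duration": 800_000},
    {"spanId": "a", "parentSpanId": None, "name": "POST /checkout", "timestamp": 100, "duration": 2_500_000},
    {"spanId": "c", "parentSpanId": "b", "name": "payment-svc", "timestamp": 300, "duration": 500_000},
]

print("\n".join(render(example_spans)))
```

Treating a span whose parent is missing as a root keeps the rest of the trace visible when one service fails to report its spans.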

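| Tool      | Type        | Pros                | Cons           |
|-----------|-------------|---------------------|----------------|
| Jaeger    | Open Source | Full-featured, CNCF | Complex setup  |
| Zipkin    | Open Source | Lightweight, simple | Fewer features |
| AWS X-Ray | Managed     | AWS native          | AWS only       |
| Datadog   | Commercial  | APM + Tracing       | Expensive      |
| Tempo     | Open Source | Grafana native      | Newer          |
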
# Python with OpenTelemetry (Flask)
from flask import Flask, request
from opentelemetry import trace
# Note: the Jaeger Thrift exporter is deprecated in recent OpenTelemetry
# releases; Jaeger can also ingest OTLP directly.
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Setup
trace.set_tracer_provider(TracerProvider())
jaeger_exporter = JaegerExporter(
    agent_host_name="jaeger",
    agent_port=6831,
)
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(jaeger_exporter)
)
tracer = trace.get_tracer(__name__)
app = Flask(__name__)

# auth_service, order_service and notification_service are application
# clients assumed to be defined elsewhere.

# Usage
@app.route('/checkout', methods=['POST'])
def checkout():
    with tracer.start_as_current_span("checkout") as span:
        # Assumes a JSON request body with an "items" list
        items = request.get_json()["items"]

        # Call auth service (child span, so it gets its own variable name)
        with tracer.start_as_current_span("validate_token") as auth_span:
            token = request.headers.get("Authorization")
            user = auth_service.validate(token)
            auth_span.set_attribute("user.valid", True)
        # Assumes the validated user object exposes an id
        span.set_attribute("user.id", user.id)

        # Call order service
        with tracer.start_as_current_span("create_order") as order_span:
            order_span.set_attribute("order.items_count", len(items))
            order = order_service.create(user, items)

        # Call notification (async - doesn't block)
        notification_service.send(order.id)
        return {"order_id": order.id}

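// Node.js with OpenTelemetry
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { trace, context } = require('@opentelemetry/api');
// Note: @opentelemetry/exporter-jaeger is deprecated; Jaeger can also ingest OTLP directly.
const { JaegerExporter } = require('@opentelemetry/exporter-jaeger');
const { HttpInstrumentation } = require('@opentelemetry/instrumentation-http');
const { ExpressInstrumentation } = require('@opentelemetry/instrumentation-express');

// Start the SDK before loading express so the instrumentations can patch it
const sdk = new NodeSDK({
  traceExporter: new JaegerExporter({ endpoint: 'http://jaeger:14268/api/traces' }),
  instrumentations: [new HttpInstrumentation(), new ExpressInstrumentation()],
});
sdk.start();

const express = require('express');
const app = express();

// Usage in Express (db and stripe are application clients assumed to be defined elsewhere)
app.post('/checkout', async (req, res) => {
  // Span created for this request by the HTTP/Express instrumentation
  const span = trace.getSpan(context.active());
  if (span) span.setAttribute('user.id', req.user.id);

  // Database call (only traced automatically if an instrumentation for the DB client is registered)
  const user = await db.users.findById(req.user.id);

  // External call (charge parameters elided in the original)
  const payment = await stripe.charges.create({ /* ... */ });

  // Hypothetical helper: the original never shows where `order` comes from
  const order = await db.orders.create({ userId: user.id, paymentId: payment.id });
  res.json({ orderId: order.id });
});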

Jaeger Trace View
=================
┌─────────────────────────────────────────────────────────────┐
│ Trace: POST /api/orders (2.342s)                            │
│ ID: 0af7651916cd43dd8448eb211c80319c                        │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  api-gateway (50ms)                                         │
│  ├── auth (30ms)                                            │
│  └── orders (1.8s)                                          │
│      ├── pay (1s)                                           │
│      └── notification (100ms)                               │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Analysis:
─────────
• Total: 2.342s
• Orders service: 1.8s (77% of total)
• Payment service: 1s (43% of total, 56% of the orders span)
→ The payment call is the largest single contributor and the likely bottleneck
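
The percentages in an analysis like this are plain ratios, but the denominator matters: a span's share of the whole trace is not the same as its share of its parent span. Below is a small sketch using the durations read off the trace view. The function names `percent` and `rank_by_share` are illustrative, not part of any tracing tool.

```python
def percent(part_ms, whole_ms):
    """Share of `whole_ms` taken by `part_ms`, as a percentage with one decimal."""
    return round(100 * part_ms / whole_ms, 1)


def rank_by_share(durations_ms, total_ms):
    """Return (name, ms, percent_of_total) tuples, slowest span first."""
    rows = [(name, ms, percent(ms, total_ms)) for name, ms in durations_ms.items()]
    return sorted(rows, key=lambda row: row[1], reverse=True)


TOTAL_MS = 2342
durations_ms = {
    "api-gateway": 50,
    "auth": 30,
    "orders": 1800,
    "pay": 1000,
    "notification": 100,
}

for name, ms, pct in rank_by_share(durations_ms, TOTAL_MS):
    print(f"{name}: {ms}ms ({pct}% of total)")
# orders: 1800ms (76.9% of total)
# pay: 1000ms (42.7% of total)
# notification: 100ms (4.3% of total)
# api-gateway: 50ms (2.1% of total)
# auth: 30ms (1.3% of total)

print(percent(1000, 1800))  # pay as a share of its parent span, orders: 55.6
```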

Tracing Best Practices
=====================
✓ Always include correlation IDs (user ID, order ID, session ID)
✓ Sample appropriately (keep 100% of error traces, which needs a decision after the trace completes; 1-10% for normal traffic)
✓ Add span events for important actions
✓ Tag database queries with table names
✓ Tag external calls with service names
✓ Include error information in spans
✓ Use standard header propagation
✓ Create custom spans for business operations
─────────────────────────────────────────
What NOT to do:
───────────────
✗ Log sensitive data in spans
✗ Create too many spans (performance impact)
✗ Trace everything at 100% (cost!)
✗ Forget to propagate context to async operations
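
One way to reconcile "keep every error" with "trace only a few percent" is to make the percentage decision a deterministic function of the trace ID, so every service agrees without coordinating, and to override it once a trace is known to contain an error. The sketch below shows that idea under stated assumptions: `ratio_sampled` and `keep_trace` are hypothetical names, and the error override only works where the whole trace is visible after it finishes (typically a collector), not inside a single service at request start.

```python
def ratio_sampled(trace_id: str, rate: float) -> bool:
    """Deterministic sampling: the same trace ID always gets the same answer."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0.0 and 1.0")
    bucket = int(trace_id[-16:], 16)        # low 64 bits of the 128-bit trace ID
    return bucket < int(rate * (1 << 64))   # keep the lowest `rate` fraction of buckets


def keep_trace(trace_id: str, has_error: bool, rate: float = 0.05) -> bool:
    """Keep every trace that contains an error, plus a fixed share of the rest."""
    return has_error or ratio_sampled(trace_id, rate)


trace_id = "0af7651916cd43dd8448eb211c80319c"
print(keep_trace(trace_id, has_error=True, rate=0.0))   # True: errors are always kept
print(ratio_sampled(trace_id, 1.0))                     # True: 100% keeps everything
print(ratio_sampled(trace_id, 0.0))                     # False: 0% keeps nothing
```

Because the decision depends only on the trace ID, a trace is either recorded by every service or by none, which avoids traces with missing spans.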

  1. Trace - Complete request journey across services
  2. Span - Single operation with timing
  3. Trace ID - Correlates all spans in a request
  4. Context propagation - Pass trace headers between services
  5. OpenTelemetry - Vendor-neutral standard
  6. Jaeger/Zipkin - Popular open-source tracing backends with trace visualization
  7. Sampling - Trace subset to manage cost
