Chapter 29: Distributed Tracing
Understanding Request Flow Across Microservices

29.1 What is Distributed Tracing?
In a microservices architecture, a single user request can flow through dozens of services. Distributed tracing lets you see the complete path.

    The Problem: Microservices = Distributed Chaos
    ==============================================

    User clicks "Place Order"
             │
             ▼
    ┌─────────────────┐
    │   API Gateway   │──────────┐
    └────────┬────────┘          │
             │                   │
             ▼                   │
    ┌─────────────────┐          │
    │  Auth Service   │          │
    └────────┬────────┘          │
             │                   │
             ▼                   ▼
    ┌─────────────────┐ ┌─────────────────┐
    │  Order Service  │ │  User Service   │
    └────────┬────────┘ └────────┬────────┘
             │                   │
             │       ┌───────────┴───────────┐
             ▼       ▼                       ▼
    ┌───────────┐ ┌─────────┐         ┌────────────┐
    │  Payment  │ │   DB    │         │Notification│
    │  Service  │ │         │         │  Service   │
    └───────────┘ └─────────┘         └────────────┘

    Without Tracing:
    ────────────────
    "Order failed" → Which service? Why? How long?

    With Tracing:
    ─────────────
    "Order failed at Payment Service:
     Stripe API timeout (1.2s)
     Called from Order Service (line 142)"

Trace vs Span

    Trace = Complete request journey
    Span  = Single operation

    Trace: POST /checkout (total: 2.5s)
    └── Span: api-gateway (50ms)
        ├── Span: auth-service (30ms)
        ├── Span: order-service (800ms)
        │   ├── Span: payment-svc (500ms)
        │   └── Span: inventory-svc (200ms)
        └── Span: notification-svc (100ms)

29.2 Trace Context Propagation
To correlate spans across services, trace context must be passed in HTTP headers.

    Trace Context Headers (W3C Standard)
    ====================================

    When Service A calls Service B:

    Request from A to B:

    traceparent: 00-0af7651916cd43dd8448eb211c80319-cb3c1c71a3c1f46-01
                 │  │                               │               │
                 │  │                               │               └── Flags (01 = sampled)
                 │  │                               └── Span ID (the caller's span, which becomes the parent)
                 │  └── Trace ID
                 └── Version

    In a real header the trace ID is 32 lowercase hex characters and the
    span ID is 16; the sample IDs in this chapter are shorter and serve
    only as illustrations.

    Additional (optional):
    ─────────────────────
    tracestate: congo=t61rcWkgMzE     ← W3C companion header for vendor-specific data

    B3 headers (Zipkin's format, an alternative to traceparent):
    X-B3-TraceId: 0af7651916cd43dd8448eb211c80319
    X-B3-SpanId: cb3c1c71a3c1f46
    X-B3-ParentSpanId: a3c1c71a3c1f46
    X-B3-Sampled: 1

    ─────────────────────────────────────────────────────────────

    Service B receives request:
    1. Extract traceparent header
    2. Create NEW span with ParentSpanId = incoming SpanId
    3. Generate NEW SpanId for itself
    4. Add its own timing
    5. Pass to next service (C)
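
The receive-and-forward steps above can be sketched with the standard library alone. This is a minimal illustration, not how any particular tracing SDK implements it: `SpanContext`, `extract`, and `inject` are hypothetical names, header keys are assumed to be lowercase, a missing or malformed header is assumed to start a new sampled trace, and the sample header value is the example from the W3C Trace Context specification.

```python
import re
import secrets
from dataclasses import dataclass

# version-traceid-parentid-flags, all lowercase hex (W3C Trace Context, version 00)
_TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


@dataclass(frozen=True)
class SpanContext:
    trace_id: str        # shared by every span in the trace
    span_id: str         # this span's own ID
    parent_span_id: str  # the caller's span ID ("" for a root span)
    flags: str           # "01" means the trace is sampled


def extract(headers: dict) -> SpanContext:
    """Steps 1-3: read traceparent and start a child span context."""
    match = _TRACEPARENT_RE.match(headers.get("traceparent", ""))
    if match is None or match.group(2) == "0" * 32 or match.group(3) == "0" * 16:
        # Missing or invalid header: start a brand-new trace (assumed sampled)
        return SpanContext(secrets.token_hex(16), secrets.token_hex(8), "", "01")
    _version, trace_id, caller_span_id, flags = match.groups()
    return SpanContext(
        trace_id=trace_id,              # unchanged across services
        span_id=secrets.token_hex(8),   # NEW span ID for this service
        parent_span_id=caller_span_id,  # the incoming span becomes the parent
        flags=flags,
    )


def inject(ctx: SpanContext) -> dict:
    """Step 5: build the header to send to the next service."""
    return {"traceparent": f"00-{ctx.trace_id}-{ctx.span_id}-{ctx.flags}"}


incoming = {"traceparent": "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01"}
ctx = extract(incoming)
outgoing = inject(ctx)
print(outgoing["traceparent"])
```

The trace ID and flags pass through untouched, while the span ID in the outgoing header is the one Service B generated for itself, so Service C will record B's span as its parent.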
29.3 Span Anatomy

    Span Structure
    ==============

    {
      "traceId": "0af7651916cd43dd8448eb211c80319",    ← Same for all spans in the trace
      "spanId": "cb3c1c71a3c1f46",                      ← Unique per span
      "parentSpanId": "a3c1c71a3c1f46",                 ← Parent's spanId
      "name": "POST /checkout",
      "timestamp": 1699999999000000,                    // start time, microseconds
      "duration": 800000,                               // microseconds
      "kind": "SERVER",                                 // e.g. CLIENT or SERVER
      "status": { "code": "OK" },
      "attributes": {
        "http.method": "POST",
        "http.url": "/api/v1/checkout",
        "http.status_code": 200,
        "user.id": "user-123",
        "order.id": "order-456"
      },
      "events": [
        { "name": "cache miss", "timestamp": 1699999999100000 },
        { "name": "db query started", "timestamp": 1699999999200000 }
      ]
    }
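
Because every span carries its parent's ID, a tracing backend can rebuild the call tree from a flat list of span records. The sketch below shows one way to do that; the span names and durations are borrowed from the POST /checkout trace in section 29.1, while the short IDs (`a1` to `a6`) and the helper names `build_tree` and `render` are made up for illustration.

```python
from collections import defaultdict

# Flat span records, reduced to the fields needed here.
# Durations are in microseconds, as in the span structure above.
spans = [
    {"spanId": "a1", "parentSpanId": None, "name": "api-gateway", "duration": 50_000},
    {"spanId": "a2", "parentSpanId": "a1", "name": "auth-service", "duration": 30_000},
    {"spanId": "a3", "parentSpanId": "a1", "name": "order-service", "duration": 800_000},
    {"spanId": "a4", "parentSpanId": "a3", "name": "payment-svc", "duration": 500_000},
    {"spanId": "a5", "parentSpanId": "a3", "name": "inventory-svc", "duration": 200_000},
    {"spanId": "a6", "parentSpanId": "a1", "name": "notification-svc", "duration": 100_000},
]


def build_tree(spans):
    """Group spans by parentSpanId; return (root spans, children keyed by parent ID)."""
    children = defaultdict(list)
    roots = []
    for span in spans:
        if span["parentSpanId"] is None:
            roots.append(span)
        else:
            children[span["parentSpanId"]].append(span)
    return roots, children


def render(span, children, depth=0):
    """Return indented text lines for a span and all of its descendants."""
    lines = [f"{'  ' * depth}{span['name']} ({span['duration'] / 1000:.0f}ms)"]
    for child in children[span["spanId"]]:
        lines.extend(render(child, children, depth + 1))
    return lines


roots, children = build_tree(spans)
lines = [line for root in roots for line in render(root, children)]
print("\n".join(lines))
```

A span with no parent ID is the root of the trace; everything else hangs off the span named in its `parentSpanId`.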
29.4 Tools & Implementation
Popular Tracing Tools

| Tool | Type | Pros | Cons |
|---|---|---|---|
| Jaeger | Open Source | Full-featured, CNCF | Complex setup |
| Zipkin | Open Source | Lightweight, simple | Fewer features |
| AWS X-Ray | Managed | AWS native | AWS only |
| Datadog | Commercial | APM + Tracing | Expensive |
| Tempo | Open Source | Grafana native | Newer |
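Implementation with OpenTelemetry

    # Python with OpenTelemetry
    # Note: recent OpenTelemetry releases favor the OTLP exporter over the
    # Jaeger Thrift exporter; it is kept here to match the Jaeger setup.
    from flask import Flask, request
    from opentelemetry import trace
    from opentelemetry.exporter.jaeger.thrift import JaegerExporter
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor

    # Setup: register a tracer provider that batches spans to the Jaeger agent
    trace.set_tracer_provider(TracerProvider())
    jaeger_exporter = JaegerExporter(
        agent_host_name="jaeger",
        agent_port=6831,
    )
    trace.get_tracer_provider().add_span_processor(
        BatchSpanProcessor(jaeger_exporter)
    )

    tracer = trace.get_tracer(__name__)
    app = Flask(__name__)

    # auth_service, order_service and notification_service are the
    # application's own service clients, defined elsewhere.

    # Usage
    @app.route('/checkout', methods=['POST'])
    def checkout():
        with tracer.start_as_current_span("checkout") as span:
            # Assumes earlier middleware attached user_id to the request
            span.set_attribute("user.id", request.user_id)
            # Assumes the request body carries an "items" list
            items = request.get_json()["items"]

            # Call auth service (child span of "checkout")
            with tracer.start_as_current_span("validate_token") as auth_span:
                token = request.headers.get("Authorization")
                user = auth_service.validate(token)
                auth_span.set_attribute("user.valid", True)

            # Call order service (child span of "checkout")
            with tracer.start_as_current_span("create_order") as order_span:
                order_span.set_attribute("order.items_count", len(items))
                order = order_service.create(user, items)

            # Call notification (async - doesn't block)
            notification_service.send(order.id)

            return {"order_id": order.id}

    // Node.js with OpenTelemetry
    const { trace, context } = require('@opentelemetry/api');
    const { NodeSDK } = require('@opentelemetry/sdk-node');
    const { JaegerExporter } = require('@opentelemetry/exporter-jaeger');
    const { HttpInstrumentation } = require('@opentelemetry/instrumentation-http');
    const { ExpressInstrumentation } = require('@opentelemetry/instrumentation-express');

    const sdk = new NodeSDK({
      instrumentations: [new HttpInstrumentation(), new ExpressInstrumentation()],
      // NodeSDK takes the exporter under `traceExporter`
      traceExporter: new JaegerExporter({ endpoint: 'http://jaeger:14268/api/traces' })
    });

    sdk.start();

    // Load Express after the SDK has started so the instrumentation can patch it
    const express = require('express');
    const app = express();

    // db and stripe are the application's own clients, defined elsewhere.

    // Usage in Express
    app.post('/checkout', async (req, res) => {
      // The active span is the one opened by the HTTP/Express instrumentation
      const span = trace.getSpan(context.active());
      span?.setAttribute('user.id', req.user.id);

      // Database call (traced automatically only if an instrumentation
      // for the database client is also registered)
      const user = await db.users.findById(req.user.id);

      // External call
      const payment = await stripe.charges.create({ /* ... */ });

      // Hypothetical step: the original never defines `order`
      const order = await db.orders.create({ userId: user.id, paymentId: payment.id });

      res.json({ orderId: order.id });
    });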
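
The Python handler earlier in this section hands the notification off asynchronously, which is where trace context is most easily lost. OpenTelemetry's Python SDK keeps the active span in context variables, and on a default CPython build a worker thread does not inherit the caller's context variables. The sketch below reproduces the effect with the standard library only: `current_trace_id` is a stand-in for the tracing library's own context slot, and `send_notification` and `checkout` are hypothetical helpers, not OpenTelemetry APIs.

```python
import contextvars
from concurrent.futures import ThreadPoolExecutor

# Stand-in for the tracing library's "current trace" slot
current_trace_id = contextvars.ContextVar("current_trace_id", default=None)


def send_notification(order_id):
    """Runs on a worker thread; reports which trace it believes it belongs to."""
    return order_id, current_trace_id.get()


def checkout(executor, order_id, trace_id):
    current_trace_id.set(trace_id)

    # Naive hand-off: the task runs in the worker thread's own context,
    # which on a default CPython build does not include the caller's values
    lost = executor.submit(send_notification, order_id)

    # Explicit hand-off: snapshot the caller's context and run the task inside it
    ctx = contextvars.copy_context()
    kept = executor.submit(ctx.run, send_notification, order_id)

    return lost.result(), kept.result()


with ThreadPoolExecutor(max_workers=1) as pool:
    lost, kept = checkout(pool, "order-456", "0af7651916cd43dd8448eb211c80319c")

print("naive hand-off sees trace:", lost[1])
print("explicit hand-off sees trace:", kept[1])
```

The explicit hand-off always carries the trace ID into the worker; the naive one typically reports no trace at all, so its spans would show up as a separate, orphaned trace.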
29.5 Visualizing Traces

    Jaeger Trace View
    =================

    Trace: POST /api/orders (2.342s)
    ID: 0af7651916cd43dd8448eb211c80319

    ████████████ api-gateway (50ms)
    ├── auth (30ms)
    └── orders (1.8s)
        ├── pay (1s)
        └── notification (100ms)

    Analysis:
    ─────────
    • Total: 2.342s
    • Orders service: 1.8s (77% of total!)
    • Payment service: 1s (43% of total, 56% of the orders span!)

    → Payment service is the bottleneck!
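
The same analysis can be done numerically. The sketch below takes the durations from the trace view above and computes each span's share of the total plus its self time (its duration minus time spent in direct children). It assumes child spans run one after another rather than in parallel, and `share_of_total` and `self_time` are illustrative helpers rather than part of any tracing tool.

```python
TRACE_TOTAL_S = 2.342

# (name, parent, duration in seconds), taken from the trace view above
SPANS = [
    ("auth", "api-gateway", 0.030),
    ("orders", "api-gateway", 1.8),
    ("pay", "orders", 1.0),
    ("notification", "orders", 0.1),
]


def share_of_total(duration_s, total_s=TRACE_TOTAL_S):
    """Percentage of the whole trace spent in a span, rounded to a whole number."""
    return round(100 * duration_s / total_s)


def self_time(name):
    """Duration of a span minus the time spent in its direct children."""
    own = next(d for n, _parent, d in SPANS if n == name)
    in_children = sum(d for _n, parent, d in SPANS if parent == name)
    return own - in_children


for name, _parent, duration in SPANS:
    print(f"{name}: {share_of_total(duration)}% of total, self time {self_time(name):.2f}s")

# The span with the largest self time is where the request actually waits
bottleneck = max((n for n, _parent, _d in SPANS), key=self_time)
print("bottleneck:", bottleneck)
```

Self time is what separates the two large spans: orders covers 77% of the trace, but most of that is spent waiting on pay, which has the largest self time (1s, about 43% of the total).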
29.6 Best Practices

    Tracing Best Practices
    ======================

    ✓ Always include correlation IDs (user ID, order ID, session ID)
    ✓ Sample appropriately (keep 100% of error traces, which requires
      tail-based sampling; 1-10% of normal traffic)
    ✓ Add span events for important actions
    ✓ Tag database queries with table names
    ✓ Tag external calls with service names
    ✓ Include error information in spans
    ✓ Use standard header propagation
    ✓ Create custom spans for business operations

    ─────────────────────────────────────────

    What NOT to do:
    ───────────────
    ✗ Log sensitive data in spans
    ✗ Create too many spans (performance impact)
    ✗ Trace everything at 100% (cost!)
    ✗ Forget to propagate context to async operations
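
The sampling guidance above (keep every error, only a small share of normal traffic) can be sketched as follows. This is an illustration under stated assumptions, not the algorithm used by OpenTelemetry or any specific backend: `head_sample` and `keep_trace` are hypothetical names, the 5% default rate is arbitrary, and the keep-all-errors rule assumes the decision is made after the trace has finished (tail-based sampling), since an error is not known when the request starts.

```python
import hashlib


def head_sample(trace_id: str, rate: float) -> bool:
    """Deterministic ratio sampling: the same trace ID always gets the same answer,
    so every service reaches the same decision without coordination."""
    digest = hashlib.sha256(trace_id.encode("ascii")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # stable value in [0, 2**64)
    return bucket < int(rate * 2**64)


def keep_trace(trace_id: str, had_error: bool, rate: float = 0.05) -> bool:
    """Tail-style decision on a finished trace: keep all errors, sample the rest."""
    return had_error or head_sample(trace_id, rate)


# Synthetic trace IDs for a quick check of the effective sampling rate
trace_ids = [f"{i:032x}" for i in range(10_000)]
kept_ok = sum(keep_trace(t, had_error=False) for t in trace_ids)
kept_err = sum(keep_trace(t, had_error=True) for t in trace_ids)
print(f"kept {kept_ok} of {len(trace_ids)} normal traces, {kept_err} error traces")
```

Hashing the trace ID rather than drawing a random number keeps traces whole: a trace is either recorded by every service or by none.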
Summary
- Trace - Complete request journey across services
- Span - Single operation with timing
- Trace ID - Correlates all spans in a request
- Context propagation - Pass trace headers between services
- OpenTelemetry - Vendor-neutral standard
- Jaeger/Zipkin - Popular visualization tools
- Sampling - Trace subset to manage cost