Logging
Chapter 26: Logging Best Practices
Section titled “Chapter 26: Logging Best Practices”Structured Logging for Production Systems
Section titled “Structured Logging for Production Systems”26.1 Why Logging Matters
Section titled “26.1 Why Logging Matters”Logging is the foundation of observability. Without logs, debugging production issues is like finding a needle in a haystack.
The Debugging Journey ====================
Without Proper Logging: ──────────────────────
User: "I can't place an order!" Support: "Can you describe what happened?" User: "Nothing! The page just doesn't work!" Dev: *checks code* *guesses* *tries random fix*
Time: 4 hours Result: Still broken, added more bugs
─────────────────────────────────────────
With Proper Logging: ────────────────────
User: "I can't place an order!" Dev: *searches logs for user ID* Found: "PaymentGatewayException: Card declined - insufficient funds (order-123, user-456)"
Time: 2 minutes Result: Fixed! Notified user about fundsWhat Can Go Wrong Without Logging
Section titled “What Can Go Wrong Without Logging”| Scenario | Without Logs | With Logs |
|---|---|---|
| API failing | ”API is down" | "POST /checkout returning 500: DB connection timeout” |
| Slow response | ”It’s slow" | "GET /search took 8.2s: Full table scan on users table” |
| Security issue | ”Someone did something" | "Failed login attempt: user@evil.com from IP 1.2.3.4 (50th attempt)“ |
| Data issue | ”Wrong total" | "Order total recalculated: 99.99 → 149.99 (coupon expired)“ |
26.2 Log Levels
Section titled “26.2 Log Levels”Understanding and using correct log levels is crucial for effective logging.
Log Level Hierarchy ==================
┌─────────────────────────────────────────────────────────────┐ │ FATAL/CRITICAL - System crash, data loss imminent │ │ ─────────────────────────────────────────────────────────│ │ Examples: │ │ • Database connection permanently lost │ │ • Disk full │ │ • Uncaught exception causing service crash │ │ Color: RED │ └─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐ │ ERROR - Operation failed but system continues │ │ ─────────────────────────────────────────────────────────│ │ Examples: │ │ • API call to payment service failed │ │ • Database query failed │ │ • File not found │ │ Color: RED │ └─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐ │ WARN - Unexpected situation, may need attention │ │ ─────────────────────────────────────────────────────────│ │ Examples: │ │ • Using deprecated API │ │ • Retry attempt failed │ │ • Config using default value │ │ Color: YELLOW │ └─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐ │ INFO - Normal operation events │ │ ─────────────────────────────────────────────────────────│ │ Examples: │ │ • Server started on port 8080 │ │ • User logged in successfully │ │ • Order created: order-123 │ │ Color: GREEN/DEFAULT │ └─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐ │ DEBUG - Detailed debugging information │ │ ─────────────────────────────────────────────────────────│ │ Examples: │ │ • Function called with params: {a: 1, b: 2} │ │ • Loop iteration: i=5, value=abc │ │ • Cache miss for key: user:123 │ │ Color: GRAY │ └─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐ │ TRACE - Ultra-verbose, detailed flow │ │ ─────────────────────────────────────────────────────────│ │ Examples: │ │ • Entering function X │ │ • Exiting function Y │ │ • HTTP request: GET /api/users (10ms) │ │ Color: GRAY │ └─────────────────────────────────────────────────────────────┘When to Use Each Level
Section titled “When to Use Each Level”# Examples in Python
# FATAL - System can't continuetry: db.connect()except DatabaseUnavailable as e: # Can't operate without DB - this is FATAL logger.fatal(f"Cannot start: Database unavailable: {e}") sys.exit(1)
# ERROR - Operation failedtry: payment = stripe.charge(card, amount)except StripeError as e: # Operation failed, but we can handle it logger.error(f"Payment failed for order {order_id}: {e}") return ErrorResponse("Payment failed")
# WARN - Potential issueif len(results) > 10000: # This might be a performance issue logger.warn(f"Large result set: {len(results)} rows (consider pagination)")
# INFO - Normal operationslogger.info(f"User {user_id} created order {order_id}")logger.info(f"Server started on port {PORT}")
# DEBUG - Debugging infologger.debug(f"Processing request: {request_id}, params: {params}")logger.debug(f"Cache hit: user_profile:{user_id}")
# TRACE - Detailed flowlogger.trace(f"Entering validate_order(), order_id={order_id}")26.3 Structured Logging
Section titled “26.3 Structured Logging”Structured logging transforms human-readable text into machine-parseable data.
Text vs JSON Logging
Section titled “Text vs JSON Logging” Bad (Plain Text): =================
2024-01-15 10:30:00 ERROR User login failed for john@example.com from IP 1.2.3.4 - Invalid password
Problems: • Can't search by user ID efficiently • Can't filter by IP • Can't parse error codes • Can't build dashboards
─────────────────────────────────────────
Good (JSON Structured): ======================
{ "timestamp": "2024-01-15T10:30:00.000Z", "level": "ERROR", "logger": "auth-service", "message": "User login failed", "context": { "userId": "user-123", "email": "john@example.com", "ip": "1.2.3.4", "userAgent": "Mozilla/5.0...", "reason": "invalid_password", "attemptNumber": 2 }, "traceId": "abc-123-def", "spanId": "span-456" }
Benefits: ✓ Search by any field ✓ Filter by level, user, IP ✓ Build metrics dashboards ✓ Correlate with tracesEssential Fields
Section titled “Essential Fields”{ "timestamp": "2024-01-15T10:30:00.000Z", "level": "ERROR", "message": "Order payment failed",
"service": "order-service", "version": "1.2.3", "environment": "production",
"context": { "orderId": "ord-123", "userId": "user-456", "amount": 99.99, "currency": "USD" },
"error": { "type": "PaymentDeclined", "code": "INSUFFICIENT_FUNDS", "message": "Card declined" },
"trace": { "traceId": "abc-123", "spanId": "xyz-456", "parentSpanId": "par-789" }}Logging in Different Languages
Section titled “Logging in Different Languages”# Python with structlogimport structlogimport logging
structlog.configure( processors=[ structlog.stdlib.filter_by_level, structlog.stdlib.add_logger_name, structlog.stdlib.add_log_level, structlog.processors.TimeStamper(fmt="iso"), structlog.processors.StackInfoRenderer(), structlog.processors.format_exc_info, structlog.processors.JSONRenderer() ], wrapper_class=structlog.stdlib.BoundLogger, context_class=dict, logger_factory=structlog.stdlib.LoggerFactory(),)
logger = structlog.get_logger()
# Usagelogger.info( "order_created", order_id="ord-123", user_id="user-456", amount=99.99)// Node.js with Pinoconst pino = require('pino');
const logger = pino({ level: 'info', timestamp: () => `,"timestamp":"${new Date().toISOString()}"`, formatters: { level: (label) => ({ level: label }) }});
// Usagelogger.info({ orderId: 'ord-123', userId: 'user-456', amount: 99.99}, 'Order created');// Java with SLF4J + Logbackimport org.slf4j.Logger;import org.slf4j.LoggerFactory;import org.slf4j.MDC;
public class OrderService { private static final Logger logger = LoggerFactory.getLogger(OrderService.class);
public void createOrder(Order order) { try { MDC.put("orderId", order.getId()); MDC.put("userId", order.getUserId());
logger.info("Order created: amount={} currency={}", order.getAmount(), order.getCurrency()); } finally { MDC.clear(); } }}26.4 What to Log
Section titled “26.4 What to Log”Request/Response Logging
Section titled “Request/Response Logging”# Good request logginglogger.info("api_request", method="POST", path="/api/orders", query={"status": "pending"}, user_id=user_id, request_id=request_id, content_length=1024)
# Response logginglogger.info("api_response", method="POST", path="/api/orders", status=201, duration_ms=150, response_order_id="ord-123")Error Logging
Section titled “Error Logging”# Always include context!try: result = process_payment(order)except PaymentException as e: logger.error("payment_failed", order_id=order.id, amount=order.amount, error_code=e.code, error_message=str(e), user_id=order.user_id, card_last_four=order.card_last_four, # Don't log: full card number, CVV, password! exc_info=True # Includes stack trace ) raiseBusiness Events
Section titled “Business Events”# Log important business eventslogger.info("user_registered", user_id=user.id, email=user.email, referrer=user.referrer, signup_method="google")
logger.info("order_placed", order_id=order.id, user_id=order.user_id, item_count=len(order.items), total=order.total, coupon_used=order.coupon_code)
logger.info("subscription_created", user_id=user.id, plan="premium", billing_cycle="monthly", amount=19.99)26.5 What NOT to Log
Section titled “26.5 What NOT to Log” CRITICAL: Never Log These! =========================
┌─────────────────────────────────────────────────────────────┐ │ PASSWORD/SECRETS │ │ ─────────────────────────────────────────────────────────│ │ ✗ User passwords │ │ ✗ API keys and tokens │ │ ✗ OAuth secrets │ │ ✗ Database connection strings │ │ ✗ Private keys │ │ │ │ ✓ Log: "Login attempt" (without password) │ │ ✓ Log: "API call failed" (without keys) │ └─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐ │ PERSONAL IDENTIFIABLE INFORMATION (PII) │ │ ─────────────────────────────────────────────────────────│ │ ✗ Full credit card numbers │ │ ✗ Social Security Numbers │ │ ✗ Phone numbers (unless necessary) │ │ ✗ Home addresses │ │ ✗ Medical information │ │ │ │ ✓ Log: "Payment processed" (not card number) │ │ ✓ Log: User ID (not full name if not needed) │ └─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐ │ SENSITIVE BUSINESS DATA │ │ ─────────────────────────────────────────────────────────│ │ ✗ Full request/response bodies (can contain secrets) │ │ ✗ Customer data in bulk │ │ ✗ Internal system details that aid attackers │ └─────────────────────────────────────────────────────────────┘Data Masking
Section titled “Data Masking”import re
def mask_sensitive_data(message: str) -> str: """Mask sensitive data in log messages""" # Mask credit cards message = re.sub(r'\b\d{13,16}\b', '****-****-****-****', message) # Mask SSN message = re.sub(r'\b\d{3}-\d{2}-\d{4}\b', '***-**-****', message) # Mask API keys (common patterns) message = re.sub(r'(api[_-]?key["\']?\s*[:=]\s*["\']?)\w+', r'\1***', message) return message26.6 Log Aggregation Architecture
Section titled “26.6 Log Aggregation Architecture” Log Aggregation Pipeline ========================
┌─────────────────────────────────────────────────────────────┐ │ Application Servers │ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ │ │ Server 1 │ │ Server 2 │ │ Server N │ │ │ │ app.log │ │ app.log │ │ app.log │ │ │ └────┬─────┘ └────┬─────┘ └────┬─────┘ │ │ │ │ │ │ └───────┼─────────────┼─────────────┼────────────────────────┘ │ │ │ ▼ ▼ ▼ ┌─────────────────────────────────────────────────────────────┐ │ Log Shippers (Filebeat, Fluentd, CloudWatch Agent) │ │ ─────────────────────────────────────────────────────────│ │ • Read log files │ │ • Parse/transform │ │ • Send to centralized system │ └─────────────────────────────────────────────────────────────┘ │ ▼ ┌─────────────────────────────────────────────────────────────┐ │ Log Storage & Analysis │ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ │ │ ELK │ │ Splunk │ │ CloudWatch │ │ │ │ Stack │ │ │ │ Logs │ │ │ │(Elastic) │ │ │ │ │ │ │ └─────────────┘ └─────────────┘ └─────────────┘ │ │ │ │ • Search & Query │ │ • Dashboards │ │ • Alerts │ └─────────────────────────────────────────────────────────────┘Popular Log Aggregation Tools
Section titled “Popular Log Aggregation Tools”| Tool | Type | Pros | Cons |
|---|---|---|---|
| ELK Stack | Open Source | Free, flexible | Complex setup |
| Splunk | Commercial | Powerful, enterprise | Expensive |
| CloudWatch | AWS Managed | Native AWS | AWS-only |
| DataDog | SaaS | Great UI | Cost |
| Sumo Logic | SaaS | Security focus | Pricing |
| Grafana + Loki | Open Source | Cost effective | Newer |
26.7 Best Practices Summary
Section titled “26.7 Best Practices Summary” Logging Best Practices Checklist ===============================
✓ Use structured JSON logging ✓ Include consistent context (request ID, user ID) ✓ Use appropriate log levels ✓ Log errors with full context and stack traces ✓ Never log passwords, secrets, or PII ✓ Implement log rotation ✓ Centralize logs for searching ✓ Set up alerts on error thresholds ✓ Monitor log volume to control costs ✓ Use correlation IDs for request tracing
─────────────────────────────────────────
Log Volume Guidelines: ───────────────────── • INFO: Expected operations • WARN: Things that need attention • ERROR: Failures (always) • DEBUG: Enable in dev, selective in prod
Target: 100MB - 1GB per instance per day (varies by application)Summary
Section titled “Summary”- Log levels - FATAL > ERROR > WARN > INFO > DEBUG > TRACE
- Structured logging - Use JSON format
- Context - Always include request ID, user ID
- Security - Never log secrets or PII
- Centralize - Aggregate in ELK, Splunk, or cloud service
- Correlation - Use trace/request ID to track requests
- Rotation - Prevent disk overflow