Logging

Chapter 26: Logging Best Practices

Structured Logging for Production Systems

26.1 Why Logging Matters

Logging is the foundation of observability. Without logs, debugging production issues is like finding a needle in a haystack.

    The Debugging Journey
    ====================

    Without Proper Logging:
    ──────────────────────

    User: "I can't place an order!"
    Support: "Can you describe what happened?"
    User: "Nothing! The page just doesn't work!"
    Dev: *checks code* *guesses* *tries random fix*

    Time: 4 hours
    Result: Still broken, added more bugs

    ─────────────────────────────────────────

    With Proper Logging:
    ────────────────────

    User: "I can't place an order!"
    Dev: *searches logs for user ID*
    Found: "PaymentGatewayException: Card declined -
            insufficient funds (order-123, user-456)"

    Time: 2 minutes
    Result: Fixed! Notified user about funds

What Can Go Wrong Without Logging

Scenario	Without Logs	With Logs
API failing	”API is down"	"POST /checkout returning 500: DB connection timeout”
Slow response	”It’s slow"	"GET /search took 8.2s: Full table scan on users table”
Security issue	”Someone did something"	"Failed login attempt: user@evil.com from IP 1.2.3.4 (50th attempt)“
Data issue	”Wrong total"	"Order total recalculated: 99.99 → 149.99 (coupon expired)“

26.2 Log Levels

Understanding and using correct log levels is crucial for effective logging.

    Log Level Hierarchy
    ==================

    ┌─────────────────────────────────────────────────────────────┐
    │  FATAL/CRITICAL - System crash, data loss imminent         │
    │  ─────────────────────────────────────────────────────────│
    │  Examples:                                                 │
    │  • Database connection permanently lost                  │
    │  • Disk full                                              │
    │  • Uncaught exception causing service crash              │
    │  Color: RED                                               │
    └─────────────────────────────────────────────────────────────┘

    ┌─────────────────────────────────────────────────────────────┐
    │  ERROR - Operation failed but system continues            │
    │  ─────────────────────────────────────────────────────────│
    │  Examples:                                                 │
    │  • API call to payment service failed                    │
    │  • Database query failed                                   │
    │  • File not found                                          │
    │  Color: RED                                               │
    └─────────────────────────────────────────────────────────────┘

    ┌─────────────────────────────────────────────────────────────┐
    │  WARN - Unexpected situation, may need attention          │
    │  ─────────────────────────────────────────────────────────│
    │  Examples:                                                 │
    │  • Using deprecated API                                   │
    │  • Retry attempt failed                                    │
    │  • Config using default value                              │
    │  Color: YELLOW                                            │
    └─────────────────────────────────────────────────────────────┘

    ┌─────────────────────────────────────────────────────────────┐
    │  INFO - Normal operation events                           │
    │  ─────────────────────────────────────────────────────────│
    │  Examples:                                                 │
    │  • Server started on port 8080                            │
    │  • User logged in successfully                             │
    │  • Order created: order-123                               │
    │  Color: GREEN/DEFAULT                                      │
    └─────────────────────────────────────────────────────────────┘

    ┌─────────────────────────────────────────────────────────────┐
    │  DEBUG - Detailed debugging information                   │
    │  ─────────────────────────────────────────────────────────│
    │  Examples:                                                 │
    │  • Function called with params: {a: 1, b: 2}             │
    │  • Loop iteration: i=5, value=abc                         │
    │  • Cache miss for key: user:123                           │
    │  Color: GRAY                                              │
    └─────────────────────────────────────────────────────────────┘

    ┌─────────────────────────────────────────────────────────────┐
    │  TRACE - Ultra-verbose, detailed flow                     │
    │  ─────────────────────────────────────────────────────────│
    │  Examples:                                                 │
    │  • Entering function X                                    │
    │  • Exiting function Y                                     │
    │  • HTTP request: GET /api/users (10ms)                   │
    │  Color: GRAY                                              │
    └─────────────────────────────────────────────────────────────┘

When to Use Each Level

# Examples in Python

# FATAL - System can't continue
try:
    db.connect()
except DatabaseUnavailable as e:
    # Can't operate without DB - this is FATAL
    logger.fatal(f"Cannot start: Database unavailable: {e}")
    sys.exit(1)

# ERROR - Operation failed
try:
    payment = stripe.charge(card, amount)
except StripeError as e:
    # Operation failed, but we can handle it
    logger.error(f"Payment failed for order {order_id}: {e}")
    return ErrorResponse("Payment failed")

# WARN - Potential issue
if len(results) > 10000:
    # This might be a performance issue
    logger.warn(f"Large result set: {len(results)} rows
                 (consider pagination)")

# INFO - Normal operations
logger.info(f"User {user_id} created order {order_id}")
logger.info(f"Server started on port {PORT}")

# DEBUG - Debugging info
logger.debug(f"Processing request: {request_id},
              params: {params}")
logger.debug(f"Cache hit: user_profile:{user_id}")

# TRACE - Detailed flow
logger.trace(f"Entering validate_order(), order_id={order_id}")

26.3 Structured Logging

Structured logging transforms human-readable text into machine-parseable data.

Text vs JSON Logging

    Bad (Plain Text):
    =================

    2024-01-15 10:30:00 ERROR User login failed for john@example.com
    from IP 1.2.3.4 - Invalid password

    Problems:
    • Can't search by user ID efficiently
    • Can't filter by IP
    • Can't parse error codes
    • Can't build dashboards

    ─────────────────────────────────────────

    Good (JSON Structured):
    ======================

    {
      "timestamp": "2024-01-15T10:30:00.000Z",
      "level": "ERROR",
      "logger": "auth-service",
      "message": "User login failed",
      "context": {
        "userId": "user-123",
        "email": "john@example.com",
        "ip": "1.2.3.4",
        "userAgent": "Mozilla/5.0...",
        "reason": "invalid_password",
        "attemptNumber": 2
      },
      "traceId": "abc-123-def",
      "spanId": "span-456"
    }

    Benefits:
    ✓ Search by any field
    ✓ Filter by level, user, IP
    ✓ Build metrics dashboards
    ✓ Correlate with traces

Essential Fields

{
  "timestamp": "2024-01-15T10:30:00.000Z",
  "level": "ERROR",
  "message": "Order payment failed",

  "service": "order-service",
  "version": "1.2.3",
  "environment": "production",

  "context": {
    "orderId": "ord-123",
    "userId": "user-456",
    "amount": 99.99,
    "currency": "USD"
  },

  "error": {
    "type": "PaymentDeclined",
    "code": "INSUFFICIENT_FUNDS",
    "message": "Card declined"
  },

  "trace": {
    "traceId": "abc-123",
    "spanId": "xyz-456",
    "parentSpanId": "par-789"
  }
}

Logging in Different Languages

# Python with structlog
import structlog
import logging

structlog.configure(
    processors=[
        structlog.stdlib.filter_by_level,
        structlog.stdlib.add_logger_name,
        structlog.stdlib.add_log_level,
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.StackInfoRenderer(),
        structlog.processors.format_exc_info,
        structlog.processors.JSONRenderer()
    ],
    wrapper_class=structlog.stdlib.BoundLogger,
    context_class=dict,
    logger_factory=structlog.stdlib.LoggerFactory(),
)

logger = structlog.get_logger()

# Usage
logger.info(
    "order_created",
    order_id="ord-123",
    user_id="user-456",
    amount=99.99
)

// Node.js with Pino
const pino = require('pino');

const logger = pino({
  level: 'info',
  timestamp: () => `,"timestamp":"${new Date().toISOString()}"`,
  formatters: {
    level: (label) => ({ level: label })
  }
});

// Usage
logger.info({
  orderId: 'ord-123',
  userId: 'user-456',
  amount: 99.99
}, 'Order created');

// Java with SLF4J + Logback
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;

public class OrderService {
    private static final Logger logger = LoggerFactory.getLogger(OrderService.class);

    public void createOrder(Order order) {
        try {
            MDC.put("orderId", order.getId());
            MDC.put("userId", order.getUserId());

            logger.info("Order created: amount={} currency={}",
                order.getAmount(),
                order.getCurrency());
        } finally {
            MDC.clear();
        }
    }
}

26.4 What to Log

Request/Response Logging

# Good request logging
logger.info("api_request",
    method="POST",
    path="/api/orders",
    query={"status": "pending"},
    user_id=user_id,
    request_id=request_id,
    content_length=1024
)

# Response logging
logger.info("api_response",
    method="POST",
    path="/api/orders",
    status=201,
    duration_ms=150,
    response_order_id="ord-123"
)

Error Logging

# Always include context!
try:
    result = process_payment(order)
except PaymentException as e:
    logger.error("payment_failed",
        order_id=order.id,
        amount=order.amount,
        error_code=e.code,
        error_message=str(e),
        user_id=order.user_id,
        card_last_four=order.card_last_four,
        # Don't log: full card number, CVV, password!
        exc_info=True  # Includes stack trace
    )
    raise

Business Events

# Log important business events
logger.info("user_registered",
    user_id=user.id,
    email=user.email,
    referrer=user.referrer,
    signup_method="google"
)

logger.info("order_placed",
    order_id=order.id,
    user_id=order.user_id,
    item_count=len(order.items),
    total=order.total,
    coupon_used=order.coupon_code
)

logger.info("subscription_created",
    user_id=user.id,
    plan="premium",
    billing_cycle="monthly",
    amount=19.99
)

26.5 What NOT to Log

    CRITICAL: Never Log These!
    =========================

    ┌─────────────────────────────────────────────────────────────┐
    │  PASSWORD/SECRETS                                          │
    │  ─────────────────────────────────────────────────────────│
    │  ✗ User passwords                                         │
    │  ✗ API keys and tokens                                     │
    │  ✗ OAuth secrets                                           │
    │  ✗ Database connection strings                             │
    │  ✗ Private keys                                            │
    │                                                             │
    │  ✓ Log: "Login attempt" (without password)                │
    │  ✓ Log: "API call failed" (without keys)                  │
    └─────────────────────────────────────────────────────────────┘

    ┌─────────────────────────────────────────────────────────────┐
    │  PERSONAL IDENTIFIABLE INFORMATION (PII)                   │
    │  ─────────────────────────────────────────────────────────│
    │  ✗ Full credit card numbers                                │
    │  ✗ Social Security Numbers                                  │
    │  ✗ Phone numbers (unless necessary)                        │
    │  ✗ Home addresses                                          │
    │  ✗ Medical information                                     │
    │                                                             │
    │  ✓ Log: "Payment processed" (not card number)              │
    │  ✓ Log: User ID (not full name if not needed)              │
    └─────────────────────────────────────────────────────────────┘

    ┌─────────────────────────────────────────────────────────────┐
    │  SENSITIVE BUSINESS DATA                                    │
    │  ─────────────────────────────────────────────────────────│
    │  ✗ Full request/response bodies (can contain secrets)     │
    │  ✗ Customer data in bulk                                   │
    │  ✗ Internal system details that aid attackers             │
    └─────────────────────────────────────────────────────────────┘

Data Masking

import re

def mask_sensitive_data(message: str) -> str:
    """Mask sensitive data in log messages"""
    # Mask credit cards
    message = re.sub(r'\b\d{13,16}\b', '****-****-****-****', message)
    # Mask SSN
    message = re.sub(r'\b\d{3}-\d{2}-\d{4}\b', '***-**-****', message)
    # Mask API keys (common patterns)
    message = re.sub(r'(api[_-]?key["\']?\s*[:=]\s*["\']?)\w+', r'\1***', message)
    return message

26.6 Log Aggregation Architecture

    Log Aggregation Pipeline
    ========================

    ┌─────────────────────────────────────────────────────────────┐
    │  Application Servers                                        │
    │  ┌──────────┐  ┌──────────┐  ┌──────────┐                │
    │  │ Server 1 │  │ Server 2 │  │ Server N │                │
    │  │ app.log  │  │ app.log  │  │ app.log  │                │
    │  └────┬─────┘  └────┬─────┘  └────┬─────┘                │
    │       │             │             │                        │
    └───────┼─────────────┼─────────────┼────────────────────────┘
            │             │             │
            ▼             ▼             ▼
    ┌─────────────────────────────────────────────────────────────┐
    │  Log Shippers (Filebeat, Fluentd, CloudWatch Agent)        │
    │  ─────────────────────────────────────────────────────────│
    │  • Read log files                                          │
    │  • Parse/transform                                          │
    │  • Send to centralized system                              │
    └─────────────────────────────────────────────────────────────┘
            │
            ▼
    ┌─────────────────────────────────────────────────────────────┐
    │  Log Storage & Analysis                                    │
    │  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐        │
    │  │    ELK      │  │   Splunk   │  │  CloudWatch │        │
    │  │  Stack     │  │             │  │   Logs      │        │
    │  │(Elastic)  │  │             │  │             │        │
    │  └─────────────┘  └─────────────┘  └─────────────┘        │
    │                                                             │
    │  • Search & Query                                          │
    │  • Dashboards                                               │
    │  • Alerts                                                   │
    └─────────────────────────────────────────────────────────────┘

Popular Log Aggregation Tools

Tool	Type	Pros	Cons
ELK Stack	Open Source	Free, flexible	Complex setup
Splunk	Commercial	Powerful, enterprise	Expensive
CloudWatch	AWS Managed	Native AWS	AWS-only
DataDog	SaaS	Great UI	Cost
Sumo Logic	SaaS	Security focus	Pricing
Grafana + Loki	Open Source	Cost effective	Newer

26.7 Best Practices Summary

    Logging Best Practices Checklist
    ===============================

    ✓ Use structured JSON logging
    ✓ Include consistent context (request ID, user ID)
    ✓ Use appropriate log levels
    ✓ Log errors with full context and stack traces
    ✓ Never log passwords, secrets, or PII
    ✓ Implement log rotation
    ✓ Centralize logs for searching
    ✓ Set up alerts on error thresholds
    ✓ Monitor log volume to control costs
    ✓ Use correlation IDs for request tracing

    ─────────────────────────────────────────

    Log Volume Guidelines:
    ─────────────────────
    • INFO: Expected operations
    • WARN: Things that need attention
    • ERROR: Failures (always)
    • DEBUG: Enable in dev, selective in prod

    Target: 100MB - 1GB per instance per day
    (varies by application)

Summary

Log levels - FATAL > ERROR > WARN > INFO > DEBUG > TRACE
Structured logging - Use JSON format
Context - Always include request ID, user ID
Security - Never log secrets or PII
Centralize - Aggregate in ELK, Splunk, or cloud service
Correlation - Use trace/request ID to track requests
Rotation - Prevent disk overflow

Next: Chapter 27: Monitoring & Metrics