
Monitoring

Understanding and Measuring System Behavior


Observability is the ability to understand your system’s internal state by examining its external outputs.

The Three Pillars of Observability
=================================
┌─────────────────────────────────────────────────────────────┐
│ │
│ OBSERVABILITY │
│ │
│ ┌─────────────────┐ ┌─────────────────┐ │
│ │ METRICS │ │ LOGS │ │
│ │ │ │ │ │
│ │ Numbers over │ │ Discrete events│ │
│ │ time │ │ over time │ │
│ │ │ │ │ │
│ │ CPU: 65% │ │ "User logged │ │
│ │ Req/s: 500 │ │ in" │ │
│ │ Latency: 120ms │ │ "Order placed" │ │
│ │ │ │ │ │
│ └────────┬────────┘ └────────┬────────┘ │
│ │ │ │
│ └──────────┬─────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────┐ │
│ │ TRACES │ │
│ │ │ │
│ │ Request path │ │
│ │ through services │ │
│ │ │ │
│ │ API ──► DB ──► Auth│ │
│ └─────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
Example: User reports slow checkout
==================================
METRICS: "Checkout latency is 8 seconds (should be <2s)"
LOGS: "Database query took 7.5 seconds"
TRACES:
POST /checkout (8s)
├── validate_cart (100ms)
├── check_inventory (200ms)
├── calculate_total (50ms)
└── query_orders (7.5s) ← HERE!
└── SELECT * FROM orders WHERE... (7.4s)
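The nested timings in a trace can be mimicked with a toy span recorder. This is an illustrative sketch, not a real tracing library (the `span` helper and the simulated operations are invented for the example; real systems use OpenTelemetry or similar):

```python
import time
from contextlib import contextmanager

# Toy span recorder: each finished span stores (name, depth, seconds),
# mimicking how a trace decomposes one request into nested timed steps.
spans = []
_depth = 0

@contextmanager
def span(name):
    global _depth
    start = time.perf_counter()
    _depth += 1
    try:
        yield
    finally:
        _depth -= 1
        spans.append((name, _depth, time.perf_counter() - start))

# Simulated checkout request (sleeps stand in for real work)
with span("POST /checkout"):
    with span("validate_cart"):
        time.sleep(0.01)
    with span("query_orders"):
        time.sleep(0.02)

# The slowest child span points at the bottleneck
slowest = max((s for s in spans if s[1] > 0), key=lambda s: s[2])
print(slowest[0])  # query_orders
```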

27.2.1 RED Metrics (Rate, Errors, Duration)


Perfect for service-level monitoring:

RED Metrics
==========
┌─────────────────────────────────────────────────────────────┐
│ RATE - Requests per second │
│ ─────────────────────────────────────────────────────────│
│ │
│ Questions: How much traffic? │
│ │
│ Example: │
│ • 1,000 requests/second │
│ • Peak: 5,000 RPS │
│ │
│ Alert: > 80% of capacity │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ ERRORS - Error rate (usually %) │
│ ─────────────────────────────────────────────────────────│
│ │
│ Questions: Is the service working? │
│ │
│ Example: │
│ • 0.5% error rate │
│ • 5 errors per minute │
│ │
│ Alert: > 1% errors (5xx) │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ DURATION - Response time distribution │
│ ─────────────────────────────────────────────────────────│
│ │
│ Questions: How fast? │
│ │
│ Metrics: │
│ • p50 (median): 100ms │
│ • p95: 500ms │
│ • p99: 2s │
│ │
│ Alert: p99 > 2 seconds │
└─────────────────────────────────────────────────────────────┘

27.2.2 USE Metrics (Utilization, Saturation, Errors)


Perfect for resource monitoring:

USE Metrics
==========
┌─────────────────────────────────────────────────────────────┐
│ UTILIZATION - How busy is the resource? │
│ ─────────────────────────────────────────────────────────│
│ │
│ CPU: 70% busy │
│ Memory: 4GB / 8GB used │
│ Disk: 50% used │
│ Network: 100 Mbps / 1 Gbps │
│ │
│ ⚠️ Warning: > 70% sustained │
│ 🔴 Alert: > 90% │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ SATURATION - How full is the resource? │
│ ─────────────────────────────────────────────────────────│
│ │
│ CPU: Queue length, load average │
│ Memory: Available, swap usage │
│ Disk: Queue depth, IOPS available │
│ Network: Connection limits │
│ │
│ ⚠️ Warning: Near capacity │
│ 🔴 Alert: Saturated │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ ERRORS - Internal errors │
│ ─────────────────────────────────────────────────────────│
│ │
│ CPU: CPU temperature, throttling │
│ Memory: OOM kills, allocation failures │
│ Disk: I/O errors, filesystem errors │
│ │
│ Always alert on any errors! │
└─────────────────────────────────────────────────────────────┘
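A rough USE snapshot can be taken with the Python standard library alone (a sketch, Unix-only because of `os.getloadavg`; the 70%/90% thresholds mirror the table above, and production setups would use an exporter such as node_exporter instead):

```python
import os
import shutil

# UTILIZATION: how full the disk is
disk = shutil.disk_usage("/")
disk_pct = disk.used / disk.total * 100

# SATURATION: 1-minute load average per core —
# above 1.0 means runnable work is queueing for CPU
load_1m, _, _ = os.getloadavg()
cores = os.cpu_count() or 1
saturation = load_1m / cores

print(f"disk={disk_pct:.0f}%  load/core={saturation:.2f}")
if disk_pct > 90 or saturation > 1.0:
    print("ALERT")
elif disk_pct > 70 or saturation > 0.7:
    print("WARNING")
```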
┌──────────────────┬──────────────────────────────────────────┐
│ Scenario         │ Use                                      │
├──────────────────┼──────────────────────────────────────────┤
│ API Service      │ RED metrics                              │
│ Database         │ USE metrics                              │
│ Queue            │ USE + pending message count              │
│ Cache            │ Hit rate, evictions                      │
│ Load Balancer    │ Connections, throughput                  │
└──────────────────┴──────────────────────────────────────────┘

Counter - Only goes up
=====================
Example:
• Total requests
• Total errors
• Orders placed
Use for: Rates (calculate per second)
Prometheus:
──────────────
http_requests_total 1523456
Rate calculation:
rate(http_requests_total[5m]) = requests/second
Gauge - Goes up and down
======================
Example:
• Current CPU %
• Memory used
• Number of connections
• Queue depth
Use for: Current state snapshots
Prometheus:
──────────────
cpu_percent 65
memory_used_bytes 4294967296
Histogram - Distribution of values
=================================
Example:
• Request duration
• Response size
• Query time
Use for: Calculating percentiles
Prometheus:
──────────────
http_request_duration_seconds_bucket{le="0.1"} 1000
http_request_duration_seconds_bucket{le="0.5"} 5000
http_request_duration_seconds_bucket{le="1.0"} 8000
http_request_duration_seconds_bucket{le="+Inf"} 10000
Calculate percentiles:
histogram_quantile(0.95, rate(...))
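What `histogram_quantile` does can be approximated in a few lines: walk the cumulative buckets to the one containing the target rank, then interpolate linearly inside it. A simplified sketch using the bucket counts above (`estimate_quantile` is a hypothetical helper, not part of any library):

```python
# Cumulative histogram buckets: (upper bound in seconds,
# cumulative count of observations <= that bound)
buckets = [(0.1, 1000), (0.5, 5000), (1.0, 8000), (float("inf"), 10000)]

def estimate_quantile(q, buckets):
    """Roughly what histogram_quantile does: find the bucket holding
    the q-th observation and linearly interpolate inside it."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if rank <= count:
            if bound == float("inf"):
                return prev_bound  # can't interpolate into +Inf
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + fraction * (bound - prev_bound)
        prev_bound, prev_count = bound, count

print(estimate_quantile(0.95, buckets))  # 1.0
```

Note the trade-off this exposes: the estimate is only as precise as the bucket boundaries, which is why choosing buckets near your SLO thresholds matters.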

Metrics Collection Pipeline
=========================
┌─────────────────────────────────────────────────────────────┐
│ Applications │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Service A│ │ Service B│ │ Service C│ │
│ │ │ │ │ │ │ │
│ │ Metrics │ │ Metrics │ │ Metrics │ │
│ │ Library │ │ Library │ │ Library │ │
│ └────┬─────┘ └────┬─────┘ └────┬─────┘ │
│ │ │ │ │
└───────┼─────────────┼─────────────┼────────────────────────┘
│ │ │
▼ ▼ ▼
┌─────────────────────────────────────────────────────────────┐
│ Exporters (Pull or Push) │
│ │
│ PULL: Prometheus scrapes /metrics endpoint │
│ PUSH: App pushes to Pushgateway │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ Time Series Database │
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Prometheus │ │ InfluxDB │ │ CloudWatch │ │
│ │ │ │ │ │ │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ Visualization │
│ │
│ ┌─────────────┐ ┌─────────────┐ │
│ │ Grafana │ │ Datadog │ │
│ │ │ │ │ │
│ └─────────────┘ └─────────────┘ │
└─────────────────────────────────────────────────────────────┘
# Python with Prometheus Client
from prometheus_client import Counter, Histogram, Gauge

# Counter - for rates
http_requests_total = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

# Histogram - for durations
http_request_duration = Histogram(
    'http_request_duration_seconds',
    'HTTP request duration',
    ['method', 'endpoint'],
    buckets=[0.01, 0.05, 0.1, 0.5, 1.0, 5.0]
)

# Gauge - for current values
active_connections = Gauge(
    'active_connections',
    'Number of active connections'
)

# Using in code
@app.route('/api/users')
def get_users():
    with http_request_duration.labels(method='GET', endpoint='/api/users').time():
        users = db.query(User)
    http_requests_total.labels(method='GET', endpoint='/api/users', status=200).inc()
    return users
// Node.js with Prometheus Client
const promClient = require('prom-client');

const httpRequestsTotal = new promClient.Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'endpoint', 'status']
});

const httpRequestDuration = new promClient.Histogram({
  name: 'http_request_duration_seconds',
  help: 'Request duration in seconds',
  labelNames: ['method', 'endpoint'],
  buckets: [0.01, 0.05, 0.1, 0.5, 1, 5]
});

// Middleware
app.use((req, res, next) => {
  const start = Date.now();
  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000;
    httpRequestsTotal.inc({ method: req.method, endpoint: req.path, status: res.statusCode });
    httpRequestDuration.observe({ method: req.method, endpoint: req.path }, duration);
  });
  next();
});

Business Metrics (What matters to business)
=========================================
┌──────────────────┬──────────────────────────────────────────┐
│ Metric │ Description │
├──────────────────┼──────────────────────────────────────────┤
│ Orders/min │ Revenue rate │
│ Conversion rate │ % visitors who buy │
│ Active users │ DAU/MAU │
│ Signup rate │ User growth │
│ Cart abandonment │ Checkout flow health │
│ API usage │ Resource consumption │
│ Error budget │ SLO burn rate │
└──────────────────┴──────────────────────────────────────────┘
Infrastructure Metrics
=====================
┌──────────────────┬──────────────────────────────────────────┐
│ Metric │ Description │
├──────────────────┼──────────────────────────────────────────┤
│ CPU % │ Compute utilization │
│ Memory % │ RAM utilization │
│ Disk % │ Storage utilization │
│ Disk I/O │ Read/write operations │
│ Network I/O │ Bandwidth usage │
│ Load average │ System load │
│ Open files │ File descriptor usage │
└──────────────────┴──────────────────────────────────────────┘
Application Metrics
==================
┌──────────────────┬──────────────────────────────────────────┐
│ Metric │ Description │
├──────────────────┼──────────────────────────────────────────┤
│ Request rate │ Throughput (RPS) │
│ Error rate │ % of 5xx responses │
│ Latency p50/95/99│ Response time distribution │
│ Queue depth │ Pending work │
│ Cache hit rate │ Cache effectiveness │
│ DB query time │ Database performance │
│ Connection pool │ DB connection usage │
│ Thread pool │ Thread utilization │
└──────────────────┴──────────────────────────────────────────┘

SLO Definition
=============
SLO = Target level of reliability
Examples:
─────────────────────────────────────────
"99.9% of requests succeed"
─────────────────────────
SLO: 99.9% availability
Error budget: 0.1% = 43.2 min/month downtime allowed (30-day month)
"95% of requests respond in < 500ms"
──────────────────────────────────────
SLO: p95 latency < 500ms
Exceeded if p95 > 500ms
"99% of requests respond in < 2s"
──────────────────────────────────
SLO: p99 latency < 2s
Exceeded if p99 > 2s
Error Budget Calculation
=======================
SLO: 99.9% (monthly)
─────────────────────
Total minutes in month: 43,200 (30 days)
Allowed downtime: 0.1% = 43.2 minutes
If you use more than 43.2 minutes of downtime:
⚠️ SLO breached!
Must stop feature releases and fix issues.
──────────────────────────────────────
SLO: 99.99% (monthly)
─────────────────────
Total: 43,200 minutes
Allowed: 0.01% = 4.32 minutes
Much stricter! Only 4 minutes allowed.
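The budget arithmetic above generalizes to any availability target; a small sketch (`error_budget_minutes` is a hypothetical helper):

```python
def error_budget_minutes(slo_percent, days=30):
    """Allowed downtime per period for a given availability SLO."""
    total_minutes = days * 24 * 60  # 43,200 for a 30-day month
    return total_minutes * (100 - slo_percent) / 100

for slo in (99.0, 99.9, 99.99):
    print(f"{slo}%: {error_budget_minutes(slo):.2f} min/month")
```

Each extra nine cuts the budget by a factor of ten, which is why tightening an SLO is far more expensive than it looks.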

Dashboard Structure
==================
┌─────────────────────────────────────────────────────────────┐
│ Service Dashboard │
├─────────────────────────────────────────────────────────────┤
│ │
│ ┌────────────────────┐ ┌────────────────────┐ │
│ │ Request Rate │ │ Error Rate │ │
│ │ ████████████ │ │ █ │ │
│ │ 500 RPS │ │ 0.1% │ │
│ └────────────────────┘ └────────────────────┘ │
│ │
│ ┌────────────────────┐ ┌────────────────────┐ │
│ │ Latency (p95) │ │ CPU Usage │ │
│ │ ████████ │ │ ████████ │ │
│ │ 450ms │ │ 65% │ │
│ └────────────────────┘ └────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────┐ │
│ │ Error Breakdown by Type │ │
│ │ ████████████████████░░░ 80% Timeout │ │
│ │ █████░░░░░░░░░░░░░░░░░░ 20% DB │ │
│ └──────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘

  1. Three pillars - Metrics, logs, traces work together
  2. RED metrics - Rate, Errors, Duration (for services)
  3. USE metrics - Utilization, Saturation, Errors (for resources)
  4. Counters - Cumulative, for rates
  5. Gauges - Point-in-time values
  6. Histograms - Distributions, for percentiles
  7. SLOs - Target reliability levels
  8. Error budgets - Allowed failure allocation

Next: Chapter 28: Alerting