Monitoring
Chapter 27: Monitoring & Metrics
Understanding and Measuring System Behavior
27.1 The Three Pillars of Observability
Observability is the ability to understand your system’s internal state by examining its external outputs.
    The Three Pillars of Observability
    ==================================

    ┌─────────────────────────────────────────────────────────────┐
    │                                                             │
    │                        OBSERVABILITY                        │
    │                                                             │
    │   ┌─────────────────┐          ┌─────────────────┐          │
    │   │     METRICS     │          │      LOGS       │          │
    │   │                 │          │                 │          │
    │   │  Numbers over   │          │ Discrete events │          │
    │   │  time           │          │ over time       │          │
    │   │                 │          │                 │          │
    │   │  CPU: 65%       │          │ "User logged    │          │
    │   │  Req/s: 500     │          │  in"            │          │
    │   │  Latency: 120ms │          │ "Order placed"  │          │
    │   │                 │          │                 │          │
    │   └────────┬────────┘          └────────┬────────┘          │
    │            │                            │                   │
    │            └─────────────┬──────────────┘                   │
    │                          │                                  │
    │                          ▼                                  │
    │               ┌─────────────────────┐                       │
    │               │       TRACES        │                       │
    │               │                     │                       │
    │               │  Request path       │                       │
    │               │  through services   │                       │
    │               │                     │                       │
    │               │  API ──► DB ──► Auth│                       │
    │               └─────────────────────┘                       │
    │                                                             │
    └─────────────────────────────────────────────────────────────┘

How They Work Together
    Example: User reports slow checkout
    ===================================

    METRICS: "Checkout latency is 8 seconds (should be <2s)"
        ↓
    LOGS:    "Database query took 7.5 seconds"
        ↓
    TRACES:  POST /checkout (8s)
             ├── validate_cart (100ms)
             ├── check_inventory (200ms)
             ├── calculate_total (50ms)
             └── query_orders (7.5s)  ← HERE!
                 └── SELECT * FROM orders WHERE... (7.4s)

27.2 Key Metrics: RED and USE
27.2.1 RED Metrics (Rate, Errors, Duration)
Perfect for service-level monitoring:
    RED Metrics
    ===========

    ┌─────────────────────────────────────────────────────────────┐
    │ RATE - Requests per second                                  │
    │ ─────────────────────────────────────────────────────────── │
    │                                                             │
    │ Question: How much traffic?                                 │
    │                                                             │
    │ Example:                                                    │
    │   • 1,000 requests/second                                   │
    │   • Peak: 5,000 RPS                                         │
    │                                                             │
    │ Alert: > 80% of capacity                                    │
    └─────────────────────────────────────────────────────────────┘

    ┌─────────────────────────────────────────────────────────────┐
    │ ERRORS - Error rate (usually %)                             │
    │ ─────────────────────────────────────────────────────────── │
    │                                                             │
    │ Question: Is the service working?                           │
    │                                                             │
    │ Example:                                                    │
    │   • 0.5% error rate                                         │
    │   • 5 errors per minute                                     │
    │                                                             │
    │ Alert: > 1% errors (5xx)                                    │
    └─────────────────────────────────────────────────────────────┘

    ┌─────────────────────────────────────────────────────────────┐
    │ DURATION - Response time distribution                       │
    │ ─────────────────────────────────────────────────────────── │
    │                                                             │
    │ Question: How fast?                                         │
    │                                                             │
    │ Metrics:                                                    │
    │   • p50 (median): 100ms                                     │
    │   • p95: 500ms                                              │
    │   • p99: 2s                                                 │
    │                                                             │
    │ Alert: p99 > 2 seconds                                      │
    └─────────────────────────────────────────────────────────────┘

27.2.2 USE Metrics (Utilization, Saturation, Errors)
Perfect for resource monitoring:

    USE Metrics
    ===========

    ┌─────────────────────────────────────────────────────────────┐
    │ UTILIZATION - How busy is the resource?                     │
    │ ─────────────────────────────────────────────────────────── │
    │                                                             │
    │ CPU:     70% busy                                           │
    │ Memory:  4GB / 8GB used                                     │
    │ Disk:    50% used                                           │
    │ Network: 100 Mbps / 1 Gbps                                  │
    │                                                             │
    │ ⚠️ Warning: > 70% sustained                                 │
    │ 🔴 Alert: > 90%                                             │
    └─────────────────────────────────────────────────────────────┘

    ┌─────────────────────────────────────────────────────────────┐
    │ SATURATION - How much work is queued or waiting?            │
    │ ─────────────────────────────────────────────────────────── │
    │                                                             │
    │ CPU:     Queue length, load average                         │
    │ Memory:  Available, swap usage                              │
    │ Disk:    Queue depth, IOPS available                        │
    │ Network: Connection limits                                  │
    │                                                             │
    │ ⚠️ Warning: Near capacity                                   │
    │ 🔴 Alert: Saturated                                         │
    └─────────────────────────────────────────────────────────────┘

    ┌─────────────────────────────────────────────────────────────┐
    │ ERRORS - Internal errors                                    │
    │ ─────────────────────────────────────────────────────────── │
    │                                                             │
    │ CPU:     CPU temperature, throttling                        │
    │ Memory:  OOM kills, allocation failures                     │
    │ Disk:    I/O errors, filesystem errors                      │
    │                                                             │
    │ Always alert on any errors!                                 │
    └─────────────────────────────────────────────────────────────┘

Which to Use When
| Scenario | Recommended metrics |
|---|---|
| API Service | RED metrics |
| Database | USE metrics |
| Queue | USE + pending message count |
| Cache | Hit rate, evictions |
| Load Balancer | Connections, throughput |
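
The RED definitions from 27.2.1 can be made concrete with a small calculation. The sketch below is illustrative only: the `Request` record shape, the `red_summary` helper, and the nearest-rank percentile method are assumptions for this example, not part of any monitoring library. It treats any 5xx status as an error, matching the alert rule above.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    """One completed request (hypothetical record shape for this sketch)."""
    duration_s: float
    status: int


def percentile(sorted_values, q):
    """Nearest-rank percentile of an ascending list, with q in (0, 1]."""
    rank = max(1, math.ceil(q * len(sorted_values)))
    return sorted_values[rank - 1]


def red_summary(requests, window_s):
    """Compute Rate, Errors, and Duration for one observation window."""
    if not requests:
        raise ValueError("no requests in window")
    durations = sorted(r.duration_s for r in requests)
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "rate_rps": len(requests) / window_s,
        "error_pct": 100.0 * errors / len(requests),
        "p50_s": percentile(durations, 0.50),
        "p95_s": percentile(durations, 0.95),
        "max_s": percentile(durations, 1.0),
    }


# 100 requests in a 10-second window: 99 fast successes and one slow 5xx.
sample = [Request(0.1, 200)] * 99 + [Request(2.0, 503)]
summary = red_summary(sample, window_s=10)
print(summary)
```

In production these numbers normally come from counters and histograms in the metrics backend rather than from raw request lists; the sketch only shows what each RED figure means.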
27.3 Types of Metrics
Counters
    Counter - Only goes up (resets to 0 when the process restarts)
    ==============================================================

    Examples:
      • Total requests
      • Total errors
      • Orders placed

    Use for: Rates (calculate per second)

    Prometheus:
    ──────────────
    http_requests_total 1523456
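
A raw counter value says little on its own; what matters is how fast it grows. The sketch below shows the idea behind a per-second rate computed from successive scrapes, including the usual handling of counter resets. It is a simplified illustration, not Prometheus's implementation (which also extrapolates to the edges of the query window), and the `counter_rate` helper and the sample values are hypothetical.

```python
def counter_rate(samples):
    """Per-second rate from (timestamp_s, value) counter samples, oldest first.

    A drop in value is treated as a reset to zero, so the whole post-reset
    value is counted as new increments.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        increase += curr - prev if curr >= prev else curr
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


# Counter scraped every 15 s; the process restarts before the last scrape.
samples = [(0, 1000), (15, 1600), (30, 2200), (45, 300)]
print(counter_rate(samples))  # (600 + 600 + 300) increments over 45 seconds
```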
    Rate calculation:
    rate(http_requests_total[5m]) = requests/second, averaged over the last 5 minutes

Gauges
    Gauge - Goes up and down
    ========================

    Examples:
      • Current CPU %
      • Memory used
      • Number of connections
      • Queue depth

    Use for: Current state snapshots

    Prometheus:
    ──────────────
    cpu_usage_percent 65
    memory_used_bytes 4294967296

Histograms
    Histogram - Distribution of values
    ==================================

    Examples:
      • Request duration
      • Response size
      • Query time

    Use for: Calculating percentiles

    Prometheus (bucket counts are cumulative):
    ──────────────
    http_request_duration_seconds_bucket{le="0.1"} 1000
    http_request_duration_seconds_bucket{le="0.5"} 5000
    http_request_duration_seconds_bucket{le="1.0"} 8000
    http_request_duration_seconds_bucket{le="+Inf"} 10000
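
Because the buckets are cumulative, a percentile can only be estimated, typically by assuming values are spread evenly inside the bucket that contains the target rank. The sketch below illustrates that linear interpolation on the bucket counts shown above. It is a simplified stand-in for the idea, not the Prometheus source; the function and its argument layout are assumptions made for this example.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by upper bound and end with math.inf. Values are
    assumed to be evenly spread inside each bucket (linear interpolation).
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # No finite upper edge: fall back to the last finite bound.
                return lower_bound
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


buckets = [(0.1, 1000), (0.5, 5000), (1.0, 8000), (math.inf, 10000)]
print(histogram_quantile(0.50, buckets))  # rank 5000 sits at the top of the 0.5 s bucket
print(histogram_quantile(0.95, buckets))  # rank 9500 falls in +Inf, so 1.0 is reported
```

The second result shows why bucket boundaries matter: with no finite bucket above 1.0 s, any percentile that lands in the `+Inf` bucket is capped at the last finite bound.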
    Calculate percentiles:
    histogram_quantile(0.95, rate(...))

27.4 Metrics Collection Architecture
    Metrics Collection Pipeline
    ===========================

    ┌─────────────────────────────────────────────────────────────┐
    │                        Applications                         │
    │  ┌──────────┐    ┌──────────┐    ┌──────────┐               │
    │  │ Service A│    │ Service B│    │ Service C│               │
    │  │          │    │          │    │          │               │
    │  │ Metrics  │    │ Metrics  │    │ Metrics  │               │
    │  │ Library  │    │ Library  │    │ Library  │               │
    │  └────┬─────┘    └────┬─────┘    └────┬─────┘               │
    │       │               │               │                     │
    └───────┼───────────────┼───────────────┼─────────────────────┘
            │               │               │
            ▼               ▼               ▼
    ┌─────────────────────────────────────────────────────────────┐
    │                  Exporters (Pull or Push)                   │
    │                                                             │
    │   PULL: Prometheus scrapes /metrics endpoint                │
    │   PUSH: App pushes to Pushgateway, which Prometheus scrapes │
    └─────────────────────────────────────────────────────────────┘
                                   │
                                   ▼
    ┌─────────────────────────────────────────────────────────────┐
    │                    Time Series Database                     │
    │                                                             │
    │  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐          │
    │  │ Prometheus  │  │  InfluxDB   │  │ CloudWatch  │          │
    │  │             │  │             │  │             │          │
    │  └─────────────┘  └─────────────┘  └─────────────┘          │
    └─────────────────────────────────────────────────────────────┘
                                   │
                                   ▼
    ┌─────────────────────────────────────────────────────────────┐
    │                        Visualization                        │
    │                                                             │
    │  ┌─────────────┐  ┌─────────────┐                           │
    │  │   Grafana   │  │   Datadog   │                           │
    │  │             │  │             │                           │
    │  └─────────────┘  └─────────────┘                           │
    └─────────────────────────────────────────────────────────────┘

Implementing Metrics
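
In the pull model, whatever metrics library a service uses ends up serving plain text at its `/metrics` endpoint. The hand-rolled sketch below renders one metric family in the Prometheus text exposition format, purely to show what a scrape returns. The `render_metric` helper is hypothetical and omits label-value escaping; real services should rely on a client library instead.

```python
def render_metric(name, metric_type, help_text, samples):
    """Render one metric family in the Prometheus text exposition format.

    samples is a list of (labels_dict, value) pairs. Label values are not
    escaped here, which a real implementation would need to do.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        if labels:
            pairs = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            lines.append(f"{name}{{{pairs}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


body = render_metric(
    "http_requests_total",
    "counter",
    "Total HTTP requests",
    [({"method": "GET", "status": "200"}, 1523456)],
)
print(body)
```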
    # Python with Prometheus Client (Flask app)
    from flask import Flask
    from prometheus_client import Counter, Histogram, Gauge

    app = Flask(__name__)

    # Counter - for rates
    http_requests_total = Counter(
        'http_requests_total',
        'Total HTTP requests',
        ['method', 'endpoint', 'status']
    )

    # Histogram - for durations
    http_request_duration = Histogram(
        'http_request_duration_seconds',
        'HTTP request duration',
        ['method', 'endpoint'],
        buckets=[0.01, 0.05, 0.1, 0.5, 1.0, 5.0]
    )

    # Gauge - for current values
    active_connections = Gauge(
        'active_connections',
        'Number of active connections'
    )

    # Using in code
    @app.route('/api/users')
    def get_users():
        # time() records the duration of the block as one histogram observation
        with http_request_duration.labels(method='GET', endpoint='/api/users').time():
            users = db.query(User)  # db and User come from the application's data layer

        # Only the success path is counted here; record error statuses in an error handler
        http_requests_total.labels(method='GET', endpoint='/api/users', status=200).inc()
        return users

    // Node.js with Prometheus Client (Express app)
    const express = require('express');
    const promClient = require('prom-client');

    const app = express();

    const httpRequestsTotal = new promClient.Counter({
      name: 'http_requests_total',
      help: 'Total HTTP requests',
      labelNames: ['method', 'endpoint', 'status']
    });

    const httpRequestDuration = new promClient.Histogram({
      name: 'http_request_duration_seconds',
      help: 'Request duration in seconds',
      labelNames: ['method', 'endpoint'],
      buckets: [0.01, 0.05, 0.1, 0.5, 1, 5]
    });

    // Middleware
    app.use((req, res, next) => {
      const start = Date.now();
      res.on('finish', () => {
        const duration = (Date.now() - start) / 1000;
        // Prefer the route pattern (e.g. /users/:id) over the raw path
        // so the endpoint label does not grow without bound
        const endpoint = req.route ? req.route.path : req.path;
        httpRequestsTotal.inc({ method: req.method, endpoint, status: res.statusCode });
        httpRequestDuration.observe({ method: req.method, endpoint }, duration);
      });
      next();
    });

27.5 Important Application Metrics
Business Metrics
Business metrics (what matters to the business):

| Metric | Description |
|---|---|
| Orders/min | Revenue rate |
| Conversion rate | % visitors who buy |
| Active users | DAU/MAU |
| Signup rate | User growth |
| Cart abandonment | Checkout flow health |
| API usage | Resource consumption |
| Error budget | SLO burn rate |

Infrastructure Metrics

| Metric | Description |
|---|---|
| CPU % | Compute utilization |
| Memory % | RAM utilization |
| Disk % | Storage utilization |
| Disk I/O | Read/write operations |
| Network I/O | Bandwidth usage |
| Load average | System load |
| Open files | File descriptor usage |

Application Metrics

| Metric | Description |
|---|---|
| Request rate | Throughput (RPS) |
| Error rate | % of 5xx responses |
| Latency p50/95/99 | Response time distribution |
| Queue depth | Pending work |
| Cache hit rate | Cache effectiveness |
| DB query time | Database performance |
| Connection pool | DB connection usage |
| Thread pool | Thread utilization |

27.6 Service Level Objectives (SLOs)
    SLO Definition
    ==============

    SLO = Target level of reliability, measured over a time window

    Examples:
    ─────────────────────────────────────────

    "99.9% of requests succeed"
    ─────────────────────────
    SLO: 99.9% availability
    Error budget: 0.1% of requests may fail
                  (equivalent to 43.2 min of full downtime in a 30-day month)

    "95% of requests respond in < 500ms"
    ──────────────────────────────────────
    SLO: p95 latency < 500ms
    Breached if p95 > 500ms

    "99% of requests respond in < 2s"
    ──────────────────────────────────
    SLO: p99 latency < 2s
    Breached if p99 > 2s

Error Budgets
    Error Budget Calculation
    ========================

    SLO: 99.9% (monthly)
    ─────────────────────

    Total minutes in month: 43,200 (30 days)
    Allowed downtime: 0.1% = 43.2 minutes

    If you use 43 minutes of downtime:
    ⚠️ Budget almost exhausted (0.2 minutes left). Past 43.2 minutes the SLO
    is breached: stop feature releases and fix reliability issues.
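
The same arithmetic applies to any availability target. The sketch below generalizes it, assuming a 30-day window by default; the helper names `error_budget_minutes` and `budget_remaining` are made up for this example.

```python
def error_budget_minutes(slo_percent, window_days=30):
    """Minutes of allowed downtime for an availability SLO over the window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (100.0 - slo_percent) / 100.0


def budget_remaining(slo_percent, downtime_minutes, window_days=30):
    """Fraction of the error budget still unspent (negative once breached)."""
    budget = error_budget_minutes(slo_percent, window_days)
    return (budget - downtime_minutes) / budget


for slo in (99.0, 99.9, 99.99):
    print(f"{slo}% -> {error_budget_minutes(slo):.2f} min per 30 days")

left = budget_remaining(99.9, downtime_minutes=43.0)
print(f"{left:.1%} of the 99.9% budget left after 43 min of downtime")
```

A negative result from `budget_remaining` means the SLO is already breached for the window.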
    SLO: 99.99% (monthly)
    ─────────────────────

    Total: 43,200 minutes
    Allowed: 0.01% = 4.32 minutes

    Much stricter! Only about 4 minutes allowed.

27.7 Dashboards
    Dashboard Structure
    ===================

    ┌─────────────────────────────────────────────────────────────┐
    │                      Service Dashboard                      │
    ├─────────────────────────────────────────────────────────────┤
    │                                                             │
    │  ┌────────────────────┐    ┌────────────────────┐           │
    │  │ Request Rate       │    │ Error Rate         │           │
    │  │ ████████████       │    │ █                  │           │
    │  │ 500 RPS            │    │ 0.1%               │           │
    │  └────────────────────┘    └────────────────────┘           │
    │                                                             │
    │  ┌────────────────────┐    ┌────────────────────┐           │
    │  │ Latency (p95)      │    │ CPU Usage          │           │
    │  │ ████████           │    │ ████████           │           │
    │  │ 450ms              │    │ 65%                │           │
    │  └────────────────────┘    └────────────────────┘           │
    │                                                             │
    │  ┌──────────────────────────────────────────┐               │
    │  │ Error Breakdown by Type                  │               │
    │  │ ████████████████░░░░ 80% Timeout         │               │
    │  │ ████░░░░░░░░░░░░░░░░ 20% DB              │               │
    │  └──────────────────────────────────────────┘               │
    │                                                             │
    └─────────────────────────────────────────────────────────────┘

Summary
- Three pillars - Metrics, logs, traces work together
- RED metrics - Rate, Errors, Duration (for services)
- USE metrics - Utilization, Saturation, Errors (for resources)
- Counters - Cumulative, for rates
- Gauges - Point-in-time values
- Histograms - Distributions, for percentiles
- SLOs - Target reliability levels
- Error budgets - Allowed failure allocation
Next: Chapter 28: Alerting