
Monitoring

Understanding and Measuring System Behavior


Observability is the ability to understand your system’s internal state by examining its external outputs.

The Three Pillars of Observability
=================================
┌─────────────────────────────────────────────────────────────┐
│ │
│ OBSERVABILITY │
│ │
│ ┌─────────────────┐ ┌─────────────────┐ │
│ │ METRICS │ │ LOGS │ │
│ │ │ │ │ │
│ │ Numbers over │ │ Discrete events│ │
│ │ time │ │ over time │ │
│ │ │ │ │ │
│ │ CPU: 65% │ │ "User logged │ │
│ │ Req/s: 500 │ │ in" │ │
│ │ Latency: 120ms │ │ "Order placed" │ │
│ │ │ │ │ │
│ └────────┬────────┘ └────────┬────────┘ │
│ │ │ │
│ └──────────┬─────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────┐ │
│ │ TRACES │ │
│ │ │ │
│ │ Request path │ │
│ │ through services │ │
│ │ │ │
│ │ API ──► DB ──► Auth│ │
│ └─────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
Example: User reports slow checkout
==================================
METRICS: "Checkout latency is 8 seconds (should be <2s)"
LOGS: "Database query took 7.5 seconds"
TRACES:
POST /checkout (8s)
├── validate_cart (100ms)
├── check_inventory (200ms)
├── calculate_total (50ms)
└── query_orders (7.5s) ← HERE!
└── SELECT * FROM orders WHERE... (7.4s)
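The nested timings in a trace can be mimicked with a toy span recorder. This is an illustrative sketch, not a real tracing library (the `span` helper and the simulated operations are invented for the example; real systems use OpenTelemetry or similar):

```python
import time
from contextlib import contextmanager

# Toy span recorder: each finished span stores (name, depth, seconds),
# mimicking how a trace decomposes one request into nested timed steps.
spans = []
_depth = 0

@contextmanager
def span(name):
    global _depth
    start = time.perf_counter()
    _depth += 1
    try:
        yield
    finally:
        _depth -= 1
        spans.append((name, _depth, time.perf_counter() - start))

# Simulated checkout request (sleeps stand in for real work)
with span("POST /checkout"):
    with span("validate_cart"):
        time.sleep(0.01)
    with span("query_orders"):
        time.sleep(0.02)

# The slowest child span points at the bottleneck
slowest = max((s for s in spans if s[1] > 0), key=lambda s: s[2])
print(slowest[0])  # query_orders
```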

27.2.1 RED Metrics (Rate, Errors, Duration)


Perfect for service-level monitoring:

RED Metrics
==========
┌─────────────────────────────────────────────────────────────┐
│ RATE - Requests per second │
│ ─────────────────────────────────────────────────────────│
│ │
│ Questions: How much traffic? │
│ │
│ Example: │
│ • 1,000 requests/second │
│ • Peak: 5,000 RPS │
│ │
│ Alert: > 80% of capacity │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ ERRORS - Error rate (usually %) │
│ ─────────────────────────────────────────────────────────│
│ │
│ Questions: Is the service working? │
│ │
│ Example: │
│ • 0.5% error rate │
│ • 5 errors per minute │
│ │
│ Alert: > 1% errors (5xx) │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ DURATION - Response time distribution │
│ ─────────────────────────────────────────────────────────│
│ │
│ Questions: How fast? │
│ │
│ Metrics: │
│ • p50 (median): 100ms │
│ • p95: 500ms │
│ • p99: 2s │
│ │
│ Alert: p99 > 2 seconds │
└─────────────────────────────────────────────────────────────┘

27.2.2 USE Metrics (Utilization, Saturation, Errors)


Perfect for resource monitoring:

USE Metrics
==========
┌─────────────────────────────────────────────────────────────┐
│ UTILIZATION - How busy is the resource? │
│ ─────────────────────────────────────────────────────────│
│ │
│ CPU: 70% busy │
│ Memory: 4GB / 8GB used │
│ Disk: 50% used │
│ Network: 100 Mbps / 1 Gbps │
│ │
│ ⚠️ Warning: > 70% sustained │
│ 🔴 Alert: > 90% │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ SATURATION - How full is the resource? │
│ ─────────────────────────────────────────────────────────│
│ │
│ CPU: Queue length, load average │
│ Memory: Available, swap usage │
│ Disk: Queue depth, IOPS available │
│ Network: Connection limits │
│ │
│ ⚠️ Warning: Near capacity │
│ 🔴 Alert: Saturated │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ ERRORS - Internal errors │
│ ─────────────────────────────────────────────────────────│
│ │
│ CPU: CPU temperature, throttling │
│ Memory: OOM kills, allocation failures │
│ Disk: I/O errors, filesystem errors │
│ │
│ Always alert on any errors! │
└─────────────────────────────────────────────────────────────┘
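A rough USE snapshot can be taken with the Python standard library alone (a sketch, Unix-only because of `os.getloadavg`; the 70%/90% thresholds mirror the table above, and production setups would use an exporter such as node_exporter instead):

```python
import os
import shutil

# UTILIZATION: how full the disk is
disk = shutil.disk_usage("/")
disk_pct = disk.used / disk.total * 100

# SATURATION: 1-minute load average per core —
# above 1.0 means runnable work is queueing for CPU
load_1m, _, _ = os.getloadavg()
cores = os.cpu_count() or 1
saturation = load_1m / cores

print(f"disk={disk_pct:.0f}%  load/core={saturation:.2f}")
if disk_pct > 90 or saturation > 1.0:
    print("ALERT")
elif disk_pct > 70 or saturation > 0.7:
    print("WARNING")
```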
┌──────────────────┬──────────────────────────────────────────┐
│ Scenario         │ Use                                      │
├──────────────────┼──────────────────────────────────────────┤
│ API Service      │ RED metrics                              │
│ Database         │ USE metrics                              │
│ Queue            │ USE + pending message count              │
│ Cache            │ Hit rate, evictions                      │
│ Load Balancer    │ Connections, throughput                  │
└──────────────────┴──────────────────────────────────────────┘

Counter - Only goes up
=====================
Example:
• Total requests
• Total errors
• Orders placed
Use for: Rates (calculate per second)
Prometheus:
──────────────
http_requests_total 1523456
Rate calculation:
rate(http_requests_total[5m]) = requests/second
Gauge - Goes up and down
======================
Example:
• Current CPU %
• Memory used
• Number of connections
• Queue depth
Use for: Current state snapshots
Prometheus:
──────────────
cpu_percent 65
memory_used_bytes 4294967296
Histogram - Distribution of values
=================================
Example:
• Request duration
• Response size
• Query time
Use for: Calculating percentiles
Prometheus:
──────────────
http_request_duration_seconds_bucket{le="0.1"} 1000
http_request_duration_seconds_bucket{le="0.5"} 5000
http_request_duration_seconds_bucket{le="1.0"} 8000
http_request_duration_seconds_bucket{le="+Inf"} 10000
Calculate percentiles:
histogram_quantile(0.95, rate(...))
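What `histogram_quantile` does can be approximated in a few lines: walk the cumulative buckets to the one containing the target rank, then interpolate linearly inside it. A simplified sketch using the bucket counts above (`estimate_quantile` is a hypothetical helper, not part of any library):

```python
# Cumulative histogram buckets: (upper bound in seconds,
# cumulative count of observations <= that bound)
buckets = [(0.1, 1000), (0.5, 5000), (1.0, 8000), (float("inf"), 10000)]

def estimate_quantile(q, buckets):
    """Roughly what histogram_quantile does: find the bucket holding
    the q-th observation and linearly interpolate inside it."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if rank <= count:
            if bound == float("inf"):
                return prev_bound  # can't interpolate into +Inf
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + fraction * (bound - prev_bound)
        prev_bound, prev_count = bound, count

print(estimate_quantile(0.95, buckets))  # 1.0
```

Note the trade-off this exposes: the estimate is only as precise as the bucket boundaries, which is why choosing buckets near your SLO thresholds matters.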

Metrics Collection Pipeline
=========================
┌─────────────────────────────────────────────────────────────┐
│ Applications │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Service A│ │ Service B│ │ Service C│ │
│ │ │ │ │ │ │ │
│ │ Metrics │ │ Metrics │ │ Metrics │ │
│ │ Library │ │ Library │ │ Library │ │
│ └────┬─────┘ └────┬─────┘ └────┬─────┘ │
│ │ │ │ │
└───────┼─────────────┼─────────────┼────────────────────────┘
│ │ │
▼ ▼ ▼
┌─────────────────────────────────────────────────────────────┐
│ Exporters (Pull or Push) │
│ │
│ PULL: Prometheus scrapes /metrics endpoint │
│ PUSH: App pushes to Pushgateway │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ Time Series Database │
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Prometheus │ │ InfluxDB │ │ CloudWatch │ │
│ │ │ │ │ │ │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ Visualization │
│ │
│ ┌─────────────┐ ┌─────────────┐ │
│ │ Grafana │ │ Datadog │ │
│ │ │ │ │ │
│ └─────────────┘ └─────────────┘ │
└─────────────────────────────────────────────────────────────┘
# Python with Prometheus Client
from prometheus_client import Counter, Histogram, Gauge

# Counter - for rates
http_requests_total = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

# Histogram - for durations
http_request_duration = Histogram(
    'http_request_duration_seconds',
    'HTTP request duration',
    ['method', 'endpoint'],
    buckets=[0.01, 0.05, 0.1, 0.5, 1.0, 5.0]
)

# Gauge - for current values
active_connections = Gauge(
    'active_connections',
    'Number of active connections'
)

# Using in code
@app.route('/api/users')
def get_users():
    with http_request_duration.labels(method='GET', endpoint='/api/users').time():
        users = db.query(User)
    http_requests_total.labels(method='GET', endpoint='/api/users', status=200).inc()
    return users
// Node.js with Prometheus Client
const promClient = require('prom-client');

const httpRequestsTotal = new promClient.Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'endpoint', 'status']
});

const httpRequestDuration = new promClient.Histogram({
  name: 'http_request_duration_seconds',
  help: 'Request duration in seconds',
  labelNames: ['method', 'endpoint'],
  buckets: [0.01, 0.05, 0.1, 0.5, 1, 5]
});

// Middleware
app.use((req, res, next) => {
  const start = Date.now();
  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000;
    httpRequestsTotal.inc({ method: req.method, endpoint: req.path, status: res.statusCode });
    httpRequestDuration.observe({ method: req.method, endpoint: req.path }, duration);
  });
  next();
});

Business Metrics (What matters to business)
=========================================
┌──────────────────┬──────────────────────────────────────────┐
│ Metric │ Description │
├──────────────────┼──────────────────────────────────────────┤
│ Orders/min │ Revenue rate │
│ Conversion rate │ % visitors who buy │
│ Active users │ DAU/MAU │
│ Signup rate │ User growth │
│ Cart abandonment │ Checkout flow health │
│ API usage │ Resource consumption │
│ Error budget │ SLO burn rate │
└──────────────────┴──────────────────────────────────────────┘
Infrastructure Metrics
=====================
┌──────────────────┬──────────────────────────────────────────┐
│ Metric │ Description │
├──────────────────┼──────────────────────────────────────────┤
│ CPU % │ Compute utilization │
│ Memory % │ RAM utilization │
│ Disk % │ Storage utilization │
│ Disk I/O │ Read/write operations │
│ Network I/O │ Bandwidth usage │
│ Load average │ System load │
│ Open files │ File descriptor usage │
└──────────────────┴──────────────────────────────────────────┘
Application Metrics
==================
┌──────────────────┬──────────────────────────────────────────┐
│ Metric │ Description │
├──────────────────┼──────────────────────────────────────────┤
│ Request rate │ Throughput (RPS) │
│ Error rate │ % of 5xx responses │
│ Latency p50/95/99│ Response time distribution │
│ Queue depth │ Pending work │
│ Cache hit rate │ Cache effectiveness │
│ DB query time │ Database performance │
│ Connection pool │ DB connection usage │
│ Thread pool │ Thread utilization │
└──────────────────┴──────────────────────────────────────────┘

SLO Definition
=============
SLO = Target level of reliability
Examples:
─────────────────────────────────────────
"99.9% of requests succeed"
─────────────────────────
SLO: 99.9% availability
Error budget: 0.1% = 43.2 min/month downtime allowed (30-day month)
"95% of requests respond in < 500ms"
──────────────────────────────────────
SLO: p95 latency < 500ms
Exceeded if p95 > 500ms
"99% of requests respond in < 2s"
──────────────────────────────────
SLO: p99 latency < 2s
Exceeded if p99 > 2s
Error Budget Calculation
=======================
SLO: 99.9% (monthly)
─────────────────────
Total minutes in month: 43,200 (30 days)
Allowed downtime: 0.1% = 43.2 minutes
If you use more than 43.2 minutes of downtime:
⚠️ SLO breached!
Must stop feature releases and fix issues.
──────────────────────────────────────
SLO: 99.99% (monthly)
─────────────────────
Total: 43,200 minutes
Allowed: 0.01% = 4.32 minutes
Much stricter! Only 4 minutes allowed.
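The budget arithmetic above generalizes to any availability target; a small sketch (`error_budget_minutes` is a hypothetical helper):

```python
def error_budget_minutes(slo_percent, days=30):
    """Allowed downtime per period for a given availability SLO."""
    total_minutes = days * 24 * 60  # 43,200 for a 30-day month
    return total_minutes * (100 - slo_percent) / 100

for slo in (99.0, 99.9, 99.99):
    print(f"{slo}%: {error_budget_minutes(slo):.2f} min/month")
```

Each extra nine cuts the budget by a factor of ten, which is why tightening an SLO is far more expensive than it looks.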

Dashboard Structure
==================
┌─────────────────────────────────────────────────────────────┐
│ Service Dashboard │
├─────────────────────────────────────────────────────────────┤
│ │
│ ┌────────────────────┐ ┌────────────────────┐ │
│ │ Request Rate │ │ Error Rate │ │
│ │ ████████████ │ │ █ │ │
│ │ 500 RPS │ │ 0.1% │ │
│ └────────────────────┘ └────────────────────┘ │
│ │
│ ┌────────────────────┐ ┌────────────────────┐ │
│ │ Latency (p95) │ │ CPU Usage │ │
│ │ ████████ │ │ ████████ │ │
│ │ 450ms │ │ 65% │ │
│ └────────────────────┘ └────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────┐ │
│ │ Error Breakdown by Type │ │
│ │ ████████████████████░░░ 80% Timeout │ │
│ │ █████░░░░░░░░░░░░░░░░░░ 20% DB │ │
│ └──────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘

  1. Three pillars - Metrics, logs, traces work together
  2. RED metrics - Rate, Errors, Duration (for services)
  3. USE metrics - Utilization, Saturation, Errors (for resources)
  4. Counters - Cumulative, for rates
  5. Gauges - Point-in-time values
  6. Histograms - Distributions, for percentiles
  7. SLOs - Target reliability levels
  8. Error budgets - Allowed failure allocation

Next: Chapter 28: Alerting