Chapter 25: Health Checks & Service Discovery

Ensuring System Health and Dynamic Service Connectivity

25.1 What are Health Checks?

Health checks are mechanisms that let a load balancer or orchestrator verify whether a service instance is functioning correctly and able to handle requests.

    Why Health Checks Matter
    ======================

    Without Health Checks:
    ──────────────────────

    Load Balancer           Service Instance
         │                        │
         │─── Request ──────────▶│ (service is down!)
         │◀── Error! ───────────│
         │                      │
    Result: Users see errors

    ─────────────────────────────────────────

    With Health Checks:
    ──────────────────────

    Load Balancer           Service Instance
         │                        │
         │─── Health Check ─────▶│ ✓ Healthy!
         │◀── 200 OK ───────────│
         │                      │
         │─── Request ──────────▶│ ✓ Success!
         │◀── Response ─────────│
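The flow above can be sketched in a few lines: probe every instance, then route only to the ones that answered. This is a minimal illustration, not a real load balancer; `healthy_instances`, `fake_probe`, and the addresses are hypothetical names chosen for the example.

```python
from typing import Callable, Iterable, List


def healthy_instances(instances: Iterable[str],
                      probe: Callable[[str], bool]) -> List[str]:
    """Return only the instances whose health probe succeeds."""
    healthy = []
    for instance in instances:
        try:
            if probe(instance):
                healthy.append(instance)
        except Exception:
            # A probe that raises (timeout, connection refused) counts as unhealthy.
            pass
    return healthy


def fake_probe(address: str) -> bool:
    """Hypothetical probe that pretends 10.0.1.52 is down."""
    return address != "10.0.1.52:8080"


pool = ["10.0.1.50:8080", "10.0.1.51:8080", "10.0.1.52:8080"]
routable = healthy_instances(pool, fake_probe)
print(routable)
```

In a real deployment the probe would be an HTTP request to the instance's health endpoint with a short timeout.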

25.2 Types of Health Checks

25.2.1 Liveness Probe

    Liveness Probe
    =============

    Question: "Is the application still alive and able to make progress?"

    Purpose:
    ──────────────────────
    • Detect a process that is running but has stopped responding
    • Detect deadlocks
    • Detect infinite loops

    Action on Failure:
    ──────────────────────
    Kubernetes RESTARTS the container

    ─────────────────────────────────────────

    Kubernetes Example:
    ──────────────────────
    ```yaml
    livenessProbe:
      httpGet:
        path: /health/live
        port: 8080
      initialDelaySeconds: 30    # Wait 30s before first check
      periodSeconds: 10          # Check every 10 seconds
      failureThreshold: 3        # Restart after 3 consecutive failures
      timeoutSeconds: 5          # Timeout for each check
    ```
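A liveness endpoint only catches deadlocks if it reflects whether the application is making progress. One way to do that, sketched here under assumptions, is a heartbeat that the work loop updates and the endpoint inspects. `Heartbeat`, `liveness_status`, and the 15-second limit are hypothetical, not part of Kubernetes or any library.

```python
import time
from typing import Optional, Tuple


class Heartbeat:
    """Tracks when the main work loop last made progress."""

    def __init__(self, max_silence_seconds: float) -> None:
        self.max_silence_seconds = max_silence_seconds
        self.last_beat = time.monotonic()

    def beat(self) -> None:
        # Call this from the work loop each time an iteration completes.
        self.last_beat = time.monotonic()

    def is_alive(self, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        return (now - self.last_beat) <= self.max_silence_seconds


def liveness_status(heartbeat: Heartbeat,
                    now: Optional[float] = None) -> Tuple[dict, int]:
    """Map heartbeat state to (body, HTTP status) for a /health/live handler."""
    if heartbeat.is_alive(now):
        return {"status": "ok"}, 200
    return {"status": "stalled"}, 503


heartbeat = Heartbeat(max_silence_seconds=15)
print(liveness_status(heartbeat))
```

If the work loop hangs, `beat()` stops being called, the endpoint starts returning 503, and the probe's failure threshold eventually triggers a restart.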

25.2.2 Readiness Probe

    Readiness Probe
    =============

    Question: "Is the service ready to receive traffic?"

    Purpose:
    ──────────────────────
    • Service dependencies available?
    • Database connected?
    • Cache warmed up?
    • Initial data loaded?

    Action on Failure:
    ──────────────────────
    Kubernetes REMOVES from service pool

    ─────────────────────────────────────────

    Kubernetes Example:
    ──────────────────────
    ```yaml
    readinessProbe:
      httpGet:
        path: /health/ready
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 5
      failureThreshold: 3
      successThreshold: 1
    ```

    Why successThreshold: 1?
    ──────────────────────
    One successful probe is enough to put the pod back in the
    service pool, so it serves traffic again as soon as it recovers
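The interaction between `failureThreshold` and `successThreshold` is easier to see when replayed over a sequence of probe results. The function below is a simplified model for illustration, not the kubelet's actual implementation; `readiness_timeline` is a hypothetical name.

```python
from typing import Iterable, List


def readiness_timeline(results: Iterable[bool],
                       failure_threshold: int = 3,
                       success_threshold: int = 1,
                       initially_ready: bool = False) -> List[bool]:
    """Replay probe results and return the ready state after each probe."""
    ready = initially_ready
    failures = 0
    successes = 0
    timeline = []
    for ok in results:
        if ok:
            successes += 1
            failures = 0  # any success resets the consecutive-failure count
            if not ready and successes >= success_threshold:
                ready = True
        else:
            failures += 1
            successes = 0  # any failure resets the consecutive-success count
            if ready and failures >= failure_threshold:
                ready = False
        timeline.append(ready)
    return timeline


probes = [True, False, False, True, False, False, False, True]
print(readiness_timeline(probes))
```

Two isolated failures do not remove the pod; three consecutive failures do, and a single success restores it.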

25.2.3 Startup Probe

    Startup Probe
    =============

    Question: "Has the application started?"

    Purpose:
    ──────────────────────
    • For slow-starting applications
    • When liveness would be too aggressive
    • During initialization

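    Action on Failure:
    ──────────────────────
    Kubernetes kills and RESTARTS the container once
    failureThreshold is exhausted. Liveness and readiness
    probes do not run until the startup probe succeeds.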

    ─────────────────────────────────────────

    Kubernetes Example:
    ──────────────────────
    ```yaml
    startupProbe:
      httpGet:
        path: /health/startup
        port: 8080
      failureThreshold: 30       # 30 * 10s = 5 minutes max
      periodSeconds: 10
    ```

    Use Case:
    ──────────────────────
    • Legacy applications
    • Complex initialization
    • Waiting for dependent services
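The startup budget in the example is simply `failureThreshold * periodSeconds`. A small helper, sketched here with hypothetical function names, makes it easy to check a configuration or to derive a threshold from a measured worst-case startup time. It ignores per-probe timeouts, so treat the result as an approximation.

```python
import math


def startup_budget_seconds(failure_threshold: int, period_seconds: int,
                           initial_delay_seconds: int = 0) -> int:
    """Approximate time an application has to start before it is restarted."""
    return initial_delay_seconds + failure_threshold * period_seconds


def required_failure_threshold(worst_case_startup_seconds: float,
                               period_seconds: int) -> int:
    """Smallest failureThreshold that covers the given worst-case startup time."""
    return math.ceil(worst_case_startup_seconds / period_seconds)


print(startup_budget_seconds(failure_threshold=30, period_seconds=10))
print(required_failure_threshold(worst_case_startup_seconds=240, period_seconds=10))
```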

25.3 Implementing Health Endpoints

Basic Health Endpoint

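# Flask health endpoints
from flask import Flask

app = Flask(__name__)

@app.route('/health/live')
def liveness():
    # Keep liveness trivial: it only shows the process can answer HTTP
    return {'status': 'ok'}, 200

@app.route('/health/ready')
def readiness():
    # Check dependencies (the check_* helpers are defined under
    # "Deep Health Checks")
    checks = {
        'database': check_database(),
        'redis': check_redis(),
        'external_api': check_external_api()
    }

    # All checks must pass; 503 tells the caller to stop sending traffic
    if all(checks.values()):
        return {'status': 'ready', 'checks': checks}, 200
    else:
        return {'status': 'not_ready', 'checks': checks}, 503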

Deep Health Checks

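import requests

# Assumes `db` is the application's database handle, whose connection()
# context manager yields an object with execute(), and `redis` is an
# already-configured Redis client instance (not the redis module itself).

def check_database():
    """Check database connectivity"""
    try:
        with db.connection() as conn:
            conn.execute("SELECT 1")  # cheapest possible round trip
        return True
    except Exception:
        return False

def check_redis():
    """Check Redis connectivity"""
    try:
        redis.ping()
        return True
    except Exception:
        return False

def check_external_api():
    """Check external API availability"""
    try:
        response = requests.get(
            'https://api.example.com/health',
            timeout=2  # never let a slow dependency stall the health check
        )
        return response.status_code == 200
    except Exception:
        return False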
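Running deep checks one after another makes the endpoint as slow as the sum of its dependencies. A possible refinement, shown here as a sketch using only the standard library, runs them in parallel under a shared deadline and treats anything that raises or overruns as failed. `run_checks` and its parameters are hypothetical names.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict


def run_checks(checks: Dict[str, Callable[[], bool]],
               timeout_seconds: float) -> Dict[str, bool]:
    """Run all checks in parallel; raising or overrunning counts as failed."""
    results: Dict[str, bool] = {}
    executor = ThreadPoolExecutor(max_workers=max(1, len(checks)))
    try:
        futures = {name: executor.submit(fn) for name, fn in checks.items()}
        deadline = time.monotonic() + timeout_seconds
        for name, future in futures.items():
            remaining = max(0.0, deadline - time.monotonic())
            try:
                results[name] = bool(future.result(timeout=remaining))
            except Exception:
                # Covers both a check that raised and one that missed the deadline.
                results[name] = False
    finally:
        # Do not block the response on checks that are still running.
        executor.shutdown(wait=False)
    return results


print(run_checks({"always_ok": lambda: True}, timeout_seconds=0.5))
```

With the helpers above, the readiness handler could call `run_checks({'database': check_database, 'redis': check_redis}, timeout_seconds=1.0)` and keep the same `all(checks.values())` logic.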

25.4 Service Discovery

Service discovery allows services to find each other dynamically without hardcoded addresses.

    Service Discovery Flow
    ====================

    Without Service Discovery:
    ────────────────────────

    Service A              Config File          Service B
       │                        │                  │
       │   Hardcoded IP?        │                  │
       │───────────────────────▶│                  │
       │                        │   10.0.1.50     │
       │◀──────────────────────│                  │
       │                        │                  │
       │   Request to 10.0.1.50│                  │
       │─────────────────────────────────────────▶│

    Problem: What if Service B moves?

    ────────────────────────────────────────────────

    With Service Discovery:
    ────────────────────────

    Service A          Service Registry        Service B
       │                    │                    │
       │   Where is B?     │                    │
       │───────────────────▶│                    │
       │                    │   10.0.1.50       │
       │◀──────────────────│                    │
       │                    │                    │
       │   Request to B    │                    │
       │─────────────────────────────────────────▶│

    Service B can move, IP changes - Service A doesn't care!
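On the client side, the key difference is that the address is looked up at request time instead of being read once from configuration. The sketch below assumes a `lookup` callable that stands in for the registry query; `DiscoveringClient` and the addresses are hypothetical.

```python
import itertools
from typing import Callable, Dict, List


class DiscoveringClient:
    """Resolves a service name on every call instead of using a hardcoded IP."""

    def __init__(self, lookup: Callable[[str], List[str]]) -> None:
        self._lookup = lookup  # hypothetical registry query function
        self._counter = itertools.count()

    def pick(self, service_name: str) -> str:
        addresses = self._lookup(service_name)
        if not addresses:
            raise LookupError(f"no instances registered for {service_name}")
        # Simple round-robin over whatever the registry returns right now.
        return addresses[next(self._counter) % len(addresses)]


# A dict stands in for the registry in this example.
table: Dict[str, List[str]] = {"service-b": ["10.0.1.50:8080"]}
client = DiscoveringClient(lambda name: table.get(name, []))

before_move = client.pick("service-b")
table["service-b"] = ["10.0.9.7:8080"]  # Service B moves to a new address
after_move = client.pick("service-b")
print(before_move, after_move)
```

Service A's code does not change when Service B moves; only the registry entry does.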

Service Registry

    Service Registry Pattern
    ======================

    ┌─────────────────────────────────────────────────────────────┐
    │                    Service Registry                         │
    │  (Database of available services)                          │
    │                                                             │
    │  ┌─────────────────────────────────────────────────────┐  │
    │  │  Service: user-service                              │  │
    │  │  ┌─────────────────────────────────────────────┐  │  │
    │  │  │ Instance 1: 10.0.1.50:8080 (healthy)        │  │  │
    │  │  │ Instance 2: 10.0.1.51:8080 (healthy)        │  │  │
    │  │  │ Instance 3: 10.0.1.52:8080 (unhealthy)       │  │  │
    │  │  └─────────────────────────────────────────────┘  │  │
    │  └─────────────────────────────────────────────────────┘  │
    │                                                             │
    │  ┌─────────────────────────────────────────────────────┐  │
    │  │  Service: order-service                             │  │
    │  │  ┌─────────────────────────────────────────────┐  │  │
    │  │  │ Instance 1: 10.0.2.10:8080 (healthy)        │  │  │
    │  │  │ Instance 2: 10.0.2.11:8080 (healthy)        │  │  │
    │  │  └─────────────────────────────────────────────┘  │  │
    │  └─────────────────────────────────────────────────────┘  │
    └─────────────────────────────────────────────────────────────┘
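The registry in the diagram is essentially a map from service name to a list of instances with a health flag. A minimal in-memory version, for illustration only, might look like this; the class and method names are hypothetical, and a production registry would add persistence, replication, and access control.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Instance:
    address: str
    healthy: bool = True


@dataclass
class ServiceRegistry:
    services: Dict[str, List[Instance]] = field(default_factory=dict)

    def register(self, name: str, address: str, healthy: bool = True) -> None:
        self.services.setdefault(name, []).append(Instance(address, healthy))

    def set_health(self, name: str, address: str, healthy: bool) -> None:
        for instance in self.services.get(name, []):
            if instance.address == address:
                instance.healthy = healthy

    def healthy_addresses(self, name: str) -> List[str]:
        """Only healthy instances are handed out to callers."""
        return [i.address for i in self.services.get(name, []) if i.healthy]


registry = ServiceRegistry()
registry.register("user-service", "10.0.1.50:8080")
registry.register("user-service", "10.0.1.51:8080")
registry.register("user-service", "10.0.1.52:8080", healthy=False)
registry.register("order-service", "10.0.2.10:8080")
registry.register("order-service", "10.0.2.11:8080")
print(registry.healthy_addresses("user-service"))
```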

Service Registration

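# Service registers itself and re-registers periodically (heartbeat)
import requests
import time

def register_service():
    """Register with a Consul-style service registry"""
    while True:
        try:
            # Consul's agent API expects PUT for service registration
            response = requests.put(
                'http://registry:8500/v1/agent/service/register',
                json={
                    "ID": "user-service-1",
                    "Name": "user-service",
                    "Address": "10.0.1.50",
                    "Port": 8080,
                    "Check": {
                        # Point the registry at the readiness endpoint
                        "HTTP": "http://10.0.1.50:8080/health/ready",
                        "Interval": "10s",
                        # Consul enforces a 1-minute minimum here
                        "DeregisterCriticalServiceAfter": "1m"
                    }
                },
                timeout=5
            )
            response.raise_for_status()
        except requests.RequestException as e:
            print(f"Failed to register: {e}")

        time.sleep(10)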
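On the registry side, a heartbeat scheme needs a rule for dropping instances that stop reporting. The sketch below shows a generic TTL-based expiry. It is an illustration of the pattern, not how any particular registry implements it, and `HeartbeatRegistry` is a hypothetical name.

```python
import time
from typing import Dict, List, Optional


class HeartbeatRegistry:
    """Keeps an instance listed only while its heartbeats keep arriving."""

    def __init__(self, ttl_seconds: float) -> None:
        self.ttl_seconds = ttl_seconds
        self._last_seen: Dict[str, float] = {}

    def heartbeat(self, instance_id: str, now: Optional[float] = None) -> None:
        self._last_seen[instance_id] = time.monotonic() if now is None else now

    def live_instances(self, now: Optional[float] = None) -> List[str]:
        now = time.monotonic() if now is None else now
        expired = [instance_id for instance_id, seen in self._last_seen.items()
                   if now - seen > self.ttl_seconds]
        for instance_id in expired:
            # No heartbeat within the TTL: treat the instance as gone.
            del self._last_seen[instance_id]
        return sorted(self._last_seen)


registry = HeartbeatRegistry(ttl_seconds=30)
registry.heartbeat("user-service-1", now=0)
registry.heartbeat("user-service-2", now=20)
print(registry.live_instances(now=25))
print(registry.live_instances(now=40))
```

The explicit `now` argument exists only to make the example deterministic; real callers would omit it.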

25.5 Popular Service Discovery Tools

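| Tool | Type | Description |
|------|------|-------------|
| Consul | Self-hosted | HashiCorp’s service discovery and service mesh tool |
| Eureka | Self-hosted | Netflix’s service registry |
| etcd | Self-hosted | Distributed key-value store, often used as a registry backend |
| Kubernetes DNS | Built-in | DNS-based discovery |
| AWS Cloud Map | Managed | AWS service discovery |
| Service Fabric | Managed | Microsoft’s microservices platform with a built-in naming service |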

Kubernetes Service Discovery

    Kubernetes DNS Service Discovery
    =================================

    When you create a Service in Kubernetes:
    ─────────────────────────────────────────

    apiVersion: v1
    kind: Service
    metadata:
      name: user-service
    spec:
      selector:
        app: user-service
      ports:
        - port: 80
          targetPort: 8080

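    DNS Records Created:
    ────────────────────

    user-service.default.svc.cluster.local
    │
    └── A Record: the Service's ClusterIP (one stable virtual IP;
        traffic is load-balanced to the matching pods, e.g.
        10.0.1.50, 10.0.1.51, 10.0.1.52)

    A headless Service (clusterIP: None) instead returns one
    A record per ready pod.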

    Usage:
    ────────────────────

    # From another pod in the same namespace
    curl http://user-service

    # Or fully qualified
    curl http://user-service.default.svc.cluster.local
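From application code, DNS-based discovery is just an ordinary hostname lookup. The sketch below builds the fully qualified name and resolves it with the standard library. `service_fqdn` and `resolve` are hypothetical helpers, and `cluster.local` is assumed as the cluster domain, as in the example above.

```python
import socket
from typing import List


def service_fqdn(name: str, namespace: str = "default",
                 cluster_domain: str = "cluster.local") -> str:
    """Build the fully qualified DNS name of a Kubernetes Service."""
    return f"{name}.{namespace}.svc.{cluster_domain}"


def resolve(host: str, port: int) -> List[str]:
    """Return the distinct IP addresses that DNS reports for host."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


print(service_fqdn("user-service"))
# Inside a pod, this would return the addresses behind the Service:
#   resolve(service_fqdn("user-service"), 80)
```

The lookup itself only works from inside the cluster, which is why the call is left as a comment.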

25.6 Health Check Best Practices

    Health Check Checklist
    ====================

    ✓ Separate liveness and readiness probes
    ✓ Liveness: Simple check (is process running?)
    ✓ Readiness: Deep check (can handle traffic?)
    ✓ Keep checks lightweight (< 100ms)
    ✓ Don't check unnecessary dependencies
    ✓ Return appropriate HTTP status codes
    ✓ Include version info in response
    ✓ Monitor health check failures
    ✓ Set appropriate timeouts
    ✓ Configure failure thresholds properly

    ─────────────────────────────────────────

    Common Mistakes:
    ────────────────────────

    ✗ Liveness too aggressive (restarts healthy pods)
    ✗ Readiness too strict (never ready)
    ✗ Checking too many dependencies
    ✗ Long-running health checks
    ✗ No logging on health endpoint
    ✗ Not monitoring health check failures
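Several checklist items (appropriate status codes, version info, visibility into failures) can be handled in one small response builder. The function below is a framework-independent sketch; `health_response`, the `failed` field, and the version string are hypothetical.

```python
from typing import Dict, Tuple

APP_VERSION = "1.4.2"  # hypothetical; normally injected at build or deploy time


def health_response(checks: Dict[str, bool],
                    version: str = APP_VERSION) -> Tuple[dict, int]:
    """Build a (body, HTTP status) pair from individual check results."""
    failed = sorted(name for name, ok in checks.items() if not ok)
    body = {
        "status": "not_ready" if failed else "ready",
        "version": version,
        "checks": checks,
        "failed": failed,  # makes failures easy to log and alert on
    }
    # 200 keeps the instance in rotation; 503 takes it out.
    return body, (503 if failed else 200)


body, status = health_response({"database": True, "redis": False})
print(status, body["failed"])
```

The returned pair can be handed directly to most web frameworks, and the `failed` list gives monitoring something concrete to record.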

Summary

Liveness probe - Is the application still alive? (container is restarted on failure)
Readiness probe - Ready for traffic? (removes from pool if failed)
Startup probe - Has application started? (for slow-starting apps)
Deep health checks - Check dependencies (DB, cache, external APIs)
Service discovery - Dynamic service location without hardcoded IPs
Kubernetes integration - Built-in probe support
Registry - Central database of available services

Next: Chapter 26: Logging Best Practices