# Chapter 25: Health Checks & Service Discovery

## Ensuring System Health and Dynamic Service Connectivity

## 25.1 What are Health Checks?

Health checks are mechanisms that verify whether a service is functioning correctly and able to handle requests.
```
Why Health Checks Matter
========================

Without Health Checks:
──────────────────────

Load Balancer        Service Instance
     │                       │
     │─── Request ──────────▶│  (service is down!)
     │◀── Error! ────────────│
     │                       │
Result: Users see errors

─────────────────────────────────────────

With Health Checks:
──────────────────────

Load Balancer        Service Instance
     │                       │
     │─── Health Check ─────▶│  ✓ Healthy!
     │◀── 200 OK ────────────│
     │                       │
     │─── Request ──────────▶│  ✓ Success!
     │◀── Response ──────────│
```

## 25.2 Types of Health Checks
### 25.2.1 Liveness Probe

Question: "Is the container/process running?"

Purpose:

- Detect if the container has crashed
- Detect deadlocks
- Detect infinite loops

Action on failure: Kubernetes restarts the container.
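Deadlocks and infinite loops are only visible to a liveness probe if the endpoint reflects real progress, not merely that the HTTP server still answers. The following is a minimal sketch of one way to do that, assuming a hypothetical `Heartbeat` helper that a worker loop touches after each unit of work, and an illustrative 30-second staleness limit:

```python
import time


class Heartbeat:
    """Tracks the last time a worker reported progress (hypothetical helper)."""

    def __init__(self, max_age_seconds=30.0, clock=time.monotonic):
        self.max_age_seconds = max_age_seconds
        self._clock = clock
        self._last_beat = clock()

    def beat(self):
        """Called by the worker loop after each completed unit of work."""
        self._last_beat = self._clock()

    def is_alive(self):
        """True while the most recent beat is within the allowed age."""
        return (self._clock() - self._last_beat) <= self.max_age_seconds


def liveness_response(heartbeat):
    """Return a (body, status) pair suitable for a /health/live handler."""
    if heartbeat.is_alive():
        return {"status": "ok"}, 200
    return {"status": "stalled"}, 503
```

A stalled worker stops calling `beat()`, the endpoint starts returning 503, and the liveness probe fails until the container is restarted. The staleness limit should comfortably exceed the longest legitimate unit of work, otherwise healthy pods get restarted.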
Kubernetes example:

```yaml
livenessProbe:
  httpGet:
    path: /health/live
    port: 8080
  initialDelaySeconds: 30  # Wait 30s before first check
  periodSeconds: 10        # Check every 10 seconds
  failureThreshold: 3      # Restart after 3 consecutive failures
  timeoutSeconds: 5        # Timeout for each check
```

### 25.2.2 Readiness Probe
Question: "Is the service ready to receive traffic?"

Purpose:

- Are service dependencies available?
- Is the database connected?
- Is the cache warmed up?
- Has initial data been loaded?

Action on failure: Kubernetes removes the pod from the Service's endpoints, so it stops receiving traffic but is not restarted.

Kubernetes example:

```yaml
readinessProbe:
  httpGet:
    path: /health/ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
  failureThreshold: 3
  successThreshold: 1
```
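The `failureThreshold` and `successThreshold` fields count consecutive probe results. The sketch below is a rough model of that bookkeeping, not the kubelet's actual implementation; `ProbeState` is a hypothetical name used only for illustration:

```python
class ProbeState:
    """Tracks consecutive probe results against success/failure thresholds."""

    def __init__(self, failure_threshold=3, success_threshold=1, ready=False):
        self.failure_threshold = failure_threshold
        self.success_threshold = success_threshold
        self.ready = ready
        self._successes = 0
        self._failures = 0

    def record(self, success):
        """Record one probe result and return the resulting ready state."""
        if success:
            self._successes += 1
            self._failures = 0
            if self._successes >= self.success_threshold:
                self.ready = True
        else:
            self._failures += 1
            self._successes = 0
            if self._failures >= self.failure_threshold:
                self.ready = False
        return self.ready


state = ProbeState(failure_threshold=3, success_threshold=1)
results = [True, False, False, False, True]
history = [state.record(r) for r in results]
print(history)  # [True, True, True, False, True]
```

With these values, one or two isolated failures leave the pod in the pool; only the third consecutive failure removes it, and a single success restores it.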
Why `successThreshold: 1`? A single successful probe is enough to mark the pod ready again, so it resumes serving traffic as soon as it recovers.

### 25.2.3 Startup Probe
Question: "Has the application started?"

Purpose:

- For slow-starting applications
- When the liveness probe would be too aggressive
- During initialization

Action on failure: Kubernetes restarts the container if the probe has not succeeded within `failureThreshold × periodSeconds`. Liveness and readiness probes do not run until the startup probe has succeeded.

Kubernetes example:

```yaml
startupProbe:
  httpGet:
    path: /health/startup
    port: 8080
  failureThreshold: 30  # 30 * 10s = 5 minutes max
  periodSeconds: 10
```
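The comment in the manifest above is simple arithmetic: the startup budget is roughly `failureThreshold × periodSeconds`. A small sketch of that calculation and its inverse follows, using hypothetical helper names and ignoring per-probe timeouts, so the results are approximations:

```python
import math


def startup_budget_seconds(failure_threshold, period_seconds, initial_delay_seconds=0):
    """Approximate time the application has to pass its startup probe."""
    return initial_delay_seconds + failure_threshold * period_seconds


def failure_threshold_for(worst_case_startup_seconds, period_seconds):
    """Smallest failureThreshold whose budget covers the worst-case startup time."""
    return math.ceil(worst_case_startup_seconds / period_seconds)


print(startup_budget_seconds(failure_threshold=30, period_seconds=10))  # 300
print(failure_threshold_for(worst_case_startup_seconds=300, period_seconds=10))  # 30
```

Sizing the threshold from a measured worst-case startup time, plus some margin, avoids restart loops on slow nodes.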
Use cases:

- Legacy applications
- Complex initialization
- Waiting for dependent services

## 25.3 Implementing Health Endpoints
### Basic Health Endpoint

```python
# Flask health endpoints
from flask import Flask

app = Flask(__name__)


@app.route('/health/live')
def liveness():
    # Liveness stays cheap: no dependency checks
    return {'status': 'ok'}, 200


@app.route('/health/ready')
def readiness():
    # Check dependencies (the check_* helpers are defined under Deep Health Checks)
    checks = {
        'database': check_database(),
        'redis': check_redis(),
        'external_api': check_external_api()
    }

    # All checks must pass
    if all(checks.values()):
        return {'status': 'ready', 'checks': checks}, 200
    else:
        return {'status': 'not_ready', 'checks': checks}, 503
```

### Deep Health Checks
```python
# `db` and `redis` are assumed to be client objects configured elsewhere in the application

def check_database():
    """Check database connectivity"""
    try:
        with db.connection() as conn:
            conn.execute("SELECT 1")
        return True
    except Exception:
        return False


def check_redis():
    """Check Redis connectivity"""
    try:
        redis.ping()
        return True
    except Exception:
        return False
```
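Checks like the two above contact a real dependency on every probe, which can add load when many pods are probed every few seconds. One way to limit that cost is to reuse a recent result for a short time. The wrapper below is a minimal sketch; `CachedCheck` and the 5-second TTL are illustrative assumptions, not part of any framework:

```python
import time


class CachedCheck:
    """Wraps a boolean check function and reuses its result for `ttl_seconds`."""

    def __init__(self, check, ttl_seconds=5.0, clock=time.monotonic):
        self._check = check
        self._ttl = ttl_seconds
        self._clock = clock
        self._expires_at = None  # None means "never run yet"
        self._result = False

    def __call__(self):
        now = self._clock()
        if self._expires_at is None or now >= self._expires_at:
            # Cached value is missing or stale: run the real check once
            self._result = bool(self._check())
            self._expires_at = now + self._ttl
        return self._result
```

It could be used as `cached_db = CachedCheck(check_database, ttl_seconds=5)` and called from the readiness handler in place of `check_database()`. The trade-off is that a failure may take up to one TTL longer to surface.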
```python
import requests


def check_external_api():
    """Check external API availability"""
    try:
        response = requests.get(
            'https://api.example.com/health',
            timeout=2  # Keep the probe short so it cannot hang the endpoint
        )
        return response.status_code == 200
    except Exception:
        return False
```

## 25.4 Service Discovery
Service discovery allows services to find each other dynamically without hardcoded addresses.

```
Service Discovery Flow
======================

Without Service Discovery:
────────────────────────

Service A              Config File              Service B
    │                       │                       │
    │  Hardcoded IP?        │                       │
    │──────────────────────▶│                       │
    │                       │                       │
    │       10.0.1.50       │                       │
    │◀──────────────────────│                       │
    │                       │                       │
    │  Request to 10.0.1.50 │                       │
    │──────────────────────────────────────────────▶│

Problem: What if Service B moves?

────────────────────────────────────────────────

With Service Discovery:
────────────────────────

Service A           Service Registry            Service B
    │                       │                       │
    │  Where is B?          │                       │
    │──────────────────────▶│                       │
    │                       │                       │
    │       10.0.1.50       │                       │
    │◀──────────────────────│                       │
    │                       │                       │
    │  Request to B         │                       │
    │──────────────────────────────────────────────▶│
```
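The lookup-then-request flow in the diagram can be sketched with a toy in-memory registry. Everything here is illustrative: `ServiceRegistry` and `DiscoveringClient` are hypothetical names, and the second address is made up to simulate Service B moving.

```python
import itertools


class ServiceRegistry:
    """Toy in-memory registry mapping a service name to its current addresses."""

    def __init__(self):
        self._instances = {}

    def set_instances(self, name, addresses):
        self._instances[name] = list(addresses)

    def lookup(self, name):
        addresses = self._instances.get(name, [])
        if not addresses:
            raise LookupError(f"no instances registered for {name!r}")
        return list(addresses)


class DiscoveringClient:
    """Resolves the target address through the registry on every call."""

    def __init__(self, registry, service_name):
        self._registry = registry
        self._service_name = service_name
        self._counter = itertools.count()

    def next_address(self):
        addresses = self._registry.lookup(self._service_name)
        # Simple round-robin over whatever the registry currently returns
        return addresses[next(self._counter) % len(addresses)]


registry = ServiceRegistry()
registry.set_instances("service-b", ["10.0.1.50:8080"])
client = DiscoveringClient(registry, "service-b")
print(client.next_address())  # 10.0.1.50:8080

# Service B moves (hypothetical new address); only the registry entry changes
registry.set_instances("service-b", ["10.0.3.7:8080"])
print(client.next_address())  # 10.0.3.7:8080
```

The client never stores an address, so no configuration change or redeploy of Service A is needed when Service B is rescheduled.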
Service B can move and its IP can change; Service A doesn't care.

### Service Registry
```
Service Registry Pattern
========================

┌─────────────────────────────────────────────────────────────┐
│                      Service Registry                       │
│              (Database of available services)               │
│                                                             │
│  ┌─────────────────────────────────────────────────────┐    │
│  │  Service: user-service                              │    │
│  │  ┌─────────────────────────────────────────────┐    │    │
│  │  │  Instance 1: 10.0.1.50:8080 (healthy)       │    │    │
│  │  │  Instance 2: 10.0.1.51:8080 (healthy)       │    │    │
│  │  │  Instance 3: 10.0.1.52:8080 (unhealthy)     │    │    │
│  │  └─────────────────────────────────────────────┘    │    │
│  └─────────────────────────────────────────────────────┘    │
│                                                             │
│  ┌─────────────────────────────────────────────────────┐    │
│  │  Service: order-service                             │    │
│  │  ┌─────────────────────────────────────────────┐    │    │
│  │  │  Instance 1: 10.0.2.10:8080 (healthy)       │    │    │
│  │  │  Instance 2: 10.0.2.11:8080 (healthy)       │    │    │
│  │  └─────────────────────────────────────────────┘    │    │
│  └─────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────┘
```

### Service Registration
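The registry pictured above keeps a health state for each instance. One common way to maintain that state is heartbeat expiry: an instance counts as healthy only while it keeps reporting in. The sketch below models this under stated assumptions; `HeartbeatRegistry` and the 30-second TTL are hypothetical, and real registries such as Consul or Eureka have their own protocols.

```python
import time


class HeartbeatRegistry:
    """Toy registry that treats an instance as healthy only while heartbeats arrive."""

    def __init__(self, ttl_seconds=30.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._last_seen = {}  # (service, address) -> time of last heartbeat

    def heartbeat(self, service, address):
        """Register the instance, or refresh it if it is already known."""
        self._last_seen[(service, address)] = self._clock()

    def healthy_instances(self, service):
        """Return addresses whose last heartbeat is within the TTL."""
        now = self._clock()
        return sorted(
            address
            for (name, address), seen in self._last_seen.items()
            if name == service and now - seen <= self._ttl
        )
```

An instance that crashes simply stops sending heartbeats and drops out of `healthy_instances` after one TTL, without any explicit deregistration call.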
```python
# Service registers itself (heartbeat)
import time

import requests


def register_service():
    """Register with the service registry, repeating every 10 seconds"""
    while True:
        try:
            # Consul's agent API expects PUT for service registration
            response = requests.put(
                'http://registry:8500/v1/agent/service/register',
                json={
                    "ID": "user-service-1",
                    "Name": "user-service",
                    "Address": "10.0.1.50",
                    "Port": 8080,
                    "Check": {
                        "HTTP": "http://10.0.1.50:8080/health/ready",
                        "Interval": "10s",
                        # Consul enforces a minimum of 1 minute here
                        "DeregisterCriticalServiceAfter": "1m"
                    }
                },
                timeout=5
            )
            response.raise_for_status()
        except requests.RequestException as e:
            print(f"Failed to register: {e}")

        time.sleep(10)
```

## 25.5 Popular Service Discovery Tools
| Tool | Type | Description |
|---|---|---|
| Consul | Self-hosted | HashiCorp’s service registry with built-in health checks and service mesh features |
| Eureka | Self-hosted | Netflix’s service registry |
| etcd | Self-hosted | Distributed key-value store, often used as a registry backend |
| Kubernetes DNS | Built-in | DNS-based discovery for Services in a cluster |
| AWS Cloud Map | Managed | AWS service discovery |
| Service Fabric | Managed | Microsoft’s distributed application platform with a built-in naming service |
### Kubernetes Service Discovery

When you create a Service in Kubernetes:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: user-service
spec:
  selector:
    app: user-service
  ports:
    - port: 80
      targetPort: 8080
```

DNS record created:

```
user-service.default.svc.cluster.local
│
└── A record: the Service's ClusterIP (one stable virtual IP)
```

Traffic sent to the ClusterIP is spread across the ready pods behind the Service, for example 10.0.1.50, 10.0.1.51 and 10.0.1.52. Only a headless Service (`clusterIP: None`) returns one A record per pod.
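Application code normally leaves this lookup to the system resolver. As a small sketch, the DNS name can be built from the Service name and namespace and resolved with the standard library. `service_fqdn` and `resolve` are hypothetical helper names, and `cluster.local` is assumed as the cluster domain, which can differ between clusters:

```python
import socket


def service_fqdn(name, namespace="default", cluster_domain="cluster.local"):
    """Build the fully qualified DNS name of a Kubernetes Service."""
    return f"{name}.{namespace}.svc.{cluster_domain}"


def resolve(host, port):
    """Return the sorted, de-duplicated IP addresses the resolver gives for host."""
    infos = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})


print(service_fqdn("user-service"))  # user-service.default.svc.cluster.local
```

Inside a pod, `resolve(service_fqdn("user-service"), 80)` would be expected to return the Service's ClusterIP; outside a cluster the name will not resolve.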
Usage:

```bash
# From another pod in the same namespace
curl http://user-service

# Or fully qualified
curl http://user-service.default.svc.cluster.local
```

## 25.6 Health Check Best Practices
Health check checklist:

- ✓ Separate liveness and readiness probes
- ✓ Liveness: simple check (is the process running?)
- ✓ Readiness: deep check (can it handle traffic?)
- ✓ Keep checks lightweight (< 100 ms)
- ✓ Don't check unnecessary dependencies
- ✓ Return appropriate HTTP status codes
- ✓ Include version info in the response
- ✓ Monitor health check failures
- ✓ Set appropriate timeouts
- ✓ Configure failure thresholds properly
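Two of the checklist items, appropriate status codes and version info in the response, can be combined in one small helper. This is a sketch only: `build_health_response` is a hypothetical name and the version string is a placeholder.

```python
def build_health_response(checks, version):
    """Build a (body, status_code) pair from named boolean check results."""
    healthy = all(checks.values())
    body = {
        "status": "ready" if healthy else "not_ready",
        "version": version,
        "checks": {
            name: "ok" if passed else "failed"
            for name, passed in checks.items()
        },
    }
    # 200 tells the prober to route traffic; 503 tells it to hold off
    return body, 200 if healthy else 503


body, status = build_health_response({"database": True, "redis": False}, version="1.4.2")
print(status)  # 503
```

Returning the per-check breakdown in the body makes a failing probe easier to diagnose from logs, while the prober itself only needs the status code.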
Common mistakes:

- ✗ Liveness too aggressive (restarts healthy pods)
- ✗ Readiness too strict (never ready)
- ✗ Checking too many dependencies
- ✗ Long-running health checks
- ✗ No logging on health endpoint
- ✗ Not monitoring health check failures

## Summary
- Liveness probe - Is the container running? (restarts if failed)
- Readiness probe - Ready for traffic? (removes from pool if failed)
- Startup probe - Has application started? (for slow-starting apps)
- Deep health checks - Check dependencies (DB, cache, external APIs)
- Service discovery - Dynamic service location without hardcoded IPs
- Kubernetes integration - Built-in probe support
- Registry - Central database of available services