Chapter 25: Health Checks & Service Discovery

Ensuring System Health and Dynamic Service Connectivity

Health checks are mechanisms to verify if a service is functioning correctly and able to handle requests.

Why Health Checks Matter
======================
Without Health Checks:
──────────────────────
Load Balancer Service Instance
│ │
│─── Request ──────────▶│ (service is down!)
│◀── Error! ───────────│
│ │
Result: Users see errors
─────────────────────────────────────────
With Health Checks:
──────────────────────
Load Balancer Service Instance
│ │
│─── Health Check ─────▶│ ✓ Healthy!
│◀── 200 OK ───────────│
│ │
│─── Request ──────────▶│ ✓ Success!
│◀── Response ─────────│
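
The flow in the diagram can be sketched as a small round-robin balancer that only hands out instances whose last health check passed. This is a minimal illustration, not how any particular load balancer is implemented; the class name `HealthAwareBalancer` and its methods are hypothetical.

```python
from itertools import cycle


class HealthAwareBalancer:
    """Round-robin balancer that skips instances marked unhealthy."""

    def __init__(self, instances):
        self.instances = list(instances)
        self.healthy = set(self.instances)  # assume healthy until a check fails
        self._ring = cycle(self.instances)

    def record_check(self, instance, ok):
        """Store the result of the latest health check for one instance."""
        if ok:
            self.healthy.add(instance)
        else:
            self.healthy.discard(instance)

    def next_instance(self):
        """Return the next healthy instance, or None if every instance is down."""
        for _ in range(len(self.instances)):
            candidate = next(self._ring)
            if candidate in self.healthy:
                return candidate
        return None


lb = HealthAwareBalancer(["10.0.1.50:8080", "10.0.1.51:8080"])
lb.record_check("10.0.1.50:8080", ok=False)  # health check failed
print(lb.next_instance())  # only the healthy instance is returned
```

Requests are never routed to an instance that failed its last check, which is the difference between the two halves of the diagram.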

Liveness Probe
=============
Question: "Is the container/process running?"
Purpose:
──────────────────────
• Detect if container has crashed
• Detect deadlocks
• Detect infinite loops
Action on Failure:
──────────────────────
Kubernetes RESTARTS the container
─────────────────────────────────────────
Kubernetes Example:
──────────────────────
```yaml
livenessProbe:
  httpGet:
    path: /health/live
    port: 8080
  initialDelaySeconds: 30  # Wait 30s before first check
  periodSeconds: 10        # Check every 10 seconds
  failureThreshold: 3      # Restart after 3 failures
  timeoutSeconds: 5        # Timeout for each check
```
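
The probe settings above combine into a detection window: how long a hung container can go unnoticed before a restart is triggered. The arithmetic below is a rough estimate, assuming the process hangs right after a successful probe so that every later probe runs into its timeout; the function name is hypothetical.

```python
def liveness_detection_window(period_seconds, failure_threshold, timeout_seconds):
    """Rough upper bound (seconds) between a hang and the restart decision.

    Assumes the process hangs just after a successful probe, so each of the
    following probes starts one period apart and ends in a timeout.
    """
    return period_seconds * failure_threshold + timeout_seconds


# Values from the livenessProbe example above
print(liveness_detection_window(10, 3, 5))  # 35
```

Lowering `periodSeconds` or `failureThreshold` shortens this window, at the cost of restarting pods on brief hiccups.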
Readiness Probe
=============
Question: "Is the service ready to receive traffic?"
Purpose:
──────────────────────
• Service dependencies available?
• Database connected?
• Cache warmed up?
• Initial data loaded?
Action on Failure:
──────────────────────
Kubernetes REMOVES the pod from the service pool (the container is not restarted)
─────────────────────────────────────────
Kubernetes Example:
──────────────────────
```yaml
readinessProbe:
  httpGet:
    path: /health/ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
  failureThreshold: 3
  successThreshold: 1
```
Why successThreshold: 1?
──────────────────────
A single successful check puts the pod back in the service pool,
so it resumes serving traffic immediately after recovery
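
How `failureThreshold` and `successThreshold` interact can be illustrated with a small state tracker that counts consecutive results. This is a simplified model for intuition only, not the kubelet's actual implementation; the class `ProbeState` is hypothetical.

```python
class ProbeState:
    """Tracks consecutive probe results against failure/success thresholds."""

    def __init__(self, failure_threshold=3, success_threshold=1, ready=True):
        self.failure_threshold = failure_threshold
        self.success_threshold = success_threshold
        self.ready = ready
        self._failures = 0
        self._successes = 0

    def observe(self, ok):
        """Record one probe result and return the resulting ready state."""
        if ok:
            self._successes += 1
            self._failures = 0  # a success resets the failure streak
            if not self.ready and self._successes >= self.success_threshold:
                self.ready = True
        else:
            self._failures += 1
            self._successes = 0  # a failure resets the success streak
            if self.ready and self._failures >= self.failure_threshold:
                self.ready = False
        return self.ready


state = ProbeState(failure_threshold=3, success_threshold=1)
print([state.observe(ok) for ok in [False, False, False, True]])
# [True, True, False, True]
```

The pod leaves the pool only on the third consecutive failure and returns on the first success afterwards.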
Startup Probe
=============
Question: "Has the application started?"
Purpose:
──────────────────────
• For slow-starting applications
• When liveness would be too aggressive
• During initialization
Action on Failure:
──────────────────────
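Kubernetes RESTARTS the container if the probe has not succeeded
within failureThreshold * periodSeconds
Liveness and readiness probes are disabled until the startup probe succeeds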
─────────────────────────────────────────
Kubernetes Example:
──────────────────────
```yaml
startupProbe:
  httpGet:
    path: /health/startup
    port: 8080
  failureThreshold: 30  # 30 * 10s = 5 minutes max
  periodSeconds: 10
```
Use Case:
──────────────────────
• Legacy applications
• Complex initialization
• Waiting for dependent services
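
On the application side, a startup endpoint only needs to report whether initialization has finished. The sketch below keeps that state in a `threading.Event` and returns the body and status code a `/health/startup` handler could send; the names `initialized`, `startup_status` and `initialize` are illustrative assumptions, and the real initialization work is left as a placeholder.

```python
import threading

initialized = threading.Event()  # set once initialization has finished


def startup_status():
    """Body and status code a /health/startup handler could return."""
    if initialized.is_set():
        return {"status": "started"}, 200
    return {"status": "starting"}, 503


def initialize():
    """Placeholder for slow startup work (migrations, cache warm-up, etc.)."""
    # Long-running initialization would go here.
    initialized.set()


print(startup_status())  # 503 while the application is still starting
initialize()
print(startup_status())  # 200 once initialization has finished
```

With the probe settings above, the application has up to 30 * 10s = 300 seconds to reach the 200 response before the container is restarted.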

# Flask health endpoints
import requests
from flask import Flask

app = Flask(__name__)

# `db` and `redis` are assumed to be client objects that the
# application configures elsewhere (database pool, Redis client).


@app.route('/health/live')
def liveness():
    # Liveness stays cheap: if this handler runs, the process is alive
    return {'status': 'ok'}, 200


@app.route('/health/ready')
def readiness():
    # Check dependencies
    checks = {
        'database': check_database(),
        'redis': check_redis(),
        'external_api': check_external_api()
    }
    # All checks must pass
    if all(checks.values()):
        return {'status': 'ready', 'checks': checks}, 200
    else:
        # 503 tells the orchestrator to stop routing traffic here
        return {'status': 'not_ready', 'checks': checks}, 503


def check_database():
    """Check database connectivity"""
    try:
        with db.connection() as conn:
            conn.execute("SELECT 1")
        return True
    except Exception:
        return False


def check_redis():
    """Check Redis connectivity"""
    try:
        redis.ping()
        return True
    except Exception:
        return False


def check_external_api():
    """Check external API availability"""
    try:
        response = requests.get(
            'https://api.example.com/health',
            timeout=2  # never let a slow dependency stall the probe
        )
        return response.status_code == 200
    except Exception:
        return False
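
Running the dependency checks one after another means the readiness endpoint is as slow as the sum of its checks. One possible way to bound the total time, sketched here with only the standard library, is to run the checks in parallel and treat any check that overruns a time budget as failed. The helper `run_checks` and the `budget_seconds` parameter are assumptions for illustration, not part of the code above.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait


def run_checks(checks, budget_seconds=1.0):
    """Run check callables in parallel; a check that overruns the budget fails."""
    pool = ThreadPoolExecutor(max_workers=max(1, len(checks)))
    futures = {name: pool.submit(fn) for name, fn in checks.items()}
    done, _ = wait(futures.values(), timeout=budget_seconds)
    results = {}
    for name, future in futures.items():
        # Unfinished checks and checks that raised both count as failed.
        finished_ok = future in done and future.exception() is None
        results[name] = finished_ok and bool(future.result())
    pool.shutdown(wait=False)  # do not block the response on slow checks
    return results


def slow_check():
    time.sleep(0.5)  # stands in for a dependency that responds too slowly
    return True


print(run_checks({"fast": lambda: True, "slow": slow_check}, budget_seconds=0.1))
# {'fast': True, 'slow': False}
```

A slow check keeps running in its worker thread after the budget expires, so each check should still carry its own timeout, as `check_external_api` does.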

Service discovery allows services to find each other dynamically without hardcoded addresses.

Service Discovery Flow
====================
Without Service Discovery:
────────────────────────
Service A Config File Service B
│ │ │
│ Hardcoded IP? │ │
│───────────────────────▶│ │
│ │ 10.0.1.50 │
│◀──────────────────────│ │
│ │ │
│ Request to 10.0.1.50│ │
│─────────────────────────────────────────▶│
Problem: What if Service B moves?
────────────────────────────────────────────────
With Service Discovery:
────────────────────────
Service A Service Registry Service B
│ │ │
│ Where is B? │ │
│───────────────────▶│ │
│ │ 10.0.1.50 │
│◀──────────────────│ │
│ │ │
│ Request to B │ │
│─────────────────────────────────────────▶│
Service B can move, IP changes - Service A doesn't care!
Service Registry Pattern
======================
┌─────────────────────────────────────────────────────────────┐
│ Service Registry │
│ (Database of available services) │
│ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Service: user-service │ │
│ │ ┌─────────────────────────────────────────────┐ │ │
│ │ │ Instance 1: 10.0.1.50:8080 (healthy) │ │ │
│ │ │ Instance 2: 10.0.1.51:8080 (healthy) │ │ │
│ │ │ Instance 3: 10.0.1.52:8080 (unhealthy) │ │ │
│ │ └─────────────────────────────────────────────┘ │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Service: order-service │ │
│ │ ┌─────────────────────────────────────────────┐ │ │
│ │ │ Instance 1: 10.0.2.10:8080 (healthy) │ │ │
│ │ │ Instance 2: 10.0.2.11:8080 (healthy) │ │ │
│ │ └─────────────────────────────────────────────┘ │ │
│ └─────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
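
The registry in the diagram can be modeled as a table of instances with a last-seen timestamp: an instance counts as healthy while its heartbeats keep arriving. The following toy in-memory version is only meant to show the idea; `ServiceRegistry`, its methods and the 30-second TTL are illustrative assumptions, and real registries add replication, persistence and active health checks.

```python
import time


class ServiceRegistry:
    """Toy in-memory registry: instances stay healthy while heartbeats arrive."""

    def __init__(self, ttl_seconds=30.0, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self._clock = clock
        self._last_seen = {}  # (service, address) -> time of last heartbeat

    def heartbeat(self, service, address):
        """Register an instance or refresh its lease."""
        self._last_seen[(service, address)] = self._clock()

    def healthy_instances(self, service):
        """Return addresses whose last heartbeat is within the TTL."""
        now = self._clock()
        return sorted(
            addr
            for (name, addr), seen in self._last_seen.items()
            if name == service and now - seen <= self.ttl_seconds
        )


now = [0.0]  # fake clock so the example does not need to sleep
registry = ServiceRegistry(ttl_seconds=30, clock=lambda: now[0])
registry.heartbeat("user-service", "10.0.1.50:8080")
registry.heartbeat("user-service", "10.0.1.52:8080")
now[0] = 20.0
registry.heartbeat("user-service", "10.0.1.50:8080")  # only .50 keeps reporting
now[0] = 40.0
print(registry.healthy_instances("user-service"))  # ['10.0.1.50:8080']
```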
# Service registers itself; the registry then polls the health check URL
import time

import requests


def register_service():
    """Register with the service registry, re-registering periodically"""
    while True:
        try:
            # Consul's agent API expects PUT for service registration
            response = requests.put(
                'http://registry:8500/v1/agent/service/register',
                json={
                    "ID": "user-service-1",
                    "Name": "user-service",
                    "Address": "10.0.1.50",
                    "Port": 8080,
                    "Check": {
                        "HTTP": "http://10.0.1.50:8080/health",
                        "Interval": "10s",
                        # Consul enforces a minimum of 1 minute here
                        "DeregisterCriticalServiceAfter": "1m"
                    }
                },
                timeout=5
            )
            response.raise_for_status()
        except Exception as e:
            print(f"Failed to register: {e}")
        time.sleep(10)
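
The other half of the pattern is the lookup: a client asks the registry for instances whose checks are passing and picks one. The sketch below assumes a Consul-style registry at the same `http://registry:8500` address and uses its `/v1/health/service/<name>` endpoint; it relies on the standard library's `urllib` so that it is self-contained, and the function names are hypothetical.

```python
import json
import random
import urllib.request


def parse_passing_instances(payload):
    """Extract (address, port) pairs from a Consul health-endpoint response."""
    instances = []
    for entry in payload:
        service = entry["Service"]
        # Service.Address may be empty; fall back to the node's address.
        address = service.get("Address") or entry["Node"]["Address"]
        instances.append((address, service["Port"]))
    return instances


def discover(service_name, registry_url="http://registry:8500"):
    """Ask the registry for instances of a service whose checks are passing."""
    url = f"{registry_url}/v1/health/service/{service_name}?passing=true"
    with urllib.request.urlopen(url, timeout=2) as response:
        return parse_passing_instances(json.load(response))


def pick_instance(instances):
    """Client-side load balancing: choose one instance at random."""
    return random.choice(instances) if instances else None


# Shape of a (trimmed) registry response, used here instead of a live call
sample = [
    {"Node": {"Address": "10.0.1.50"}, "Service": {"Address": "10.0.1.50", "Port": 8080}},
    {"Node": {"Address": "10.0.1.51"}, "Service": {"Address": "", "Port": 8080}},
]
print(parse_passing_instances(sample))
# [('10.0.1.50', 8080), ('10.0.1.51', 8080)]
```

In a running system the client would call `pick_instance(discover("user-service"))` before each request, or cache the result for a short time.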

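| Tool | Type | Description |
|---|---|---|
| Consul | Self-hosted | HashiCorp's service discovery and service mesh tool |
| Eureka | Self-hosted | Netflix's service registry |
| etcd | Self-hosted | Distributed key-value store |
| Kubernetes DNS | Built-in | DNS-based discovery |
| AWS Cloud Map | Managed | AWS service discovery |
| Service Fabric | Managed | Microsoft's distributed systems platform with a built-in naming service |
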
Kubernetes DNS Service Discovery
=================================
When you create a Service in Kubernetes:
─────────────────────────────────────────
apiVersion: v1
kind: Service
metadata:
  name: user-service
spec:
  selector:
    app: user-service
  ports:
    - port: 80
      targetPort: 8080
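DNS Records Created:
────────────────────
user-service.default.svc.cluster.local
└── A Record: <ClusterIP> (one stable virtual IP; traffic is spread across ready pods)
Headless Service (clusterIP: None) returns one A record per ready pod instead:
├── A Record: 10.0.1.50
├── A Record: 10.0.1.51
└── A Record: 10.0.1.52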
Usage:
────────────────────
# From another pod in the same namespace
curl http://user-service
# Or fully qualified
curl http://user-service.default.svc.cluster.local
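
From application code, DNS-based discovery is just a name lookup. The sketch below resolves a hostname to its IP addresses with the standard library; the helper name `resolve_service` is hypothetical, and `localhost` is used only so the example runs outside a cluster.

```python
import socket


def resolve_service(hostname, port=80):
    """Return the sorted, de-duplicated IP addresses a DNS name resolves to."""
    infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})


# Inside a cluster this would be resolve_service("user-service").
# "localhost" is used here only so the sketch runs anywhere.
print(resolve_service("localhost"))
```

Most HTTP clients perform this lookup implicitly, so calling `http://user-service` is usually all the application needs to do.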

Health Check Checklist
====================
✓ Separate liveness and readiness probes
✓ Liveness: Simple check (is process running?)
✓ Readiness: Deep check (can handle traffic?)
✓ Keep checks lightweight (< 100ms)
✓ Don't check unnecessary dependencies
✓ Return appropriate HTTP status codes
✓ Include version info in response
✓ Monitor health check failures
✓ Set appropriate timeouts
✓ Configure failure thresholds properly
─────────────────────────────────────────
Common Mistakes:
────────────────────────
✗ Liveness too aggressive (restarts healthy pods)
✗ Readiness too strict (never ready)
✗ Checking too many dependencies
✗ Long-running health checks
✗ Not logging why a health check failed
✗ Not monitoring health check failures
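
Two of the checklist items, appropriate status codes and version info in the response, can be combined in one small helper. This is a sketch under assumptions: the function name `build_health_response` and the version string are made up for illustration.

```python
def build_health_response(checks, version):
    """Return (body, status_code) for a readiness-style health endpoint."""
    healthy = all(checks.values())
    body = {
        "status": "ready" if healthy else "not_ready",
        "version": version,  # helps correlate failures with a deployment
        "checks": checks,
    }
    return body, (200 if healthy else 503)


body, status = build_health_response({"database": True, "redis": False}, "1.4.2")
print(status, body["status"])  # 503 not_ready
```

Returning the per-dependency results alongside the status code makes a failing probe much easier to diagnose from logs or monitoring.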

  1. Liveness probe - Is the container running? (restarts if failed)
  2. Readiness probe - Ready for traffic? (removes from pool if failed)
  3. Startup probe - Has application started? (for slow-starting apps)
  4. Deep health checks - Check dependencies (DB, cache, external APIs)
  5. Service discovery - Dynamic service location without hardcoded IPs
  6. Kubernetes integration - Built-in probe support
  7. Registry - Central database of available services
