
Chapter 51: Chaos Engineering

Chaos Engineering is the practice of intentionally introducing failures into a system to test its resilience and discover weaknesses before they cause outages in production.

What is Chaos Engineering?

┌─────────────────────────────────────────────────────────────────────────────┐
│                        Chaos Engineering Overview                              │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│   ┌───────────────────────────────────────────────────────────────────────┐ │
│   │                    Chaos Engineering Process                            │ │
│   │                                                                       │ │
│   │   1. Define Steady State                                              │ │
│   │   ┌───────────────────────────────────────────────────────────────┐ │ │
│   │   │   Measure normal behavior (latency, error rate, throughput)  │ │ │
│   │   └───────────────────────────────────────────────────────────────┘ │ │
│   │                              │                                        │ │
│   │                              ▼                                        │ │
│   │   2. Inject Failure                                                    │ │
│   │   ┌───────────────────────────────────────────────────────────────┐ │ │
│   │   │   Kill pods, introduce latency, simulate network failures     │ │ │
│   │   └───────────────────────────────────────────────────────────────┘ │ │
│   │                              │                                        │ │
│   │                              ▼                                        │ │
│   │   3. Observe                                                           │ │
│   │   ┌───────────────────────────────────────────────────────────────┐ │ │
│   │   │   Monitor system behavior and collect data                    │ │ │
│   │   └───────────────────────────────────────────────────────────────┘ │ │
│   │                              │                                        │ │
│   │                              ▼                                        │ │
│   │   4. Learn & Improve                                                  │ │
│   │   ┌───────────────────────────────────────────────────────────────┐ │ │
│   │   │   Analyze results, fix issues, repeat                         │ │ │
│   │   └───────────────────────────────────────────────────────────────┘ │ │
│   │                                                                       │ │
│   └───────────────────────────────────────────────────────────────────────┘ │
│                                                                             │
│   Principle: Build confidence through experiments                             │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
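
The four steps in the diagram can be sketched as a small Go program. This is only an illustration under stated assumptions, not a real chaos tool: the names Metrics, Experiment, withinTolerance, and run are hypothetical, the measurements come from a fake in-memory system, and the tolerance values are arbitrary.

```go
package main

import "fmt"

// Metrics is a hypothetical snapshot of the signals that define steady state.
type Metrics struct {
	ErrorRate    float64 // fraction of failed requests, 0.0 to 1.0
	P99LatencyMs float64 // 99th percentile latency in milliseconds
}

// Experiment bundles the pieces of one chaos experiment (all names are assumptions).
type Experiment struct {
	Name                 string
	Measure              func() Metrics // steps 1 and 3: read current metrics
	Inject               func() error   // step 2: introduce the failure
	Rollback             func() error   // undo the failure
	MaxErrorRateIncrease float64        // allowed absolute rise in error rate
	MaxLatencyFactor     float64        // allowed multiple of baseline latency
}

// withinTolerance reports whether the observed metrics stay close enough to the baseline.
func withinTolerance(e Experiment, base, obs Metrics) bool {
	return obs.ErrorRate-base.ErrorRate <= e.MaxErrorRateIncrease &&
		obs.P99LatencyMs <= base.P99LatencyMs*e.MaxLatencyFactor
}

// run walks through the four steps and reports whether steady state held.
func run(e Experiment) (held bool, err error) {
	baseline := e.Measure() // 1. define steady state

	// Roll back on every exit path, even if injection fails part-way.
	defer func() {
		if rbErr := e.Rollback(); rbErr != nil && err == nil {
			err = fmt.Errorf("rollback %s: %w", e.Name, rbErr)
		}
	}()

	if injErr := e.Inject(); injErr != nil { // 2. inject failure
		return false, fmt.Errorf("inject %s: %w", e.Name, injErr)
	}

	observed := e.Measure()                            // 3. observe
	return withinTolerance(e, baseline, observed), nil // 4. learn from the comparison
}

func main() {
	degraded := false // fake system state toggled by the fake fault
	exp := Experiment{
		Name: "pod-kill",
		Measure: func() Metrics {
			if degraded {
				return Metrics{ErrorRate: 0.02, P99LatencyMs: 180}
			}
			return Metrics{ErrorRate: 0.01, P99LatencyMs: 120}
		},
		Inject:               func() error { degraded = true; return nil },
		Rollback:             func() error { degraded = false; return nil },
		MaxErrorRateIncrease: 0.05,
		MaxLatencyFactor:     2.0,
	}
	held, err := run(exp)
	fmt.Println("steady state held:", held, "error:", err)
}
```

In a real setup, Measure would query a monitoring system and Inject would call a tool such as those described in the next section. The point of the sketch is the shape of the loop: measure, inject, measure again, compare, and always roll back.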

Chaos Engineering Tools

┌─────────────────────────────────────────────────────────────────────────────┐
│                         Chaos Engineering Tools                              │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│   ┌─────────────────┐    ┌─────────────────┐                             │
│   │   LitmusChaos   │    │  Chaos Mesh     │                             │
│   │                 │    │                 │                             │
│   │ - Kubernetes    │    │ - Kubernetes    │                             │
│   │ - CNCF project │    │ - Cloud-native  │                             │
│   │ - Many experiments│  │ - Clock skew    │                             │
│   └─────────────────┘    └─────────────────┘                             │
│                                                                             │
│   ┌─────────────────┐    ┌─────────────────┐                             │
│   │    Gremlin      │    │  AWS Fault     │                             │
│   │                 │    │  Injection     │                             │
│   │ - SaaS          │    │  Simulator     │                             │
│   │ - Easy to use   │    │ - AWS-specific │                             │
│   │ - Safe mode    │    │ - Controlled   │                             │
│   └─────────────────┘    └─────────────────┘                             │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

Kubernetes Chaos Experiments

Using LitmusChaos

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: pod-kill-chaos
  namespace: litmus
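spec:
  engineState: 'active'            # set to 'stop' to halt the experiment
  appinfo:
    appns: "production"
    applabel: "app=myapp"
    appkind: "deployment"          # assumes the target workload is a Deployment
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            # Total duration of the chaos, in seconds
            - name: TOTAL_CHAOS_DURATION
              value: '30'
            # Seconds to wait between successive pod deletions
            - name: CHAOS_INTERVAL
              value: '10'
            # 'false' deletes pods gracefully instead of forcing termination
            - name: FORCE
              value: 'false'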

Using Chaos Mesh

apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-kill-example
  namespace: chaos-mesh
spec:
  action: pod-kill           # delete the selected pod
  mode: one                  # pick one matching pod at random
  duration: '30s'
  selector:
    namespaces:
      - production
    labelSelectors:
      app: myapp

Common Chaos Experiments

Network Experiments

apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-partition
  namespace: chaos-mesh
spec:
  action: partition
  mode: one
  duration: '60s'
  selector:
    namespaces:
      - production
  direction: both            # block traffic in both directions
  target:
    mode: all                # apply to every pod matched by the target selector
    selector:
      namespaces:
        - production
      labelSelectors:
        role: database

Resource Experiments

apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: stress-cpu
  namespace: chaos-mesh
spec:
  mode: one
  duration: '60s'
  selector:
    namespaces:
      - production
  stressors:
    cpu:
      workers: 2             # number of stress workers
      load: 90               # target CPU load per worker, in percent

IO Experiments

apiVersion: chaos-mesh.org/v1alpha1
kind: IOChaos
metadata:
  name: io-latency
  namespace: chaos-mesh
spec:
  action: latency
  mode: one
  duration: '30s'
  selector:
    namespaces:
      - production
  volumePath: /data          # mount path of the volume to inject into
  delay: '100ms'             # latency added to each affected I/O operation

Building a Chaos Practice

┌─────────────────────────────────────────────────────────────────────────────┐
│                    Building Chaos Engineering Practice                         │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│   ┌───────────────────────────────────────────────────────────────────────┐ │
│   │                    Starting Your Practice                              │ │
│   │                                                                       │ │
│   │   1. Start Small                                                      │ │
│   │      - Non-critical services first                                   │ │
│   │      - Low impact experiments                                         │ │
│   │      - Scheduled maintenance windows                                  │ │
│   │                                                                       │ │
│   │   2. Get Buy-in                                                       │ │
│   │      - Present to leadership                                          │ │
│   │      - Show business value                                           │ │
│   │      - Define success metrics                                         │ │
│   │                                                                       │ │
│   │   3. Automate                                                         │ │
│   │      - Integrate with CI/CD                                           │ │
│   │      - Run regularly                                                 │ │
│   │      - Continuous improvement                                         │ │
│   │                                                                       │ │
│   │   4. Measure                                                          │ │
│   │      - Track MTTR (Mean Time to Recovery)                            │ │
│   │      - Measure resilience scores                                     │ │
│   │      - Document learnings                                            │ │
│   │                                                                       │ │
│   └───────────────────────────────────────────────────────────────────────┘ │
│                                                                             │
│   Safety First:                                                            │
│   ✓ Always have a rollback plan                                            │
│   ✓ Start with experiments that match your risk tolerance                 │
│   ✓ Never experiment in production without proper safeguards              │
│   ✓ Communicate with stakeholders before experiments                      │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

Chaos Monkey for Kubernetes

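// chaos-monkey/main.go (simplified)
package main

import (
    "context"
    "fmt"
    "math/rand"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

// Config holds chaos monkey settings; its fields are omitted in this
// simplified listing.
type Config struct{}

type ChaosMonkey struct {
    client *kubernetes.Clientset
    config Config
}

// killRandomPod deletes one randomly chosen pod labeled app=myapp in the
// given namespace. client-go v0.18 and later take a context as the first
// argument of List and Delete.
func (cm *ChaosMonkey) killRandomPod(ctx context.Context, namespace string) error {
    pods, err := cm.client.CoreV1().Pods(namespace).List(
        ctx,
        metav1.ListOptions{LabelSelector: "app=myapp"},
    )
    if err != nil {
        return fmt.Errorf("listing pods in %s: %w", namespace, err)
    }
    if len(pods.Items) == 0 {
        return nil // nothing to kill
    }

    randomPod := pods.Items[rand.Intn(len(pods.Items))]
    err = cm.client.CoreV1().Pods(namespace).Delete(
        ctx,
        randomPod.Name,
        metav1.DeleteOptions{},
    )
    if err != nil {
        return fmt.Errorf("deleting pod %s: %w", randomPod.Name, err)
    }
    fmt.Printf("Deleted pod: %s\n", randomPod.Name)
    return nil
}

// injectNetworkLatency is a stub: a real implementation would add network
// latency to pods in the namespace.
func (cm *ChaosMonkey) injectNetworkLatency(namespace string, delay string) {
    fmt.Printf("Would inject %s latency into namespace %s\n", delay, namespace)
}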

Summary

In this chapter, you learned:

Chaos Engineering: What it is and why it matters
Tools: LitmusChaos, Chaos Mesh, Gremlin, AWS Fault Injection Simulator
Kubernetes Experiments: Pod, network, resource, IO chaos
Building a Practice: How to start and scale chaos engineering
Safety: Best practices for safe experimentation

Next Steps

Chapter 52: Service Mesh