Chapter 51: Chaos Engineering
Chaos Engineering is the practice of intentionally introducing failures into a system to test its resilience and discover weaknesses before they cause outages in production.
What is Chaos Engineering?
Overview of the chaos engineering process:

1. Define Steady State
   Measure normal behavior (latency, error rate, throughput)
2. Inject Failure
   Kill pods, introduce latency, simulate network failures
3. Observe
   Monitor system behavior and collect data
4. Learn & Improve
   Analyze results, fix issues, repeat

Principle: Build confidence through experiments

Chaos Engineering Tools
- LitmusChaos: Kubernetes, CNCF project, many experiments
- Chaos Mesh: Kubernetes, cloud-native, time travel
- Gremlin: SaaS, easy to use, safe mode
- AWS Fault Injection Simulator: AWS-specific, controlled

Kubernetes Chaos Experiments
Using LitmusChaos
    apiVersion: litmuschaos.io/v1alpha1
    kind: ChaosEngine
    metadata:
      name: pod-kill-chaos
      namespace: litmus
    spec:
      appinfo:
        appns: "production"
        applabel: "app=myapp"
      chaosServiceAccount: litmus-admin
      experiments:
        - name: pod-delete
          spec:
            components:
              env:
                - name: TOTAL_CHAOS_DURATION
                  value: '30'
                - name: CHAOS_INTERVAL
                  value: '10'
                - name: FORCE
                  value: 'false'

Using Chaos Mesh
    apiVersion: chaos-mesh.org/v1alpha1
    kind: PodChaos
    metadata:
      name: pod-kill-example
      namespace: chaos-mesh
    spec:
      action: pod-kill
      mode: one
      duration: '30s'
      selector:
        namespaces:
          - production
        labelSelectors:
          app: myapp

Common Chaos Experiments
Network Experiments
    apiVersion: chaos-mesh.org/v1alpha1
    kind: NetworkChaos
    metadata:
      name: network-partition
      namespace: chaos-mesh
    spec:
      action: partition
      mode: one
      duration: '60s'
      selector:
        namespaces:
          - production
      direction: both
      target:
        mode: all  # the target is a pod selector too, so it needs its own mode
        selector:
          namespaces:
            - production
          labelSelectors:
            role: database

Resource Experiments
    apiVersion: chaos-mesh.org/v1alpha1
    kind: StressChaos
    metadata:
      name: stress-cpu
      namespace: chaos-mesh
    spec:
      mode: one
      duration: '60s'
      selector:
        namespaces:
          - production
      stressors:
        cpu:
          workers: 2
          load: 90

IO Experiments
    apiVersion: chaos-mesh.org/v1alpha1
    kind: IOChaos
    metadata:
      name: io-latency
      namespace: chaos-mesh
    spec:
      action: latency
      mode: one
      duration: '30s'
      selector:
        namespaces:
          - production
      volumePath: /data  # field name is camelCase
      delay: '100ms'     # the latency action takes its value in the delay field

Building a Chaos Practice
Starting Your Practice

1. Start Small
   - Non-critical services first
   - Low-impact experiments
   - Scheduled maintenance windows
2. Get Buy-in
   - Present to leadership
   - Show business value
   - Define success metrics
3. Automate
   - Integrate with CI/CD
   - Run regularly
   - Continuous improvement
4. Measure
   - Track MTTR (Mean Time to Recovery)
   - Measure resilience scores
   - Document learnings

Safety First:
✓ Always have a rollback plan
✓ Start with experiments that match your risk tolerance
✓ Never experiment in production without proper safeguards
✓ Communicate with stakeholders before experiments

Chaos Monkey for Kubernetes
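    // chaos-monkey/main.go (simplified)
    package main

    import (
    	"context"
    	"fmt"
    	"log"
    	"math/rand"

    	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    	"k8s.io/client-go/kubernetes"
    	"k8s.io/client-go/rest"
    )

    // Config holds the settings for a chaos run.
    type Config struct {
    	LabelSelector string // pods eligible for deletion, e.g. "app=myapp"
    }

    type ChaosMonkey struct {
    	client *kubernetes.Clientset
    	config Config
    }

    // killRandomPod deletes one randomly chosen pod that matches the label selector.
    func (cm *ChaosMonkey) killRandomPod(ctx context.Context, namespace string) error {
    	pods, err := cm.client.CoreV1().Pods(namespace).List(ctx,
    		metav1.ListOptions{LabelSelector: cm.config.LabelSelector})
    	if err != nil {
    		return fmt.Errorf("listing pods: %w", err)
    	}
    	if len(pods.Items) == 0 {
    		return nil // nothing to kill
    	}

    	randomPod := pods.Items[rand.Intn(len(pods.Items))]
    	if err := cm.client.CoreV1().Pods(namespace).Delete(ctx,
    		randomPod.Name, metav1.DeleteOptions{}); err != nil {
    		return fmt.Errorf("deleting pod %s: %w", randomPod.Name, err)
    	}
    	fmt.Printf("Deleted pod: %s\n", randomPod.Name)
    	return nil
    }

    // injectNetworkLatency is a placeholder: it only logs. A real
    // implementation would add network latency.
    func (cm *ChaosMonkey) injectNetworkLatency(namespace string, delay string) {
    	fmt.Printf("Injected %s latency to namespace %s\n", delay, namespace)
    }

    func main() {
    	restCfg, err := rest.InClusterConfig() // assumes the program runs inside the cluster
    	if err != nil {
    		log.Fatalf("loading cluster config: %v", err)
    	}
    	client, err := kubernetes.NewForConfig(restCfg)
    	if err != nil {
    		log.Fatalf("creating client: %v", err)
    	}
    	cm := &ChaosMonkey{client: client, config: Config{LabelSelector: "app=myapp"}}
    	if err := cm.killRandomPod(context.Background(), "production"); err != nil {
    		log.Fatalf("chaos run failed: %v", err)
    	}
    }

Summary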
In this chapter, you learned:
- Chaos Engineering: What it is and why it matters
- Tools: LitmusChaos, Chaos Mesh, Gremlin, AWS Fault Injection Simulator
- Kubernetes Experiments: Pod, network, resource, IO chaos
- Building a Practice: How to start and scale chaos engineering
- Safety: Best practices for safe experimentation