Chaos Engineering

Chaos Engineering is the practice of intentionally introducing failures into a system to test its resilience and discover weaknesses before they cause outages in production.

Chaos Engineering Overview

Chaos Engineering Process

  1. Define Steady State
     Measure normal behavior (latency, error rate, throughput)
  2. Inject Failure
     Kill pods, introduce latency, simulate network failures
  3. Observe
     Monitor system behavior and collect data
  4. Learn & Improve
     Analyze results, fix issues, repeat

Principle: Build confidence through experiments
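
The "Define Steady State" and "Observe" steps above boil down to comparing observed metrics against a baseline. The following is a minimal sketch of such a check, not part of any chaos tool: the `Metrics` and `Tolerance` types, the `WithinSteadyState` function, and all numbers are assumptions made up for illustration.

```go
package main

import "fmt"

// Metrics is a hypothetical snapshot of the three signals named above.
type Metrics struct {
	P99LatencyMs float64 // 99th percentile latency in milliseconds
	ErrorRate    float64 // fraction of failed requests, 0.0 to 1.0
	Throughput   float64 // requests per second
}

// Tolerance sets how far an observation may drift from the baseline.
// The limits are illustrative; choose values that match your own SLOs.
type Tolerance struct {
	MaxLatencyIncrease   float64 // relative, e.g. 0.20 allows +20% latency
	MaxErrorRateIncrease float64 // absolute, e.g. 0.01 allows +1 percentage point
	MaxThroughputDrop    float64 // relative, e.g. 0.10 allows -10% throughput
}

// WithinSteadyState reports whether observed stays inside the tolerance
// around baseline, and names each signal that violates it.
func WithinSteadyState(baseline, observed Metrics, tol Tolerance) (bool, []string) {
	var violations []string
	if observed.P99LatencyMs > baseline.P99LatencyMs*(1+tol.MaxLatencyIncrease) {
		violations = append(violations, "latency")
	}
	if observed.ErrorRate > baseline.ErrorRate+tol.MaxErrorRateIncrease {
		violations = append(violations, "error rate")
	}
	if observed.Throughput < baseline.Throughput*(1-tol.MaxThroughputDrop) {
		violations = append(violations, "throughput")
	}
	return len(violations) == 0, violations
}

func main() {
	baseline := Metrics{P99LatencyMs: 200, ErrorRate: 0.001, Throughput: 1000}
	tol := Tolerance{MaxLatencyIncrease: 0.20, MaxErrorRateIncrease: 0.01, MaxThroughputDrop: 0.10}

	// Observation taken while a fault is injected (made-up numbers).
	duringChaos := Metrics{P99LatencyMs: 310, ErrorRate: 0.004, Throughput: 950}

	ok, violations := WithinSteadyState(baseline, duringChaos, tol)
	fmt.Println(ok, violations) // prints: false [latency]
}
```

In practice the two snapshots would come from your monitoring system; a failed check is the signal to stop the experiment and investigate.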

Chaos Engineering Tools

  • LitmusChaos
      - Kubernetes
      - CNCF project
      - Many experiments
  • Chaos Mesh
      - Kubernetes
      - Cloud-native
      - Time travel
  • Gremlin
      - SaaS
      - Easy to use
      - Safe mode
  • AWS Fault Injection Simulator
      - AWS-specific
      - Controlled
chaosengine.yaml
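apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: pod-kill-chaos
  namespace: litmus
spec:
  appinfo:
    appns: "production"      # namespace of the target application
    applabel: "app=myapp"    # label selecting the target pods
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            # Total length of the experiment, in seconds
            - name: TOTAL_CHAOS_DURATION
              value: '30'
            # Seconds between successive pod deletions
            - name: CHAOS_INTERVAL
              value: '10'
            # 'false' deletes pods gracefully; 'true' forces immediate deletion
            - name: FORCE
              value: 'false'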
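
To see how `TOTAL_CHAOS_DURATION` and `CHAOS_INTERVAL` interact, here is a rough sketch of the resulting deletion schedule. It is a simplified model, not LitmusChaos code: the `KillOffsets` helper is hypothetical, and it ignores the time spent on the deletion itself and on health checks, so a real run may fit fewer rounds.

```go
package main

import (
	"fmt"
	"time"
)

// KillOffsets returns the offsets from experiment start at which a pod
// deletion round would begin, assuming one round every interval until
// total has elapsed. This is a simplified model of the two settings.
func KillOffsets(total, interval time.Duration) []time.Duration {
	if total <= 0 || interval <= 0 {
		return nil
	}
	var offsets []time.Duration
	for t := time.Duration(0); t < total; t += interval {
		offsets = append(offsets, t)
	}
	return offsets
}

func main() {
	// Values from the ChaosEngine above: TOTAL_CHAOS_DURATION=30, CHAOS_INTERVAL=10.
	offsets := KillOffsets(30*time.Second, 10*time.Second)
	fmt.Println(len(offsets), offsets) // prints: 3 [0s 10s 20s]
}
```

Under this model the example configuration deletes a pod roughly three times in 30 seconds.
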
pod-kill.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-kill-example
  namespace: chaos-mesh
spec:
  action: pod-kill
  mode: one
  duration: '30s'
  selector:
    namespaces:
      - production
    labelSelectors:
      app: myapp
network-partition.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-partition
  namespace: chaos-mesh
spec:
  action: partition
  mode: one
  duration: '60s'
  selector:
    namespaces:
      - production
  direction: both
  target:
    # Added: Chaos Mesh 2.x expects a mode for the target; 'all' is an assumed choice
    mode: all
    selector:
      namespaces:
        - production
      labelSelectors:
        role: database
stress-cpu.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: stress-cpu
  namespace: chaos-mesh
spec:
  mode: one
  duration: '60s'
  selector:
    namespaces:
      - production
  stressors:
    cpu:
      workers: 2
      load: 90
io-latency.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: IOChaos
metadata:
  name: io-latency
  namespace: chaos-mesh
spec:
  action: latency
  mode: one
  duration: '30s'
  selector:
    namespaces:
      - production
  volumePath: /data   # mount point of the volume inside the target container
  delay: '100ms'      # latency added to each affected I/O operation

Building Chaos Engineering Practice

Starting Your Practice

  1. Start Small
     - Non-critical services first
     - Low impact experiments
     - Scheduled maintenance windows

  2. Get Buy-in
     - Present to leadership
     - Show business value
     - Define success metrics

  3. Automate
     - Integrate with CI/CD
     - Run regularly
     - Continuous improvement

  4. Measure
     - Track MTTR (Mean Time to Recovery)
     - Measure resilience scores
     - Document learnings

Safety First:
  ✓ Always have a rollback plan
  ✓ Start with experiments that match your risk tolerance
  ✓ Never experiment in production without proper safeguards
  ✓ Communicate with stakeholders before experiments
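
The rollback and MTTR points above can be combined in a small experiment wrapper. This is a sketch under stated assumptions rather than the API of any tool: `Experiment`, `Run`, `MeanRecovery`, and the hook names are all hypothetical, and the fault in `main` is simulated in memory.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Experiment bundles hypothetical hooks supplied by the caller.
type Experiment struct {
	Name     string
	Inject   func() error // starts the fault
	Rollback func() error // undoes the fault
	Healthy  func() bool  // steady-state probe, e.g. an error-rate check
}

// ErrNotRecovered is returned when the system stays unhealthy after rollback.
var ErrNotRecovered = errors.New("system did not recover in time")

// Run refuses to start on an unhealthy system, holds the fault for hold,
// rolls it back, then polls Healthy until recovery or maxRecovery elapses.
// It returns the time to recovery, measured from the rollback.
func Run(e Experiment, hold, poll, maxRecovery time.Duration) (time.Duration, error) {
	if !e.Healthy() {
		return 0, fmt.Errorf("%s: system unhealthy before start, not injecting", e.Name)
	}
	if err := e.Inject(); err != nil {
		_ = e.Rollback() // best-effort cleanup of a half-applied fault
		return 0, fmt.Errorf("%s: inject failed: %w", e.Name, err)
	}
	time.Sleep(hold) // keep the fault active for the planned duration
	if err := e.Rollback(); err != nil {
		return 0, fmt.Errorf("%s: rollback failed: %w", e.Name, err)
	}
	start := time.Now()
	for time.Since(start) < maxRecovery {
		if e.Healthy() {
			return time.Since(start), nil
		}
		time.Sleep(poll)
	}
	return time.Since(start), ErrNotRecovered
}

// MeanRecovery averages recovery times from several runs (an MTTR estimate).
func MeanRecovery(samples []time.Duration) time.Duration {
	if len(samples) == 0 {
		return 0
	}
	var sum time.Duration
	for _, s := range samples {
		sum += s
	}
	return sum / time.Duration(len(samples))
}

func main() {
	// Simulated system: unhealthy while the fault is active and for the
	// first three probes after rollback.
	faultActive, probesLeft := false, 0
	exp := Experiment{
		Name:     "demo",
		Inject:   func() error { faultActive = true; return nil },
		Rollback: func() error { faultActive = false; probesLeft = 3; return nil },
		Healthy: func() bool {
			if faultActive {
				return false
			}
			if probesLeft > 0 {
				probesLeft--
				return false
			}
			return true
		},
	}
	recovery, err := Run(exp, 20*time.Millisecond, 5*time.Millisecond, time.Second)
	fmt.Println("recovered:", err == nil, "after", recovery)
	fmt.Println("MTTR estimate:", MeanRecovery([]time.Duration{2 * time.Second, 4 * time.Second}))
}
```

The pre-flight health check is the simplest safeguard: an experiment that starts on an already degraded system teaches little and risks making an incident worse.
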
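
// chaos-monkey/main.go (Simplified)
package main

import (
	"context"
	"fmt"
	"math/rand"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// Config is a placeholder for experiment settings; its fields are not shown here.
type Config struct{}

// ChaosMonkey deletes pods through the Kubernetes API.
type ChaosMonkey struct {
	client *kubernetes.Clientset
	config Config
}

// killRandomPod deletes one randomly chosen pod labeled app=myapp
// in the given namespace.
func (cm *ChaosMonkey) killRandomPod(ctx context.Context, namespace string) error {
	pods, err := cm.client.CoreV1().Pods(namespace).List(ctx,
		metav1.ListOptions{LabelSelector: "app=myapp"},
	)
	if err != nil {
		return fmt.Errorf("listing pods in %s: %w", namespace, err)
	}
	if len(pods.Items) == 0 {
		return nil // nothing to delete
	}

	randomPod := pods.Items[rand.Intn(len(pods.Items))]
	err = cm.client.CoreV1().Pods(namespace).Delete(ctx, randomPod.Name, metav1.DeleteOptions{})
	if err != nil {
		return fmt.Errorf("deleting pod %s: %w", randomPod.Name, err)
	}
	fmt.Printf("Deleted pod: %s\n", randomPod.Name)
	return nil
}

// injectNetworkLatency is a stub: a real implementation would add network
// latency. This version only logs the request.
func (cm *ChaosMonkey) injectNetworkLatency(namespace string, delay string) {
	fmt.Printf("Requested %s latency for namespace %s (not implemented)\n", delay, namespace)
}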
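
One common way to fill in the `injectNetworkLatency` stub is the Linux `tc` tool with the `netem` queueing discipline. The sketch below only builds the command line and does not run it. It assumes that `tc` (from iproute2) is available in the target's network namespace, that the process has the NET_ADMIN capability, and that the interface is named `eth0`; the `netemDelayArgs` helper is a hypothetical name.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
	"time"
)

// netemDelayArgs returns the argv for a tc command that adds a fixed
// egress delay on iface. delay uses Go duration syntax, e.g. "100ms".
func netemDelayArgs(iface, delay string) ([]string, error) {
	if iface == "" {
		return nil, errors.New("interface name is required")
	}
	d, err := time.ParseDuration(delay)
	if err != nil {
		return nil, fmt.Errorf("invalid delay %q: %w", delay, err)
	}
	if d < time.Millisecond {
		return nil, fmt.Errorf("delay %q is below 1ms", delay)
	}
	ms := fmt.Sprintf("%dms", d.Milliseconds())
	return []string{"tc", "qdisc", "add", "dev", iface, "root", "netem", "delay", ms}, nil
}

func main() {
	args, err := netemDelayArgs("eth0", "100ms")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(strings.Join(args, " ")) // prints: tc qdisc add dev eth0 root netem delay 100ms
}
```

To execute it, pass the slice to `exec.Command(args[0], args[1:]...)` from `os/exec`. The matching rollback is `tc qdisc del dev eth0 root netem`, which should run even if the experiment is interrupted.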

In this chapter, you learned:

  • Chaos Engineering: What it is and why it matters
  • Tools: LitmusChaos, Chaos Mesh, Gremlin, AWS Fault Injection Simulator
  • Kubernetes Experiments: Pod, network, resource, IO chaos
  • Building a Practice: How to start and scale chaos engineering
  • Safety: Best practices for safe experimentation