Chapter 96: DevOps and SysAdmin Best Practices

Production System Administration Guidelines

96.1 Automation Principles

Core Principles

                 DevOps Principles
+------------------------------------------------------------------+
|                                                                  |
|    1. Automate Everything                                        |
|    +----------------------------------------------------------+  |
|    | • Manual processes are error-prone                       |  |
|    | • Scripts for all repetitive tasks                       |  |
|    | • Configuration management (Ansible, Puppet, Chef)       |  |
|    +----------------------------------------------------------+  |
|                                                                  |
|    2. Idempotent Configurations                                  |
|    +----------------------------------------------------------+  |
|    | • Running multiple times produces the same result        |  |
|    | • Ansible: most modules idempotent (not command/shell)   |  |
|    +----------------------------------------------------------+  |
|                                                                  |
|    3. Infrastructure as Code                                     |
|    +----------------------------------------------------------+  |
|    | • Keep infrastructure definitions in version control     |  |
|    | • GitOps workflow                                        |  |
|    | • Declarative definitions                                |  |
|    +----------------------------------------------------------+  |
|                                                                  |
|    4. Immutable Infrastructure                                   |
|    +----------------------------------------------------------+  |
|    | • Don't modify running systems                           |  |
|    | • Replace them with new versions                         |  |
|    | • Containers, golden images                              |  |
|    +----------------------------------------------------------+  |
|                                                                  |
+------------------------------------------------------------------+
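
Principle 2 is easiest to see in code. The following is a minimal Python sketch of an idempotent operation: it checks the current state first and only changes the file when the desired line is missing. The function name `ensure_line` and the sample file name are hypothetical, chosen for illustration; this is not how any particular configuration management tool is implemented.

```python
import tempfile
from pathlib import Path


def ensure_line(path: Path, line: str) -> bool:
    """Make sure `line` is present in the file at `path`.

    Returns True if the file was changed, False if it was already
    in the desired state (hypothetical helper, for illustration).
    """
    existing = path.read_text().splitlines() if path.exists() else []
    if line in existing:
        return False  # desired state already reached: do nothing
    existing.append(line)
    path.write_text("\n".join(existing) + "\n")
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        conf = Path(tmp) / "example.conf"  # made-up file name
        print(ensure_line(conf, "PermitRootLogin no"))  # first run changes the file
        print(ensure_line(conf, "PermitRootLogin no"))  # second run is a no-op
```

Reporting whether anything changed, rather than blindly appending, is what makes the function safe to run from a scheduled job or on every deployment.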

96.2 Monitoring Best Practices

Key Metrics to Track

# Infrastructure Metrics
- CPU usage (per core, overall)
- Memory usage (used, free, cached, swap)
- Disk I/O (IOPS, throughput, latency)
- Network (bandwidth, packets, errors)
- Disk space usage

# Application Metrics
- Request rate (RPM/RPS)
- Response time (p50, p95, p99)
- Error rate (5xx, exceptions)
- Active connections
- Queue depth

# Business Metrics
- User signups
- Transactions
- Revenue
- API calls
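
As a rough illustration of the response-time and error-rate metrics listed above, the sketch below computes percentiles with the nearest-rank method and a 5xx error rate. The function names and the sample data are made up for this example; a real monitoring system would normally compute these from histograms or streaming data rather than raw lists.

```python
import math
from statistics import mean


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least
    pct percent of all samples at or below it (0 < pct <= 100)."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def error_rate(status_codes):
    """Fraction of responses whose HTTP status code is 5xx."""
    if not status_codes:
        return 0.0
    errors = sum(1 for code in status_codes if 500 <= code <= 599)
    return errors / len(status_codes)


if __name__ == "__main__":
    latencies_ms = [10, 20, 30, 40, 50, 60, 70, 80, 90, 1000]  # made-up data
    print("mean:", mean(latencies_ms))
    for pct in (50, 95, 99):
        print(f"p{pct}:", percentile(latencies_ms, pct))
    print("error rate:", error_rate([200, 200, 500, 404, 503]))
```

With this sample the median is 50 ms while p95 and p99 are both 1000 ms, which is why tail percentiles are tracked alongside averages: one slow request barely moves the median but dominates the tail.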

Alerting Principles

                 Alert Best Practices
+------------------------------------------------------------------+
|                                                                  |
|    1. Signal-to-Noise Ratio                                      |
|    +----------------------------------------------------------+  |
|    | • Only alert on actionable issues                        |  |
|    | • Avoid alert fatigue                                    |  |
|    +----------------------------------------------------------+  |
|                                                                  |
|    2. Severity Levels                                            |
|    +----------------------------------------------------------+  |
|    | • Critical (immediate action)                            |  |
|    | • Warning (investigate soon)                             |  |
|    | • Info (no action needed; log, don't page)               |  |
|    +----------------------------------------------------------+  |
|                                                                  |
|    3. Runbooks                                                   |
|    +----------------------------------------------------------+  |
|    | • Document how to respond to each alert                  |  |
|    | • Include escalation paths                               |  |
|    +----------------------------------------------------------+  |
|                                                                  |
+------------------------------------------------------------------+
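
One possible way to enforce these principles is to describe alerts as data, route them by severity, and check that every actionable alert has a runbook. The sketch below assumes a hypothetical `Alert` record and made-up route names ("page", "ticket", "log"); it is not the API of any specific alerting tool.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Severity(Enum):
    CRITICAL = "critical"  # immediate action
    WARNING = "warning"    # investigate soon
    INFO = "info"          # no action needed


@dataclass(frozen=True)
class Alert:
    name: str
    severity: Severity
    runbook: str = ""  # hypothetical path or link to the runbook


# Assumed routing policy: only critical alerts interrupt a human.
ROUTES = {
    Severity.CRITICAL: "page",
    Severity.WARNING: "ticket",
    Severity.INFO: "log",
}


def route(alert: Alert) -> str:
    """Return the notification channel for an alert."""
    return ROUTES[alert.severity]


def missing_runbooks(alerts: List[Alert]) -> List[str]:
    """Names of actionable (non-info) alerts that have no runbook."""
    return [
        a.name for a in alerts
        if a.severity is not Severity.INFO and not a.runbook
    ]


if __name__ == "__main__":
    alerts = [
        Alert("ApiDown", Severity.CRITICAL, "runbooks/api-down.txt"),
        Alert("DiskAlmostFull", Severity.WARNING),
        Alert("DeployFinished", Severity.INFO),
    ]
    for a in alerts:
        print(a.name, "->", route(a))
    print("missing runbooks:", missing_runbooks(alerts))
```

A check like `missing_runbooks` could run in CI so that an alert cannot be added without documentation on how to respond to it.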

96.3 Security Best Practices

Defense in Depth

                 Security Layers
+------------------------------------------------------------------+
|                                                                  |
|    1. Network Security                                           |
|    +----------------------------------------------------------+  |
|    | • Firewalls (host, network)                              |  |
|    | • Segmentation (VPCs, VLANs)                             |  |
|    | • WAF for web applications                               |  |
|    +----------------------------------------------------------+  |
|                                                                  |
|    2. System Hardening                                           |
|    +----------------------------------------------------------+  |
|    | • Principle of least privilege                           |  |
|    | • Regular patching and updates                           |  |
|    | • Disable unnecessary services                           |  |
|    +----------------------------------------------------------+  |
|                                                                  |
|    3. Data Security                                              |
|    +----------------------------------------------------------+  |
|    | • Encryption at rest                                     |  |
|    | • Encryption in transit (TLS)                            |  |
|    | • Key management (secrets manager, vault)                |  |
|    +----------------------------------------------------------+  |
|                                                                  |
|    4. Monitoring and Response                                    |
|    +----------------------------------------------------------+  |
|    | • Audit logging                                          |  |
|    | • Intrusion detection                                    |  |
|    | • Incident response plan                                 |  |
|    +----------------------------------------------------------+  |
|                                                                  |
+------------------------------------------------------------------+
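
As one small, concrete instance of least privilege, the sketch below checks whether a file such as a private key or credentials file is accessible only to its owner, and tightens the permissions if not. It assumes a POSIX system; the helper names are hypothetical and the temporary file stands in for a real secret.

```python
import os
import stat
import tempfile


def is_owner_only(path: str) -> bool:
    """True if neither group nor others have any permission bits set."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return mode & (stat.S_IRWXG | stat.S_IRWXO) == 0


def restrict_to_owner(path: str) -> None:
    """Set permissions to owner read/write only (0o600).

    Safe to run repeatedly: the end state is the same each time.
    """
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        secret = os.path.join(tmp, "example.key")  # stand-in for a real secret
        with open(secret, "w") as fh:
            fh.write("not a real key\n")
        os.chmod(secret, 0o644)  # simulate an overly permissive file
        print("before:", is_owner_only(secret))
        restrict_to_owner(secret)
        print("after:", is_owner_only(secret))
```

The same check-then-fix pattern applies to the other layers: detect drift from the hardened baseline, then correct it automatically rather than by hand.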

96.4 Backup Best Practices

3-2-1 Rule

                 Backup Strategy
+------------------------------------------------------------------+
|                                                                  |
|    3 Copies of data                                              |
|    2 Different storage types                                     |
|    1 Offsite copy                                                |
|                                                                  |
|    Testing:                                                      |
|    +----------------------------------------------------------+  |
|    | • Test restores regularly                                |  |
|    | • Document recovery procedures                           |  |
|    | • Automate recovery testing                              |  |
|    +----------------------------------------------------------+  |
|                                                                  |
+------------------------------------------------------------------+
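
The rule and the testing advice can both be expressed as simple automated checks. The sketch below assumes a hypothetical `DataCopy` record describing where each copy lives, and it counts the production copy as one of the three, which is a common reading of the rule but an assumption here. The restore check compares SHA-256 digests of an original file and its restored counterpart.

```python
import hashlib
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class DataCopy:
    location: str   # e.g. "primary-db", "nas", "cloud-bucket" (made-up names)
    media: str      # storage type, e.g. "ssd", "tape", "object-storage"
    offsite: bool


def satisfies_3_2_1(copies: Iterable[DataCopy]) -> bool:
    """At least 3 copies, on at least 2 storage types, with at least 1 offsite."""
    copies = list(copies)
    return (
        len(copies) >= 3
        and len({c.media for c in copies}) >= 2
        and any(c.offsite for c in copies)
    )


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hex SHA-256 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def restore_matches(original_path: str, restored_path: str) -> bool:
    """True if the restored file is byte-for-byte identical to the original."""
    return sha256_of(original_path) == sha256_of(restored_path)


if __name__ == "__main__":
    plan = [
        DataCopy("primary-db", "ssd", offsite=False),
        DataCopy("nas", "hdd", offsite=False),
        DataCopy("cloud-bucket", "object-storage", offsite=True),
    ]
    print("3-2-1 satisfied:", satisfies_3_2_1(plan))
```

A scheduled job that restores a sample backup into a scratch location and calls `restore_matches` is one way to automate recovery testing.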

96.5 Documentation Standards

What to Document

# System Documentation
- Architecture diagrams
- Network topology
- IP addressing scheme
- Service dependencies

# Runbooks
- Deployment procedures
- Troubleshooting guides
- Emergency contacts
- Rollback procedures

# Configuration
- All configuration files and settings
- The reason for each change
- Approval records

96.6 Interview Questions

Basic Questions

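What is Infrastructure as Code?
- Managing and provisioning infrastructure through version-controlled, declarative definitions rather than manual changes
What is the 3-2-1 backup rule?
- Keep 3 copies of the data on 2 different storage types, with 1 copy offsite
What is the principle of least privilege?
- Granting each user, process, or service only the minimum access it needs to do its job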

Summary

                 Quick Reference
+------------------------------------------------------------------+
|                                                                  |
|    Key Principles:                                               |
|    +----------------------------------------------------------+  |
|    | Automate everything                                      |  |
|    | Monitor proactively                                      |  |
|    | Apply defense in depth                                   |  |
|    | Test backups regularly                                   |  |
|    | Document everything                                      |  |
|    +----------------------------------------------------------+  |
|                                                                  |
+------------------------------------------------------------------+