Chapter 95: Cloud Infrastructure Monitoring

Overview

Cloud monitoring is essential for keeping applications performant and available and for controlling cost in cloud environments. This chapter covers the native monitoring services of AWS, Azure, and GCP, including metrics collection, alerting, and logging, and touches on distributed tracing. Understanding these cloud-native tools is critical for DevOps and SRE roles.

95.1 AWS CloudWatch

CloudWatch Overview

┌─────────────────────────────────────────────────────────────────────────┐
│                      AWS CLOUDWATCH ARCHITECTURE                         │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│   ┌─────────────────────────────────────────────────────────────────┐   │
│   │                    CloudWatch Components                           │   │
│   ├─────────────────────────────────────────────────────────────────┤   │
│   │                                                                  │   │
│   │  Metrics:                                                      │   │
│   │  - CPU, Memory, Network, Disk                                 │   │
│   │  - Custom application metrics                                  │   │
│   │  - High-resolution metrics (1 second)                        │   │
│   │                                                                  │   │
│   │  Logs:                                                         │   │
│   │  - Centralized log storage                                    │   │
│   │  - Log insights queries                                       │   │
│   │  - CloudWatch agent (log collection)                            │   │
│   │                                                                  │   │
│   │  Alarms:                                                       │   │
│   │  - Threshold-based alerts                                     │   │
│   │  - Composite alarms                                           │   │
│   │  - SNS notifications                                           │   │
│   │                                                                  │   │
│   │  Dashboards:                                                   │   │
│   │  - Custom visualizations                                      │   │
│   │  - Multi-region views                                         │   │
│   │                                                                  │   │
│   │  Events (Amazon EventBridge):                                   │   │
│   │  - Scheduled events                                           │   │
│   │  - Event-driven automation                                    │   │
│   │                                                                  │   │
│   └─────────────────────────────────────────────────────────────────┘   │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘

Core Metrics and Alarms

# ============================================================
# CLOUDWATCH METRICS
# ============================================================

# List available metrics
aws cloudwatch list-metrics --namespace AWS/EC2

# EC2 Detailed Monitoring
# Enable detailed monitoring for 1-minute granularity
aws ec2 monitor-instances --instance-ids i-1234567890abcdef0

# Get metric statistics (hourly datapoints for one instance)
aws cloudwatch get-metric-statistics \
    --namespace AWS/EC2 \
    --metric-name CPUUtilization \
    --dimensions Name=InstanceId,Value=i-1234567890abcdef0 \
    --start-time 2024-01-01T00:00:00Z \
    --end-time 2024-01-02T00:00:00Z \
    --period 3600 \
    --statistics Average Maximum Minimum

# Get EC2 instance status checks
aws ec2 describe-instance-status \
    --instance-ids i-1234567890abcdef0

# Custom metrics
aws cloudwatch put-metric-data \
    --namespace "Custom/Application" \
    --metric-name RequestCount \
    --value 100 \
    --timestamp 2024-01-01T00:00:00Z

# Metric with dimensions
aws cloudwatch put-metric-data \
    --namespace Custom/App \
    --metric-name Latency \
    --value 45 \
    --timestamp 2024-01-01T00:00:00Z \
    --dimensions InstanceId=i-123,Environment=prod
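
The --period and time range passed to get-metric-statistics decide how many datapoints come back: one per period for each requested statistic. The arithmetic can be sketched as below. The function name datapoints is hypothetical (it is not part of the AWS CLI), times are assumed to be epoch seconds, and 1,440 is the per-call datapoint limit of GetMetricStatistics.

```shell
# Hypothetical helper: estimate how many datapoints a
# get-metric-statistics call would return.
# Args: start and end as epoch seconds, period in seconds.
datapoints() {
    local start=$1 end=$2 period=$3
    echo $(( (end - start) / period ))
}

# 2024-01-01T00:00:00Z to 2024-01-02T00:00:00Z at --period 3600
n=$(datapoints 1704067200 1704153600 3600)
echo "datapoints: $n"

# GetMetricStatistics returns at most 1,440 datapoints per call
if [ "$n" -gt 1440 ]; then
    echo "too many datapoints: raise --period or narrow the time range"
fi
```

If the estimate exceeds the limit, a larger --period or a shorter time range keeps the request within bounds.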

CloudWatch Alarms

# ============================================================
# CLOUDWATCH ALARMS
# ============================================================

# Create alarm
aws cloudwatch put-metric-alarm \
    --alarm-name high-cpu \
    --alarm-description "CPU > 80%" \
    --metric-name CPUUtilization \
    --namespace AWS/EC2 \
    --statistic Average \
    --period 300 \
    --threshold 80 \
    --comparison-operator GreaterThanThreshold \
    --evaluation-periods 2 \
    --alarm-actions arn:aws:sns:us-east-1:123456789012:alerts \
    --dimensions Name=InstanceId,Value=i-1234567890abcdef0

# Create composite alarm (each child alarm is referenced as ALARM("name"))
aws cloudwatch put-composite-alarm \
    --alarm-name "HighCPUOrMemory" \
    --alarm-rule 'ALARM("high-cpu") OR ALARM("high-memory")' \
    --alarm-actions arn:aws:sns:us-east-1:123456789012:alerts

# View alarms
aws cloudwatch describe-alarms

# Alarm history
aws cloudwatch describe-alarm-history \
    --alarm-name high-cpu
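
With --period 300 and --evaluation-periods 2, the high-cpu alarm fires only after two consecutive 5-minute averages exceed the threshold. The sketch below models that rule in plain shell. It is a simplification, not CloudWatch's actual evaluator: the function name alarm_state is hypothetical, datapoints are assumed to be integers, and missing-data handling and "M out of N" settings are ignored.

```shell
# Simplified model of a GreaterThanThreshold alarm (illustrative only).
# Args: threshold, evaluation periods, then integer datapoints, oldest first.
alarm_state() {
    local threshold=$1 periods=$2
    shift 2
    # Not enough datapoints yet to evaluate
    if [ "$#" -lt "$periods" ]; then
        echo INSUFFICIENT_DATA
        return
    fi
    # Keep only the most recent $periods datapoints
    shift $(( $# - periods ))
    local value
    for value in "$@"; do
        # Any datapoint at or below the threshold keeps the alarm in OK
        if [ "$value" -le "$threshold" ]; then
            echo OK
            return
        fi
    done
    echo ALARM
}

alarm_state 80 2 70 85 90   # last two periods are both above 80
alarm_state 80 2 85 90 75   # most recent period is back under 80
```

A single spike does not trip the alarm in this model, which is why a higher --evaluation-periods value trades detection speed for fewer false alerts.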

CloudWatch Logs

# ============================================================
# CLOUDWATCH LOGS
# ============================================================

# Create log group
aws logs create-log-group --log-group-name /aws/ec2/myapp

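# Create a log stream, then put log events
# (timestamps are epoch milliseconds; events older than 14 days are rejected)
aws logs create-log-stream \
    --log-group-name /aws/ec2/myapp \
    --log-stream-name app

aws logs put-log-events \
    --log-group-name /aws/ec2/myapp \
    --log-stream-name app \
    --log-events "[{\"timestamp\":$(date +%s000),\"message\":\"Application started\"}]"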

# Query logs
aws logs filter-log-events \
    --log-group-name /aws/ec2/myapp \
    --filter-pattern "ERROR"

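# Insights query (start-query takes epoch seconds, not milliseconds)
aws logs start-query \
    --log-group-name /aws/ec2/myapp \
    --start-time 1640995200 \
    --end-time 1641081600 \
    --query-string "fields @timestamp, @message | filter @message like /ERROR/ | sort @timestamp desc | limit 20"

# start-query returns a queryId; fetch the results with it
aws logs get-query-results --query-id "$QUERY_ID"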
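
CloudWatch Logs mixes two time units: start-query takes epoch seconds, while put-log-events and filter-log-events take epoch milliseconds. A minimal sketch for producing both from an ISO 8601 timestamp follows. The helper names are hypothetical, and the -d flag assumes GNU date (BSD/macOS date uses different options).

```shell
# Hypothetical helpers; assume GNU date.
# Convert an ISO 8601 UTC timestamp to epoch seconds.
to_epoch_s() {
    date -u -d "$1" +%s
}

# Same instant in epoch milliseconds (seconds with "000" appended).
to_epoch_ms() {
    echo "$(to_epoch_s "$1")000"
}

to_epoch_s  2022-01-01T00:00:00Z   # value for start-query
to_epoch_ms 2022-01-01T00:00:00Z   # value for filter-log-events
```

For example, the second helper's output can be passed as --start-time to filter-log-events, and the first helper's output to start-query.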

CloudWatch Agent

# ============================================================
# CLOUDWATCH AGENT
# ============================================================

# Install
wget https://s3.amazonaws.com/amazoncloudwatch-agent/amazon_linux/amd64/latest/amazon-cloudwatch-agent.rpm
sudo rpm -U ./amazon-cloudwatch-agent.rpm

# Configure via wizard
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-config-wizard

# Manual configuration
# (run_as_user belongs in the "agent" section of the agent's JSON config;
# the separate common-config.toml file holds proxy and credential settings)
sudo tee /opt/aws/amazon-cloudwatch-agent/etc/cloudwatch-agent.json > /dev/null << 'EOF'
{
  "agent": {
    "run_as_user": "root"
  },
  "metrics": {
    "namespace": "Custom/Metrics",
    "metrics_collected": {
      "cpu": {
        "measurement": ["cpu_usage_idle", "cpu_usage_user"],
        "metrics_collection_interval": 60
      },
      "mem": {
        "measurement": ["mem_used", "mem_available"],
        "metrics_collection_interval": 60
      },
      "disk": {
        "measurement": ["disk_used_percent"],
        "metrics_collection_interval": 60,
        "resources": ["/"]
      }
    }
  },
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "/var/log/myapp/*.log",
            "log_group_name": "/var/log/myapp",
            "log_stream_name": "{instance_id}"
          }
        ]
      }
    }
  }
}
EOF

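# Load the configuration and start the agent
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
    -a fetch-config -m ec2 -s \
    -c file:/opt/aws/amazon-cloudwatch-agent/etc/cloudwatch-agent.json

# Start on boot
sudo systemctl enable amazon-cloudwatch-agent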

95.2 Azure Monitor

Azure Monitoring Components

┌─────────────────────────────────────────────────────────────────────────┐
│                    AZURE MONITOR COMPONENTS                               │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│   ┌─────────────────────────────────────────────────────────────────┐   │
│   │  Metrics: Azure Monitor Metrics                                 │   │
│   │  - Platform metrics (built-in)                                │   │
│   │  - Custom metrics                                            │   │
│   │  - Guest OS metrics (via agent)                              │   │
│   │                                                                  │   │
│   │  Logs:                                                        │   │
│   │  - Log Analytics                                             │   │
│   │  - KQL queries                                               │   │
│   │  - Application Insights                                       │   │
│   │                                                                  │   │
│   │  Alerts:                                                      │   │
│   │  - Metric alerts                                             │   │
│   │  - Log alerts                                                │   │
│   │  - Smart detection                                           │   │
│   │                                                                  │   │
│   │  Application Monitoring:                                      │   │
│   │  - Application Insights                                       │   │
│   │  - Distributed tracing                                        │   │
│   │  - Live Metrics                                              │   │
│   │                                                                  │   │
│   └─────────────────────────────────────────────────────────────────┘   │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘

Azure Monitoring Commands

# ============================================================
# AZURE MONITOR
# ============================================================

# Get metrics (metric names are space-separated, each quoted)
az monitor metrics list \
    --resource /subscriptions/xxx/resourceGroups/mygroup/providers/Microsoft.Compute/virtualMachines/myvm \
    --metrics "Percentage CPU" "Available Memory Bytes"

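# Create metric alert (--scopes names the resource to watch)
az monitor metrics alert create \
    --name cpu-alert \
    --resource-group mygroup \
    --scopes /subscriptions/xxx/resourceGroups/mygroup/providers/Microsoft.Compute/virtualMachines/myvm \
    --condition "avg Percentage CPU > 80" \
    --description "CPU usage high" \
    --evaluation-frequency 1m \
    --window-size 5m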

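# Create log alert (scheduled query rule; the classic "az monitor alert"
# command is retired, and this one comes from the scheduled-query extension)
az monitor scheduled-query create \
    --name error-alert \
    --resource-group mygroup \
    --scopes "/subscriptions/xxx/..." \
    --condition "count 'ErrorQuery' > 0" \
    --condition-query ErrorQuery="Syslog | where SeverityLevel == 'err'" \
    --description "Error detected"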

# Log Analytics
az monitor log-analytics workspace create \
    --resource-group mygroup \
    --workspace-name myworkspace

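# Query logs (--workspace expects the workspace GUID, not its name)
WORKSPACE_ID=$(az monitor log-analytics workspace show \
    --resource-group mygroup \
    --workspace-name myworkspace \
    --query customerId -o tsv)
az monitor log-analytics query \
    --workspace "$WORKSPACE_ID" \
    --analytics-query "Syslog | where TimeGenerated > ago(1h) | summarize count() by SeverityLevel"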
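
Several of the commands above take a full Azure resource ID, which follows a fixed pattern of subscription, resource group, provider, and resource name. A small sketch of assembling one for a virtual machine is shown here. The function name vm_resource_id is hypothetical, and the all-zero subscription ID is a placeholder rather than a real subscription.

```shell
# Hypothetical helper: assemble the resource ID of a virtual machine.
# Args: subscription ID, resource group, VM name.
vm_resource_id() {
    printf '/subscriptions/%s/resourceGroups/%s/providers/Microsoft.Compute/virtualMachines/%s\n' \
        "$1" "$2" "$3"
}

# Placeholder subscription ID, not a real one
vm_resource_id 00000000-0000-0000-0000-000000000000 mygroup myvm
```

The result can be stored in a variable and passed to --resource or --scopes instead of retyping the path.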

95.3 GCP Cloud Monitoring

GCP Monitoring

# ============================================================
# GCP CLOUD MONITORING
# ============================================================

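# List metric descriptors
# (assumes no gcloud command is available for metric descriptors, so this
# calls the Monitoring API directly; PROJECT_ID is a placeholder)
curl -s -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    "https://monitoring.googleapis.com/v3/projects/PROJECT_ID/metricDescriptors"

# Describe metric
curl -s -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    "https://monitoring.googleapis.com/v3/projects/PROJECT_ID/metricDescriptors/compute.googleapis.com/instance/cpu/utilization"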

# Create alerting policy
# (cpu/utilization is reported as a fraction, so 0.8 means 80%;
# PROJECT_ID and CHANNEL_ID are placeholders)
gcloud alpha monitoring policies create \
    --display-name="High CPU Alert" \
    --combiner=OR \
    --condition-display-name="CPU > 80%" \
    --condition-filter='resource.type="gce_instance" AND metric.type="compute.googleapis.com/instance/cpu/utilization"' \
    --if="> 0.8" \
    --duration=300s \
    --notification-channels=projects/PROJECT_ID/notificationChannels/CHANNEL_ID \
    --documentation="CPU usage exceeds 80%"

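# Create uptime check (--period is in minutes, --timeout in seconds;
# PROJECT_ID is a placeholder)
gcloud monitoring uptime create "My App Uptime" \
    --resource-type=uptime-url \
    --resource-labels=host=example.com,project_id=PROJECT_ID \
    --path=/health \
    --timeout=10 \
    --period=1

# List uptime checks (pass/fail results are shown in the console and in the
# monitoring.googleapis.com/uptime_check/check_passed metric)
gcloud monitoring uptime list-configs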
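
The compute.googleapis.com/instance/cpu/utilization metric is reported as a fraction between 0 and 1, so an "80%" alert needs a threshold of 0.8 rather than 80. The conversion can be sketched as below. The helper name pct_to_fraction is hypothetical, and the two-decimal output format is an arbitrary choice.

```shell
# Hypothetical helper: convert a percentage into the fraction form
# used by the GCE cpu/utilization metric.
pct_to_fraction() {
    awk -v pct="$1" 'BEGIN { printf "%.2f\n", pct / 100 }'
}

pct_to_fraction 80
pct_to_fraction 95
```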

95.4 Interview Questions

┌─────────────────────────────────────────────────────────────────────────┐
│                  CLOUD MONITORING INTERVIEW QUESTIONS                   │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│ Q1: What is CloudWatch and what does it monitor?                        │
│                                                                         │
│ A1:                                                                     │
│ - AWS monitoring service                                                │
│ - Collects metrics, logs, events                                        │
│ - EC2, RDS, Lambda, custom metrics                                      │
│ - Alarms, dashboards, insights                                          │
│                                                                         │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│ Q2: How do you create a CloudWatch alarm?                               │
│                                                                         │
│ A2:                                                                     │
│ aws cloudwatch put-metric-alarm \                                       │
│     --alarm-name high-cpu \                                             │
│     --metric-name CPUUtilization \                                      │
│     --namespace AWS/EC2 \                                               │
│     --statistic Average \                                               │
│     --threshold 80 \                                                    │
│     --comparison-operator GreaterThanThreshold \                        │
│     --evaluation-periods 2 \                                            │
│     --period 300                                                        │
│                                                                         │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│ Q3: How do CloudWatch metrics differ from CloudWatch Logs?              │
│                                                                         │
│ A3:                                                                     │
│ - CloudWatch Metrics: Numerical time-series data                        │
│ - CloudWatch Logs: Text-based log events from EC2, Lambda, etc.         │
│ - Logs can be queried with CloudWatch Logs Insights                     │
│                                                                         │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│ Q4: How do you monitor custom applications in CloudWatch?               │
│                                                                         │
│ A4:                                                                     │
│ - Use CloudWatch Agent for OS-level metrics                             │
│ - Publish custom metrics with the PutMetricData API                     │
│ - Use CloudWatch Embedded Metric Format for Lambda                      │
│ - CloudWatch Application Insights for .NET/Java apps                    │
│                                                                         │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│ Q5: What is Azure Application Insights?                                 │
│                                                                         │
│ A5:                                                                     │
│ - Application performance monitoring                                    │
│ - Distributed tracing                                                   │
│ - Live Metrics stream                                                   │
│ - User telemetry and analytics                                          │
│ - Automatic anomaly detection                                           │
│                                                                         │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│ Q6: How does GCP Cloud Monitoring work?                                 │
│                                                                         │
│ A6:                                                                     │
│ - Collects metrics from GCP services                                    │
│ - Queries use PromQL or MQL (Monitoring Query Language)                 │
│ - Supports custom metrics via OpenTelemetry                             │
│ - Integrates with Cloud Logging                                         │
│                                                                         │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│ Q7: What is the difference between metric and log alerts?               │
│                                                                         │
│ A7:                                                                     │
│ - Metric alerts: Threshold-based on numerical data                      │
│ - Log alerts: Query-based on log entries                                │
│ - Metric alerts: Faster evaluation                                      │
│ - Log alerts: More complex conditions                                   │
│                                                                         │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│ Q8: How do you implement distributed tracing?                           │
│                                                                         │
│ A8:                                                                     │
│ - AWS X-Ray for AWS services                                            │
│ - Application Insights for Azure                                        │
│ - Cloud Trace for GCP                                                   │
│ - OpenTelemetry for vendor-agnostic instrumentation                     │
│ - Propagate trace context across service calls                          │
│                                                                         │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│ Q9: How do you monitor costs in the cloud?                              │
│                                                                         │
│ A9:                                                                     │
│ - AWS: Cost Explorer, Budgets, Cost Anomaly Detection                   │
│ - Azure: Cost Management, budget alerts                                 │
│ - GCP: Cloud Billing, Budgets, Recommender                              │
│ - Set budget alerts at 80% and 100% thresholds                          │
│                                                                         │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│ Q10: What is a composite alarm?                                         │
│                                                                         │
│ A10:                                                                    │
│ - Combines multiple alarms with AND/OR logic                            │
│ - Reduces alert fatigue                                                 │
│ - Example: Alert when CPU OR Memory is high                             │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
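
The trace context propagation mentioned in A8 is commonly carried in the W3C Trace Context traceparent header, which OpenTelemetry uses by default. A rough sketch of what a service does with that header is given here. The function names are hypothetical, the header values are sample data, and real services would rely on an OpenTelemetry SDK rather than string handling like this.

```shell
# Sketch of W3C Trace Context propagation (function names are illustrative).
# traceparent format: version-traceid-parentid-flags
#   2 hex chars - 32 hex chars - 16 hex chars - 2 hex chars

# Extract the trace ID from an incoming traceparent value
trace_id_of() {
    echo "$1" | cut -d- -f2
}

# Build the header for an outgoing call: keep the trace ID and flags,
# replace the parent ID with this service's own span ID
child_traceparent() {
    local incoming=$1 span_id=$2
    local flags
    flags=$(echo "$incoming" | cut -d- -f4)
    echo "00-$(trace_id_of "$incoming")-${span_id}-${flags}"
}

incoming="00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
trace_id_of "$incoming"
child_traceparent "$incoming" "b7ad6b7169203331"
```

Because every hop forwards the same trace ID, the tracing backend can stitch spans from separate services into one trace.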

Quick Reference

# AWS CloudWatch
aws cloudwatch put-metric-data --namespace Custom --metric-name X --value Y
aws cloudwatch put-metric-alarm --alarm-name X --metric-name Y --threshold 80

# Azure Monitor
az monitor metrics list --resource /subscriptions/... --metrics "Percentage CPU"
az monitor metrics alert create --name X --resource-group G --scopes /subscriptions/... --condition "avg Percentage CPU > 80"

# GCP Monitoring
gcloud alpha monitoring policies list
gcloud alpha monitoring policies create --display-name="X"

Summary

AWS: CloudWatch for metrics, logs, alarms, dashboards
Azure: Azure Monitor with Metrics, Logs, Application Insights
GCP: Cloud Monitoring with MQL, Cloud Trace, Logging
Monitoring: Essential for availability and performance

Next Chapter

Chapter 96: DevOps Best Practices

Last Updated: February 2026