Chapter 95: Cloud Infrastructure Monitoring

Cloud monitoring underpins application performance, availability, and cost control in cloud environments. This chapter covers the native monitoring services of AWS, Azure, and GCP: metrics collection, alerting, logging, and distributed tracing. Familiarity with these tools is expected in DevOps and SRE roles.


AWS CLOUDWATCH ARCHITECTURE

CloudWatch Components

Metrics:
  - CPU, Network, Disk (built in); Memory (via the CloudWatch agent)
  - Custom application metrics
  - High-resolution metrics (1 second)

Logs:
  - Centralized log storage
  - CloudWatch Logs Insights queries
  - CloudWatch agent for log collection

Alarms:
  - Threshold-based alerts
  - Composite alarms
  - SNS notifications

Dashboards:
  - Custom visualizations
  - Multi-region views

Events (now Amazon EventBridge):
  - Scheduled events
  - Event-driven automation

# ============================================================
# CLOUDWATCH METRICS
# ============================================================
# List available metrics
aws cloudwatch list-metrics --namespace AWS/EC2
# EC2 Detailed Monitoring
# Enable detailed monitoring for 1-minute granularity
aws ec2 monitor-instances --instance-ids i-1234567890abcdef0
# Get metric statistics (hourly data points for one instance)
aws cloudwatch get-metric-statistics \
--namespace AWS/EC2 \
--metric-name CPUUtilization \
--dimensions Name=InstanceId,Value=i-1234567890abcdef0 \
--start-time 2024-01-01T00:00:00Z \
--end-time 2024-01-02T00:00:00Z \
--period 3600 \
--statistics Average Maximum Minimum
# Get instance status checks
aws ec2 describe-instance-status \
--instance-ids i-1234567890abcdef0
# Custom metrics (--timestamp is optional and defaults to the current time;
# if set, it must be no more than two weeks in the past)
aws cloudwatch put-metric-data \
--namespace "Custom/Application" \
--metric-name RequestCount \
--value 100 \
--unit Count
# Metric with dimensions
aws cloudwatch put-metric-data \
--namespace Custom/App \
--metric-name Latency \
--value 45 \
--unit Milliseconds \
--dimensions InstanceId=i-123,Environment=prod
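
A single get-metric-statistics call returns at most 1,440 data points, so the time range and --period have to be chosen together. The following is a minimal sketch of that arithmetic; `datapoint_count` and `check_query` are hypothetical helper names, not AWS CLI commands, and the sketch only estimates counts locally without calling AWS.

```shell
#!/bin/sh
# Hypothetical helpers (not part of the AWS CLI): estimate how many data
# points a get-metric-statistics call would return for a range and period.

MAX_DATAPOINTS=1440   # per-call limit for get-metric-statistics

# datapoint_count RANGE_SECONDS PERIOD_SECONDS
# Prints the number of periods needed to cover the range (rounded up).
datapoint_count() {
    range=$1
    period=$2
    echo $(( (range + period - 1) / period ))
}

# check_query RANGE_SECONDS PERIOD_SECONDS
# Prints "ok" if the request fits in one call, "too-many" otherwise.
check_query() {
    count=$(datapoint_count "$1" "$2")
    if [ "$count" -le "$MAX_DATAPOINTS" ]; then
        echo "ok"
    else
        echo "too-many"
    fi
}

# One day at 1-hour resolution: 24 points.
datapoint_count 86400 3600
# One day at 1-minute resolution (detailed monitoring): exactly at the limit.
check_query 86400 60
# One week at 1-minute resolution: over the limit, so widen the period
# or split the range across several calls.
check_query 604800 60
```

When a request is over the limit, the usual options are a larger --period or several calls over shorter ranges.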
# ============================================================
# CLOUDWATCH ALARMS
# ============================================================
# Create alarm
aws cloudwatch put-metric-alarm \
--alarm-name high-cpu \
--alarm-description "CPU > 80%" \
--metric-name CPUUtilization \
--namespace AWS/EC2 \
--statistic Average \
--period 300 \
--threshold 80 \
--comparison-operator GreaterThanThreshold \
--evaluation-periods 2 \
--alarm-actions arn:aws:sns:us-east-1:123456789012:alerts \
--dimensions Name=InstanceId,Value=i-1234567890abcdef0
# Create composite alarm (assumes the alarms high-cpu and high-memory already exist)
aws cloudwatch put-composite-alarm \
--alarm-name "HighCPUOrMemory" \
--alarm-rule 'ALARM("high-cpu") OR ALARM("high-memory")' \
--alarm-actions arn:aws:sns:us-east-1:123456789012:alerts
# View alarms
aws cloudwatch describe-alarms
# Alarm history
aws cloudwatch describe-alarm-history \
--alarm-name high-cpu
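
With --period 300 and --evaluation-periods 2, the high-cpu alarm fires only after two consecutive 5-minute averages exceed the threshold, which filters out short spikes. The sketch below imitates that rule locally. It is a simplification under stated assumptions: `alarm_state` is a hypothetical helper, values are integers, and CloudWatch's missing-data handling and "M out of N" datapoint settings are ignored.

```shell
#!/bin/sh
# alarm_state THRESHOLD EVAL_PERIODS VALUE...
# Hypothetical helper: prints ALARM if each of the last EVAL_PERIODS values
# is greater than THRESHOLD, INSUFFICIENT_DATA if there are too few values,
# and OK otherwise. Values are integers for simplicity.
alarm_state() {
    threshold=$1
    periods=$2
    shift 2
    if [ "$#" -lt "$periods" ]; then
        echo "INSUFFICIENT_DATA"
        return
    fi
    # Drop older values so only the evaluation window remains.
    while [ "$#" -gt "$periods" ]; do
        shift
    done
    for value in "$@"; do
        if [ "$value" -le "$threshold" ]; then
            echo "OK"
            return
        fi
    done
    echo "ALARM"
}

alarm_state 80 2 70 85 90   # last two periods breach the threshold
alarm_state 80 2 85 90 75   # latest period has recovered
alarm_state 80 2 95         # only one period of data so far
```

Raising the number of evaluation periods trades detection speed for fewer false alarms.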
# ============================================================
# CLOUDWATCH LOGS
# ============================================================
# Create log group
aws logs create-log-group --log-group-name /aws/ec2/myapp
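# Create log stream (must exist before events can be written to it)
aws logs create-log-stream \
--log-group-name /aws/ec2/myapp \
--log-stream-name app
# Put log events (timestamp in epoch milliseconds; stale timestamps are rejected)
aws logs put-log-events \
--log-group-name /aws/ec2/myapp \
--log-stream-name app \
--log-events "[{\"timestamp\":$(date +%s)000,\"message\":\"Application started\"}]"
# Query logs
aws logs filter-log-events \
--log-group-name /aws/ec2/myapp \
--filter-pattern "ERROR"
# Insights query (start and end times in epoch seconds; returns a queryId)
aws logs start-query \
--log-group-name /aws/ec2/myapp \
--start-time 1640995200 \
--end-time 1641081600 \
--query-string "fields @timestamp, @message | filter @message like /ERROR/ | sort @timestamp desc | limit 20"
# Fetch results once the query completes (QUERY_ID is the queryId printed by start-query)
aws logs get-query-results --query-id "$QUERY_ID"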
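
CloudWatch Logs works with two timestamp units: log events carry epoch milliseconds, while Logs Insights start-query takes its time range in epoch seconds. A minimal sketch of helpers for moving between the two follows; the function names are hypothetical and nothing here calls AWS.

```shell
#!/bin/sh
# Hypothetical helpers for the two timestamp units CloudWatch Logs uses.

# now_ms -> current time in epoch milliseconds (unit used by log events).
now_ms() {
    echo "$(( $(date +%s) * 1000 ))"
}

# ms_to_s MILLISECONDS -> whole epoch seconds (unit used by start-query).
ms_to_s() {
    echo "$(( $1 / 1000 ))"
}

# s_to_ms SECONDS -> epoch milliseconds.
s_to_ms() {
    echo "$(( $1 * 1000 ))"
}

# query_window HOURS -> prints "START END" in epoch seconds, ending now.
query_window() {
    end=$(date +%s)
    start=$(( end - $1 * 3600 ))
    echo "$start $end"
}

ms_to_s 1641081600000   # 1641081600
s_to_ms 1640995200      # 1640995200000
query_window 24         # start and end of the last 24 hours, in seconds
```

The two values printed by `query_window` can be passed as --start-time and --end-time for a "last N hours" Insights query.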
# ============================================================
# CLOUDWATCH AGENT
# ============================================================
# Install
wget https://s3.amazonaws.com/amazoncloudwatch-agent/amazon_linux/amd64/latest/amazon-cloudwatch-agent.rpm
sudo rpm -U ./amazon-cloudwatch-agent.rpm
# Configure via wizard
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-config-wizard
# Manual configuration (run_as_user belongs in the "agent" section of the agent config)
sudo tee /opt/aws/amazon-cloudwatch-agent/etc/cloudwatch-agent.json > /dev/null << 'EOF'
{
  "agent": {
    "run_as_user": "root"
  },
  "metrics": {
    "namespace": "Custom/Metrics",
    "metrics_collected": {
      "cpu": {
        "measurement": ["cpu_usage_idle", "cpu_usage_user"],
        "metrics_collection_interval": 60
      },
      "mem": {
        "measurement": ["mem_used", "mem_available"],
        "metrics_collection_interval": 60
      },
      "disk": {
        "measurement": ["disk_used_percent"],
        "metrics_collection_interval": 60,
        "resources": ["/"]
      }
    }
  },
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "/var/log/myapp/*.log",
            "log_group_name": "/var/log/myapp",
            "log_stream_name": "{instance_id}"
          }
        ]
      }
    }
  }
}
EOF
# Load the configuration and start the agent
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
-a fetch-config -m ec2 -s \
-c file:/opt/aws/amazon-cloudwatch-agent/etc/cloudwatch-agent.json
# Start on boot and check status
sudo systemctl enable amazon-cloudwatch-agent
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a status

AZURE MONITOR COMPONENTS

Metrics: Azure Monitor Metrics
  - Platform metrics (built-in)
  - Custom metrics
  - Guest OS metrics (via agent)

Logs:
  - Log Analytics
  - KQL queries
  - Application Insights

Alerts:
  - Metric alerts
  - Log alerts
  - Smart detection

Application Monitoring:
  - Application Insights
  - Distributed tracing
  - Live Metrics

# ============================================================
# AZURE MONITOR
# ============================================================
# Get metrics (metric names are a space-separated list)
az monitor metrics list \
--resource /subscriptions/xxx/resourceGroups/mygroup/providers/Microsoft.Compute/virtualMachines/myvm \
--metrics "Percentage CPU" "Available Memory Bytes"
# Create metric alert (--scopes names the resource to watch)
az monitor metrics alert create \
--name cpu-alert \
--resource-group mygroup \
--scopes /subscriptions/xxx/resourceGroups/mygroup/providers/Microsoft.Compute/virtualMachines/myvm \
--condition "avg Percentage CPU > 80" \
--description "CPU usage high" \
--evaluation-frequency 1m \
--window-size 5m
# Create log alert (scheduled query rule; replaces the classic "az monitor alert" commands)
az monitor scheduled-query create \
--name error-alert \
--resource-group mygroup \
--scopes "/subscriptions/xxx/..." \
--condition "count 'Placeholder_1' > 0" \
--condition-query Placeholder_1="Syslog | where SeverityLevel == 'error'" \
--description "Error detected"
# Log Analytics
az monitor log-analytics workspace create \
--resource-group mygroup \
--workspace-name myworkspace
# Query logs (--workspace takes the workspace GUID, not its name;
# --query is reserved for JMESPath output filtering)
az monitor log-analytics query \
--workspace WORKSPACE_GUID \
--analytics-query "Syslog | where TimeGenerated > ago(1h) | summarize count() by SeverityLevel"
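
Most Azure Monitor commands identify their target by a full ARM resource ID, which is easy to mistype by hand. As a small sketch, the ID for a virtual machine can be assembled from its parts; `vm_resource_id` is a hypothetical helper and the subscription ID shown is a placeholder.

```shell
#!/bin/sh
# vm_resource_id SUBSCRIPTION_ID RESOURCE_GROUP VM_NAME
# Hypothetical helper: prints the ARM resource ID of a virtual machine.
vm_resource_id() {
    printf '/subscriptions/%s/resourceGroups/%s/providers/Microsoft.Compute/virtualMachines/%s\n' \
        "$1" "$2" "$3"
}

# Placeholder values; substitute a real subscription ID.
VM_ID=$(vm_resource_id 00000000-0000-0000-0000-000000000000 mygroup myvm)
echo "$VM_ID"

# Pass the result wherever a full resource ID is expected,
# for example: --resource "$VM_ID"
```

Building the ID once and reusing the variable keeps the metric and alert commands pointed at the same resource.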

# ============================================================
# GCP CLOUD MONITORING
# ============================================================
# List metric descriptors via the Monitoring API (PROJECT_ID is a placeholder)
curl -s -H "Authorization: Bearer $(gcloud auth print-access-token)" \
"https://monitoring.googleapis.com/v3/projects/PROJECT_ID/metricDescriptors?pageSize=20"
# Describe one metric descriptor
curl -s -H "Authorization: Bearer $(gcloud auth print-access-token)" \
"https://monitoring.googleapis.com/v3/projects/PROJECT_ID/metricDescriptors/compute.googleapis.com/instance/cpu/utilization"
# Create alerting policy (CPU utilization is a 0-1 fraction, so 0.8 means 80%)
gcloud alpha monitoring policies create \
--display-name="High CPU Alert" \
--condition-display-name="CPU > 80%" \
--condition-filter="resource.type=\"gce_instance\" AND metric.type=\"compute.googleapis.com/instance/cpu/utilization\"" \
--if="> 0.8" \
--duration=300s \
--combiner=OR \
--notification-channels=projects/PROJECT_ID/notificationChannels/CHANNEL_ID \
--documentation="CPU usage exceeds 80%"
# Create uptime check (timeout in seconds, period in minutes)
gcloud monitoring uptime create "My App Uptime" \
--resource-type=uptime-url \
--resource-labels=host=example.com,project_id=PROJECT_ID \
--path=/health \
--timeout=10 \
--period=1
# List uptime checks
gcloud monitoring uptime list-configs
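
The alerting policy above uses 0.8 for an 80% threshold because compute.googleapis.com/instance/cpu/utilization is reported as a fraction between 0 and 1, unlike the 0 to 100 percentages used by CloudWatch and Azure Monitor. A minimal sketch of the conversion follows; `percent_to_fraction` and `breaches` are hypothetical helper names, and the sketch assumes `awk` is available.

```shell
#!/bin/sh
# percent_to_fraction PERCENT
# Hypothetical helper: converts a percentage such as 80 into the 0-1
# fraction used by compute.googleapis.com/instance/cpu/utilization.
percent_to_fraction() {
    awk -v p="$1" 'BEGIN { printf "%.2f\n", p / 100 }'
}

# breaches FRACTION_VALUE PERCENT_THRESHOLD
# Prints "yes" if the observed fraction is above the percentage threshold.
breaches() {
    awk -v v="$1" -v p="$2" 'BEGIN { if (v > p / 100) print "yes"; else print "no" }'
}

percent_to_fraction 80   # 0.80
breaches 0.93 80         # yes
breaches 0.42 80         # no
```

Porting a threshold from another cloud without this conversion yields a policy that can never fire, since utilization rarely exceeds 1.0.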

CLOUD MONITORING INTERVIEW QUESTIONS

Q1: What is CloudWatch and what does it monitor?
A1:
  - AWS monitoring service
  - Collects metrics, logs, events
  - EC2, RDS, Lambda, custom metrics
  - Alarms, dashboards, insights

Q2: How do you create a CloudWatch alarm?
A2:
  aws cloudwatch put-metric-alarm \
    --alarm-name high-cpu \
    --metric-name CPUUtilization \
    --namespace AWS/EC2 \
    --statistic Average \
    --period 300 \
    --threshold 80 \
    --comparison-operator GreaterThanThreshold \
    --evaluation-periods 2

Q3: What's the difference between CloudWatch and CloudWatch Logs?
A3:
  - CloudWatch Metrics: Numerical time-series data
  - CloudWatch Logs: Text-based log events from EC2, Lambda, etc.
  - Logs can be queried with CloudWatch Logs Insights

Q4: How do you monitor custom applications in CloudWatch?
A4:
  - Use the CloudWatch agent for OS-level metrics
  - Publish custom metrics via the put-metric-data API
  - Use CloudWatch Embedded Metric Format for Lambda
  - CloudWatch Application Insights for .NET/Java apps

Q5: What is Azure Application Insights?
A5:
  - Application performance monitoring
  - Distributed tracing
  - Live Metrics stream
  - User telemetry and analytics
  - Automatic anomaly detection (smart detection)

Q6: How does GCP Cloud Monitoring work?
A6:
  - Collects metrics from GCP services
  - Queried with PromQL or MQL (Monitoring Query Language)
  - Supports custom metrics via OpenTelemetry
  - Integrates with Cloud Logging

Q7: What is the difference between metric and log alerts?
A7:
  - Metric alerts: Threshold-based on numerical data
  - Log alerts: Query-based on log entries
  - Metric alerts: Faster evaluation
  - Log alerts: More complex conditions

Q8: How do you implement distributed tracing?
A8:
  - AWS X-Ray for AWS services
  - Application Insights for Azure
  - Cloud Trace for GCP
  - OpenTelemetry for vendor-agnostic instrumentation
  - Trace context propagation through services

Q9: How do you monitor costs in cloud?
A9:
  - AWS: Cost Explorer, Budgets, Cost Anomaly Detection
  - Azure: Cost Management, budget alerts
  - GCP: Cloud Billing, Budgets, Recommender
  - Set budget alerts at 80% and 100% thresholds

Q10: What is a composite alarm?
A10:
  - Combines multiple alarms with AND/OR logic
  - Reduces alert fatigue
  - Example: Alert when CPU OR Memory is high
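
The trace context propagation mentioned in Q8 can be illustrated with the W3C `traceparent` header, which OpenTelemetry uses by default: each service keeps the incoming trace ID and generates a new span ID before calling the next service. The sketch below is an illustration only; the helper names and the URL in the final comment are hypothetical, and real services would normally rely on an OpenTelemetry SDK instead of hand-built headers.

```shell
#!/bin/sh
# Sketch of W3C Trace Context propagation using hypothetical helper names.

# rand_hex BYTES -> prints BYTES random bytes as lowercase hex.
rand_hex() {
    od -An -N"$1" -tx1 /dev/urandom | tr -d ' \n'
}

# new_traceparent -> starts a trace: version 00, new trace ID,
# new span ID, sampled flag 01.
new_traceparent() {
    printf '00-%s-%s-01\n' "$(rand_hex 16)" "$(rand_hex 8)"
}

# child_traceparent PARENT -> keeps the trace ID and flags, replaces the span ID.
child_traceparent() {
    trace_id=$(echo "$1" | cut -d- -f2)
    flags=$(echo "$1" | cut -d- -f4)
    printf '00-%s-%s-%s\n' "$trace_id" "$(rand_hex 8)" "$flags"
}

incoming=$(new_traceparent)
outgoing=$(child_traceparent "$incoming")
echo "incoming: $incoming"
echo "outgoing: $outgoing"
# A downstream call would forward the header, for example (hypothetical URL):
#   curl -H "traceparent: $outgoing" http://service-b.internal/orders
```

Because the trace ID is unchanged across hops, the tracing backend can stitch the spans from every service into one end-to-end trace.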

# AWS CloudWatch
aws cloudwatch put-metric-data --namespace Custom --metric-name X --value Y
aws cloudwatch put-metric-alarm --alarm-name X --metric-name Y --threshold 80
# Azure Monitor
az monitor metrics list --resource /subscriptions/... --metrics "Percentage CPU"
az monitor metrics alert create --name X --scopes /subscriptions/... --condition "avg Percentage CPU > 80"
# GCP Monitoring
gcloud alpha monitoring policies list
gcloud alpha monitoring policies create --display-name="X"

  • AWS: CloudWatch for metrics, logs, alarms, dashboards
  • Azure: Azure Monitor with Metrics, Logs, Application Insights
  • GCP: Cloud Monitoring with MQL, Cloud Trace, Logging
  • Monitoring: Essential for availability and performance

Last Updated: February 2026