Chapter 95: Cloud Infrastructure Monitoring
Overview
Cloud monitoring is essential for maintaining application performance, availability, and cost optimization in cloud environments. This chapter covers monitoring solutions for AWS, Azure, and GCP, including metrics collection, alerting, logging, and distributed tracing. Understanding cloud-native monitoring tools is critical for DevOps and SRE roles.
95.1 AWS CloudWatch
CloudWatch Overview
AWS CLOUDWATCH ARCHITECTURE

CloudWatch Components

  Metrics:
    - CPU, Memory, Network, Disk
    - Custom application metrics
    - High-resolution metrics (1 second)

  Logs:
    - Centralized log storage
    - Log insights queries
    - CloudWatch Logs Agent

  Alarms:
    - Threshold-based alerts
    - Composite alarms
    - SNS notifications

  Dashboards:
    - Custom visualizations
    - Multi-region views

  Events:
    - Scheduled events
    - Event-driven automation

Core Metrics and Alarms
# ============================================================
# CLOUDWATCH METRICS
# ============================================================

# List available metrics
aws cloudwatch list-metrics --namespace AWS/EC2

# EC2 Detailed Monitoring
# Enable detailed monitoring for 1-minute granularity
aws ec2 monitor-instances --instance-ids i-1234567890abcdef0

# Get metric statistics (statistics are passed as a space-separated list)
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --start-time 2024-01-01T00:00:00Z \
  --end-time 2024-01-02T00:00:00Z \
  --period 3600 \
  --statistics Average Maximum Minimum
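
A single get-metric-statistics call returns at most 1,440 datapoints, so it can help to estimate the count from the time range and period before sending the request. The following is a minimal sketch under that assumption; the helper name datapoint_count and the hard-coded epoch values are illustrative and not part of the AWS CLI.

```shell
#!/bin/sh
# Sketch: estimate how many datapoints a get-metric-statistics call asks for.
# datapoint_count START_EPOCH END_EPOCH PERIOD_SECONDS (hypothetical helper)
datapoint_count() {
  echo $(( ($2 - $1) / $3 ))
}

# 2024-01-01T00:00:00Z to 2024-01-02T00:00:00Z expressed as epoch seconds
START=1704067200
END=1704153600

COUNT=$(datapoint_count "$START" "$END" 3600)
echo "datapoints: $COUNT"

# Requests that would return more than 1,440 datapoints are rejected
if [ "$COUNT" -gt 1440 ]; then
  echo "too many datapoints: widen --period or shorten the time range"
fi
```

With a one-day range and a 3600-second period the request stays far below the limit; a 60-second period over the same day sits exactly at it.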

# Get EC2 instance status checks
aws ec2 describe-instance-status \
  --instance-ids i-1234567890abcdef0

# Custom metrics
aws cloudwatch put-metric-data \
  --namespace "Custom/Application" \
  --metric-name RequestCount \
  --value 100 \
  --timestamp 2024-01-01T00:00:00Z

# Metric with dimensions
aws cloudwatch put-metric-data \
  --namespace Custom/App \
  --metric-name Latency \
  --value 45 \
  --timestamp 2024-01-01T00:00:00Z \
  --dimensions InstanceId=i-123,Environment=prod

CloudWatch Alarms
# ============================================================
# CLOUDWATCH ALARMS
# ============================================================

# Create alarm
aws cloudwatch put-metric-alarm \
  --alarm-name high-cpu \
  --alarm-description "CPU > 80%" \
  --metric-name CPUUtilization \
  --namespace AWS/EC2 \
  --statistic Average \
  --period 300 \
  --threshold 80 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 2 \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:alerts \
  --dimensions Name=InstanceId,Value=i-1234567890abcdef0
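
As a rough mental model of how --threshold, --comparison-operator GreaterThanThreshold and --evaluation-periods interact, the sketch below decides a state from the most recent datapoints. It is a simplification under stated assumptions: values are integers, every period has a datapoint, and all evaluated periods must breach. The function name alarm_state is hypothetical, and real CloudWatch behaviour also depends on settings such as missing-data treatment.

```shell
#!/bin/sh
# Sketch: decide an alarm state from recent datapoints (oldest first).
# alarm_state THRESHOLD EVALUATION_PERIODS VALUE...   (hypothetical helper)
# Prints ALARM when the last EVALUATION_PERIODS values are all above
# THRESHOLD, INSUFFICIENT_DATA when there are too few values, else OK.
alarm_state() {
  as_threshold=$1
  as_periods=$2
  shift 2
  if [ "$#" -lt "$as_periods" ]; then
    echo "INSUFFICIENT_DATA"
    return 0
  fi
  # Drop older datapoints so only the evaluation window remains
  while [ "$#" -gt "$as_periods" ]; do
    shift
  done
  for as_value in "$@"; do
    # GreaterThanThreshold: a value equal to the threshold does not breach
    if [ "$as_value" -le "$as_threshold" ]; then
      echo "OK"
      return 0
    fi
  done
  echo "ALARM"
}

alarm_state 80 2 75 85 91   # last two datapoints breach
alarm_state 80 2 85 91 78   # most recent datapoint recovered
```

With a 300-second period and two evaluation periods, this corresponds to CPU staying above 80% for roughly ten minutes before the alarm fires.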

# Create composite alarm (child alarms are referenced with the ALARM() function)
aws cloudwatch put-composite-alarm \
  --alarm-name "HighCPUOrMemory" \
  --alarm-rule 'ALARM("high-cpu") OR ALARM("high-memory")' \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:alerts
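
A composite alarm does not look at metrics itself; it only combines the states of its child alarms. The sketch below shows that idea for OR and AND rules. The helper names composite_or and composite_and and the two state variables are hypothetical, and the real rule language supports more functions than shown here.

```shell
#!/bin/sh
# Sketch: combine child alarm states the way an OR or AND rule would.
# composite_or STATE...  -> ALARM if any child is in ALARM, else OK
composite_or() {
  for co_state in "$@"; do
    if [ "$co_state" = "ALARM" ]; then
      echo "ALARM"
      return 0
    fi
  done
  echo "OK"
}

# composite_and STATE... -> ALARM only if every child is in ALARM
composite_and() {
  if [ "$#" -eq 0 ]; then
    echo "OK"
    return 0
  fi
  for ca_state in "$@"; do
    if [ "$ca_state" != "ALARM" ]; then
      echo "OK"
      return 0
    fi
  done
  echo "ALARM"
}

CPU_STATE="ALARM"     # hypothetical state of the high-cpu alarm
MEMORY_STATE="OK"     # hypothetical state of the high-memory alarm
echo "OR rule:  $(composite_or "$CPU_STATE" "$MEMORY_STATE")"
echo "AND rule: $(composite_and "$CPU_STATE" "$MEMORY_STATE")"
```

An OR rule notifies as soon as either resource is under pressure, while an AND rule is the usual choice for cutting noise because it fires only when several symptoms coincide.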

# View alarms
aws cloudwatch describe-alarms

# Alarm history
aws cloudwatch describe-alarm-history \
  --alarm-name high-cpu

CloudWatch Logs
# ============================================================
# CLOUDWATCH LOGS
# ============================================================

# Create log group
aws logs create-log-group --log-group-name /aws/ec2/myapp

# Create log stream (it must exist before events can be written to it)
aws logs create-log-stream \
  --log-group-name /aws/ec2/myapp \
  --log-stream-name app

# Put log events (timestamp is in milliseconds since the Unix epoch)
aws logs put-log-events \
  --log-group-name /aws/ec2/myapp \
  --log-stream-name app \
  --log-events '[{"timestamp":1234567890000,"message":"Application started"}]'

# Query logs
aws logs filter-log-events \
  --log-group-name /aws/ec2/myapp \
  --filter-pattern "ERROR"
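
The log commands in this section use two different timestamp units: put-log-events expects milliseconds since the Unix epoch, while start-query takes --start-time and --end-time in seconds. A small sketch for deriving both from one clock reading follows; the helper names seconds_to_millis and hours_ago and the variable names are illustrative only.

```shell
#!/bin/sh
# Sketch: derive both timestamp units from one clock reading.
NOW_S=$(date +%s)                 # seconds since the Unix epoch

# seconds_to_millis SECONDS -> milliseconds, as used in log event timestamps
seconds_to_millis() {
  echo $(( $1 * 1000 ))
}

# hours_ago NOW_SECONDS HOURS -> epoch seconds HOURS before NOW_SECONDS
hours_ago() {
  echo $(( $1 - $2 * 3600 ))
}

EVENT_TS_MS=$(seconds_to_millis "$NOW_S")   # for put-log-events
QUERY_START_S=$(hours_ago "$NOW_S" 24)      # for start-query --start-time
QUERY_END_S=$NOW_S                          # for start-query --end-time

echo "event timestamp (ms): $EVENT_TS_MS"
echo "query window (s):     $QUERY_START_S to $QUERY_END_S"
```

Computing the values at run time also avoids hard-coded timestamps that fall outside the range the service accepts for new log events.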

# Insights query (start and end times are in seconds since the Unix epoch)
aws logs start-query \
  --log-group-name /aws/ec2/myapp \
  --start-time 1640995200 \
  --end-time 1641081600 \
  --query-string "fields @timestamp, @message | filter @message like /ERROR/ | sort @timestamp desc | limit 20"

CloudWatch Agent
# ============================================================
# CLOUDWATCH AGENT
# ============================================================

# Install
wget https://s3.amazonaws.com/amazoncloudwatch-agent/amazon_linux/amd64/latest/amazon-cloudwatch-agent.rpm
sudo rpm -U ./amazon-cloudwatch-agent.rpm

# Configure via wizard
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-config-wizard

# Manual configuration
# run_as_user is set in the "agent" section of the agent configuration file;
# sudo tee is used because the target directory is owned by root
sudo tee /opt/aws/amazon-cloudwatch-agent/etc/cloudwatch-agent.json > /dev/null << 'EOF'
{
  "agent": {
    "run_as_user": "root"
  },
  "metrics": {
    "namespace": "Custom/Metrics",
    "metrics_collected": {
      "cpu": {
        "measurement": ["cpu_usage_idle", "cpu_usage_user"],
        "metrics_collection_interval": 60
      },
      "mem": {
        "measurement": ["mem_used", "mem_available"],
        "metrics_collection_interval": 60
      },
      "disk": {
        "measurement": ["disk_used_percent"],
        "metrics_collection_interval": 60,
        "resources": ["/"]
      }
    }
  },
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "/var/log/myapp/*.log",
            "log_group_name": "/var/log/myapp",
            "log_stream_name": "{instance_id}"
          }
        ]
      }
    }
  }
}
EOF
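
# Load the configuration file into the agent
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
  -a fetch-config -m ec2 -s \
  -c file:/opt/aws/amazon-cloudwatch-agent/etc/cloudwatch-agent.json

# Start agent
sudo systemctl start amazon-cloudwatch-agent
sudo systemctl enable amazon-cloudwatch-agent

95.2 Azure Monitor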
Azure Monitoring Components
AZURE MONITOR COMPONENTS

  Metrics: Azure Monitor Metrics
    - Platform metrics (built-in)
    - Custom metrics
    - Guest OS metrics (via agent)

  Logs:
    - Log Analytics
    - KQL queries
    - Application Insights

  Alerts:
    - Metric alerts
    - Log alerts
    - Smart detection

  Application Monitoring:
    - Application Insights
    - Distributed tracing
    - Live Metrics

Azure Monitoring Commands
# ============================================================
# AZURE MONITOR
# ============================================================

# Get metrics (metric names are passed as a space-separated list)
az monitor metrics list \
  --resource /subscriptions/xxx/resourceGroups/mygroup/providers/Microsoft.Compute/virtualMachines/myvm \
  --metrics "Percentage CPU" "Available Memory Bytes"
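
# Create metric alert (--scopes names the resource the alert watches)
az monitor metrics alert create \
  --name cpu-alert \
  --resource-group mygroup \
  --scopes /subscriptions/xxx/resourceGroups/mygroup/providers/Microsoft.Compute/virtualMachines/myvm \
  --condition "avg Percentage CPU > 80" \
  --description "CPU usage high" \
  --evaluation-frequency 1m \
  --window-size 5m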
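
With --evaluation-frequency 1m and --window-size 5m, the condition "avg Percentage CPU > 80" is checked every minute against the average of the preceding five minutes. The sketch below illustrates one such evaluation; it assumes one integer sample per minute, and the function name window_avg_breaches is hypothetical rather than anything provided by the Azure CLI.

```shell
#!/bin/sh
# Sketch: evaluate "avg > threshold" over one window of integer samples.
# window_avg_breaches THRESHOLD SAMPLE...   (hypothetical helper)
# Prints "fired" when the window average is above THRESHOLD, else "ok".
window_avg_breaches() {
  wa_threshold=$1
  shift
  wa_count=$#
  if [ "$wa_count" -eq 0 ]; then
    echo "ok"
    return 0
  fi
  wa_sum=0
  for wa_sample in "$@"; do
    wa_sum=$(( wa_sum + wa_sample ))
  done
  # Compare the sum with threshold * count to avoid integer division rounding
  if [ "$wa_sum" -gt $(( wa_threshold * wa_count )) ]; then
    echo "fired"
  else
    echo "ok"
  fi
}

# Five one-minute samples make up a 5m window
window_avg_breaches 80 70 85 90 95 88   # average 85.6
window_avg_breaches 80 60 95 70 75 80   # average 76
```

The second call shows why a window average is less noisy than a single sample: one spike to 95 does not fire the alert when the rest of the window is calm.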
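
# Create log alert (uses the scheduled-query CLI extension;
# ErrorQuery is a placeholder name that links the condition to its query)
az monitor scheduled-query create \
  --name error-alert \
  --resource-group mygroup \
  --scopes "/subscriptions/xxx/..." \
  --condition "count 'ErrorQuery' > 0" \
  --condition-query ErrorQuery="Syslog | where SeverityLevel == 'error'" \
  --description "Error detected"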

# Log Analytics
az monitor log-analytics workspace create \
  --resource-group mygroup \
  --workspace-name myworkspace
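
# Query logs (--workspace expects the workspace ID, a GUID;
# the KQL goes in --analytics-query because --query is the CLI's JMESPath option)
az monitor log-analytics query \
  --workspace myworkspace \
  --analytics-query "Syslog | where TimeGenerated > ago(1h) | summarize count() by SeverityLevel"

95.3 GCP Cloud Monitoring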
GCP Monitoring
# ============================================================
# GCP CLOUD MONITORING
# ============================================================
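
# List metrics (metric descriptors are exposed through the Monitoring API;
# PROJECT_ID is a placeholder for your project ID)
curl -s -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  "https://monitoring.googleapis.com/v3/projects/PROJECT_ID/metricDescriptors"

# Describe metric
curl -s -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  "https://monitoring.googleapis.com/v3/projects/PROJECT_ID/metricDescriptors/compute.googleapis.com/instance/cpu/utilization"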

# Create alerting policy (the threshold is given with --if, the duration with --duration)
gcloud alpha monitoring policies create \
  --display-name="High CPU Alert" \
  --combiner=OR \
  --condition-display-name="CPU > 80%" \
  --condition-filter='resource.type="gce_instance" AND metric.type="compute.googleapis.com/instance/cpu/utilization"' \
  --if="> 0.8" \
  --duration=300s \
  --notification-channels=channels \
  --documentation="CPU usage exceeds 80%"
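
The compute.googleapis.com/instance/cpu/utilization metric is reported as a fraction, typically between 0.0 and 1.0, which is why an 80% alert uses a threshold of 0.8 rather than 80. A tiny sketch of the conversion follows; the helper name percent_to_fraction is illustrative.

```shell
#!/bin/sh
# Sketch: convert a percentage into the fractional form the metric uses.
# percent_to_fraction PERCENT -> fraction with two decimals (hypothetical helper)
percent_to_fraction() {
  awk -v p="$1" 'BEGIN { printf "%.2f\n", p / 100 }'
}

THRESHOLD=$(percent_to_fraction 80)
echo "threshold value for an 80% CPU alert: $THRESHOLD"
```

Passing 80 directly as the threshold would describe a condition that a fractional metric can never reach, so the policy would stay silent.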

# Create uptime check (PROJECT_ID is a placeholder;
# --timeout is in seconds and --period in minutes)
gcloud monitoring uptime create "My App Uptime" \
  --resource-type=uptime-url \
  --resource-labels=host=example.com,project_id=PROJECT_ID \
  --path=/health \
  --timeout=10 \
  --period=1

# View uptime check configurations
gcloud monitoring uptime list-configs

95.4 Interview Questions
CLOUD MONITORING INTERVIEW QUESTIONS

Q1: What is CloudWatch and what does it monitor?
A1:
- AWS monitoring service
- Collects metrics, logs, events
- EC2, RDS, Lambda, custom metrics
- Alarms, dashboards, insights

Q2: How do you create a CloudWatch alarm?
A2:
aws cloudwatch put-metric-alarm \
  --alarm-name high-cpu \
  --metric-name CPUUtilization \
  --namespace AWS/EC2 \
  --statistic Average \
  --period 300 \
  --threshold 80 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 2

Q3: What's the difference between CloudWatch and CloudWatch Logs?
A3:
- CloudWatch Metrics: Numerical time-series data
- CloudWatch Logs: Text-based log files from EC2, Lambda, etc.
- Logs can be queried with CloudWatch Logs Insights

Q4: How do you monitor custom applications in CloudWatch?
A4:
- Use CloudWatch Agent for OS-level metrics
- Put custom metrics via put-metric-data API
- Use CloudWatch Embedded Metric Format for Lambda
- Application Insights for .NET/Java apps

Q5: What is Azure Application Insights?
A5:
- Application performance monitoring
- Distributed tracing
- Live Metrics stream
- User telemetry and analytics
- Auto-detect anomalies

Q6: How does GCP Cloud Monitoring work?
A6:
- Collects metrics from GCP services
- Uses MQL (Monitoring Query Language) or PromQL for queries
- Supports custom metrics via OpenTelemetry
- Integrates with Cloud Logging

Q7: What is the difference between metric and log alerts?
A7:
- Metric alerts: Threshold-based on numerical data
- Log alerts: Query-based on log entries
- Metric alerts: Faster evaluation
- Log alerts: More complex conditions

Q8: How do you implement distributed tracing?
A8:
- AWS X-Ray for AWS services
- Application Insights for Azure
- Cloud Trace for GCP
- OpenTelemetry for vendor-agnostic
- Trace context propagation through services

Q9: How do you monitor costs in cloud?
A9:
- AWS: Cost Explorer, Budgets, Cost Anomaly Detection
- Azure: Cost Management, Budgets alerts
- GCP: Cloud Billing, Budgets, Recommender
- Set budget alerts at 80%, 100% thresholds

Q10: What is a composite alarm?
A10:
- Combines multiple alarms with AND/OR logic
- Reduces alert fatigue
- Example: Alert when CPU OR Memory is high

Quick Reference
# AWS CloudWatch
aws cloudwatch put-metric-data --namespace Custom --metric-name X --value Y
aws cloudwatch put-metric-alarm --alarm-name X --metric-name Y --threshold 80

# Azure Monitor
az monitor metrics list --resource /subscriptions/... --metrics "Percentage CPU"
az monitor metrics alert create --name X --condition "avg Percentage CPU > 80"

# GCP Monitoring
gcloud alpha monitoring policies list
gcloud alpha monitoring policies create --display-name="X"

Summary
- AWS: CloudWatch for metrics, logs, alarms, dashboards
- Azure: Azure Monitor with Metrics, Logs, Application Insights
- GCP: Cloud Monitoring with MQL, Cloud Trace, Logging
- Monitoring: Essential for availability and performance
Next Chapter
Chapter 96: DevOps Best Practices
Last Updated: February 2026