Chapter 36: Amazon CloudWatch - Metrics & Alarms
Monitoring and Observability

36.1 Overview

Amazon CloudWatch is a monitoring and observability service that provides data and actionable insights for AWS, hybrid, and on-premises applications.
Amazon CloudWatch consists of four main components:

- Metrics: system metrics, custom metrics, dashboards
- Alarms: thresholds, actions, composite alarms
- Logs: log streams, queries, Logs Insights
- Events: rules, targets, schedules

Key Features

| Feature | Description |
|---|---|
| Metrics | System and custom metrics |
| Alarms | Threshold-based notifications |
| Logs | Log aggregation and analysis |
| Events | Event-driven automation (now provided through Amazon EventBridge) |
| Dashboards | Visual monitoring |
| ServiceLens | End-to-end application view that correlates metrics, logs, and AWS X-Ray traces |
36.2 CloudWatch Metrics
Metric Types
- Standard metrics (published automatically by AWS services)
  - EC2: CPUUtilization, NetworkIn/NetworkOut, DiskReadOps/DiskWriteOps, StatusCheckFailed
  - RDS: CPUUtilization, FreeStorageSpace, DatabaseConnections, ReadIOPS/WriteIOPS
  - Lambda: Invocations, Duration, Errors, Throttles
- Custom metrics (published by you), with these sources:
  - CloudWatch Agent (memory, disk, processes)
  - Application code (PutMetricData API)
  - CloudWatch Embedded Metric Format (EMF)
  - StatsD and collectd (through the CloudWatch Agent)

Metric Structure
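As a rough illustration of the fields covered in this section, the sketch below assembles one data point in the shape the PutMetricData API accepts. The helper name `build_metric_datum` and the `MyApplication` namespace are hypothetical; the code only builds the request and does not call AWS.

```python
from datetime import datetime, timezone


def build_metric_datum(name, value, unit, dimensions, timestamp=None):
    """Return one MetricData entry (hypothetical helper, not part of any SDK)."""
    return {
        "MetricName": name,
        # Dimensions are name/value pairs that identify this particular metric.
        "Dimensions": [{"Name": k, "Value": v} for k, v in dimensions.items()],
        "Timestamp": timestamp or datetime.now(timezone.utc),
        "Value": float(value),
        "Unit": unit,
    }


request = {
    "Namespace": "MyApplication",  # assumed custom namespace
    "MetricData": [
        build_metric_datum(
            "RequestLatency",
            50,
            "Milliseconds",
            {"Server": "Web-01", "Region": "us-east-1"},
            timestamp=datetime(2024, 1, 15, 12, 0, tzinfo=timezone.utc),
        )
    ],
}
print(request)

# Assuming boto3 is installed and credentials are configured, the request
# could be sent with: boto3.client("cloudwatch").put_metric_data(**request)
```

Two data points with the same metric name but different dimension values are treated as separate metrics, which is why the dimensions are part of the metric's identity rather than free-form tags.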
A metric data point is made up of the following fields:

| Field | Example |
|---|---|
| Namespace | AWS/EC2 (or a custom namespace) |
| MetricName | CPUUtilization |
| Dimensions | InstanceId=i-1234567890abcdef0, InstanceType=t3.micro |
| Timestamp | 2024-01-15T12:00:00Z |
| Value | 75.5 |
| Unit | Percent |

Dimensions:

- Name/value pairs that identify a metric and are used for filtering
- Up to 30 dimensions per metric
- Examples: InstanceId, ServiceName, FunctionName

Metric Statistics
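To make the statistics in this section concrete, here is a minimal sketch that computes them over one period's worth of samples. The percentile uses the simple nearest-rank method, which is an assumption for illustration and may differ from how CloudWatch computes percentiles internally; the sample values are made up.

```python
import math


def summarize(samples):
    """Compute the five basic statistics for one period of samples."""
    return {
        "SampleCount": len(samples),
        "Sum": sum(samples),
        "Average": sum(samples) / len(samples),
        "Minimum": min(samples),
        "Maximum": max(samples),
    }


def percentile(samples, p):
    """Nearest-rank percentile for an integer p in 1..100 (illustrative only)."""
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical request latencies (ms) collected during a single period.
latencies_ms = [12, 15, 11, 14, 13, 250, 16, 12, 18, 13]

stats = summarize(latencies_ms)
print(stats["Average"])              # 37.4
print(percentile(latencies_ms, 50))  # 13
print(percentile(latencies_ms, 99))  # 250
```

The single 250 ms outlier pulls the average to 37.4 ms while the median stays at 13 ms, which is the usual argument for alarming on percentiles rather than on the average for latency metrics.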
Basic statistics:

- Average
- Minimum
- Maximum
- SampleCount
- Sum

Extended statistics:

- Percentiles: p50 (median), p90, p95, p99
- TM (trimmed mean)
- TS (trimmed sum)

Periods:

- The period is the length of time, in seconds, over which a statistic is aggregated
- High-resolution custom metrics: periods of 1, 5, 10, or 30 seconds
- Standard-resolution metrics: 60 seconds or any multiple of 60 (for example 300 for 5 minutes, 3600 for 1 hour, 86400 for 1 day)
- EC2 basic monitoring publishes data every 5 minutes; detailed monitoring publishes every 1 minute

36.3 CloudWatch Alarms
Alarm States
An alarm is always in one of three states:

| State | Meaning |
|---|---|
| OK | Normal state; the metric is within the threshold |
| ALARM | The threshold has been breached |
| INSUFFICIENT_DATA | Not enough data to determine the state |

Alarm Configuration
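The way the threshold settings in this section interact can be sketched as an "M out of N" check: the alarm fires when at least DatapointsToAlarm of the last EvaluationPeriods datapoints breach the threshold. The function below is a deliberately simplified model, not the service's actual algorithm; in particular it ignores the alarm's missing-data treatment and simply skips empty periods.

```python
import operator

# Comparison operator names as used by the alarm API.
COMPARATORS = {
    "GreaterThanThreshold": operator.gt,
    "GreaterThanOrEqualToThreshold": operator.ge,
    "LessThanThreshold": operator.lt,
    "LessThanOrEqualToThreshold": operator.le,
}


def evaluate_alarm(datapoints, threshold, comparison,
                   evaluation_periods, datapoints_to_alarm):
    """Return 'ALARM', 'OK', or 'INSUFFICIENT_DATA' for the latest window.

    datapoints: one value per period, oldest first; None means no data.
    """
    window = datapoints[-evaluation_periods:]
    present = [v for v in window if v is not None]
    if not present:
        return "INSUFFICIENT_DATA"
    breaches = COMPARATORS[comparison]
    breaching = sum(1 for v in present if breaches(v, threshold))
    return "ALARM" if breaching >= datapoints_to_alarm else "OK"


# CPU above 80% for 2 out of 2 periods (hypothetical values).
print(evaluate_alarm([40, 85, 90], 80, "GreaterThanThreshold", 2, 2))  # ALARM
# Only 1 of the last 3 periods breaches, but 2 are required.
print(evaluate_alarm([85, 60, 70], 80, "GreaterThanThreshold", 3, 2))  # OK
# No data at all in the evaluation window.
print(evaluate_alarm([None, None], 80, "GreaterThanThreshold", 2, 2))  # INSUFFICIENT_DATA
```

Setting DatapointsToAlarm lower than EvaluationPeriods lets an alarm tolerate brief recoveries inside the window, while setting them equal requires a sustained breach.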
An alarm definition has three parts:

- Metric
  - Namespace, MetricName, Dimensions
- Threshold
  - ComparisonOperator (>=, <=, >, <, etc.)
  - Threshold value
  - EvaluationPeriods (number of periods to evaluate)
  - DatapointsToAlarm (number of breaching datapoints required)
- Actions
  - AlarmActions (run when the alarm enters the ALARM state)
  - OKActions (run when the alarm enters the OK state)
  - InsufficientDataActions (run when the alarm enters the INSUFFICIENT_DATA state)

Alarm Actions
- Notification actions
  - SNS notifications
  - Email, SMS, and HTTP endpoints (as SNS subscriptions)
- Auto Scaling actions
  - Scale out (add instances)
  - Scale in (remove instances)
- EC2 actions
  - Stop, terminate, reboot, or recover instances
- Systems Manager actions
  - Create an OpsItem in OpsCenter or an incident in Incident Manager

Composite Alarms
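As a small illustration of the rule logic described in this section, the sketch below models a composite rule as nested tuples, renders it as a rule expression string, and evaluates it against a set of child alarm states. The tuple representation and both function names are assumptions made for this example; CloudWatch itself takes the rule as a string.

```python
STATE_FUNCTIONS = ("ALARM", "OK", "INSUFFICIENT_DATA")


def render(rule):
    """Render a nested-tuple rule as a rule expression string."""
    op = rule[0]
    if op in STATE_FUNCTIONS:
        return f"{op}({rule[1]})"
    if op == "NOT":
        return f"NOT ({render(rule[1])})"
    if op in ("AND", "OR"):
        return "(" + f" {op} ".join(render(r) for r in rule[1:]) + ")"
    raise ValueError(f"unknown operator: {op}")


def evaluate(rule, states):
    """Return True if the rule holds for the given child alarm states."""
    op = rule[0]
    if op in STATE_FUNCTIONS:
        return states[rule[1]] == op
    if op == "NOT":
        return not evaluate(rule[1], states)
    if op in ("AND", "OR"):
        results = [evaluate(r, states) for r in rule[1:]]
        return all(results) if op == "AND" else any(results)
    raise ValueError(f"unknown operator: {op}")


rule = ("AND", ("ALARM", "cpu_high"), ("ALARM", "memory_high"))
print(render(rule))  # (ALARM(cpu_high) AND ALARM(memory_high))

# Only one child alarm is firing, so the composite stays quiet.
print(evaluate(rule, {"cpu_high": "ALARM", "memory_high": "OK"}))     # False
print(evaluate(rule, {"cpu_high": "ALARM", "memory_high": "ALARM"}))  # True
```

Because the composite only fires when both children are in ALARM, a short CPU spike on its own does not page anyone, which is how composite alarms cut down on false positives.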
Composite alarms combine multiple alarms with Boolean logic.

Example: `ALARM(cpu_high) AND ALARM(memory_high)`

- Triggers only when both conditions are met
- Reduces false positives
- Supports complex monitoring scenarios

Operators:

- AND (all conditions must be true)
- OR (at least one condition must be true)
- NOT (negates a condition)

36.4 CloudWatch Dashboards
Dashboard Overview
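Dashboards are defined by a JSON body that lists widgets and their positions on a 24-column grid. The sketch below builds such a body for one text widget and one line chart; the helper names, titles, and region are placeholders chosen for illustration, and nothing is sent to AWS.

```python
import json


def text_widget(markdown, x=0, y=0):
    """A text widget that renders Markdown (hypothetical helper)."""
    return {
        "type": "text",
        "x": x, "y": y, "width": 24, "height": 2,
        "properties": {"markdown": markdown},
    }


def metric_widget(title, namespace, metric, dim_name, dim_value, x=0, y=0):
    """A line chart for a single metric (hypothetical helper)."""
    return {
        "type": "metric",
        "x": x, "y": y, "width": 12, "height": 6,
        "properties": {
            "title": title,
            "view": "timeSeries",
            "metrics": [[namespace, metric, dim_name, dim_value]],
            "stat": "Average",
            "period": 300,
            "region": "us-east-1",  # assumed region
        },
    }


body = {
    "widgets": [
        text_widget("# Web tier overview"),
        metric_widget("CPU", "AWS/EC2", "CPUUtilization",
                      "InstanceId", "i-1234567890abcdef0", y=2),
    ]
}
dashboard_body = json.dumps(body, indent=2)
print(dashboard_body)

# Assuming boto3 and credentials are available, the body could be published with:
# boto3.client("cloudwatch").put_dashboard(
#     DashboardName="web-tier", DashboardBody=dashboard_body)
```

Keeping the dashboard definition in code like this makes it easy to review changes and recreate the same dashboard in another account or region.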
Widgets:

- Line charts
- Stacked area charts
- Number widgets
- Gauge widgets
- Text widgets
- Log table widgets
- Alarm status widgets

Sharing:

- Share dashboards externally
- Embed in external websites
- Cross-account dashboards

36.5 CloudWatch Logs
Logs Overview
CloudWatch Logs is built from four parts:

| Component | Role |
|---|---|
| Log groups | Retention and encryption settings |
| Log streams | Sequence of events from one source |
| Metric filters | Match a pattern and emit a metric |
| Insights | Query and analyze log data |

Log Groups and Streams
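Log events carry a timestamp in milliseconds since the Unix epoch, and a batch written to a stream is expected to be in chronological order. The sketch below converts datetimes to that format and assembles a PutLogEvents-style request; the helper names and messages are hypothetical, and the request is only built, not sent.

```python
from datetime import datetime, timezone


def to_epoch_ms(dt):
    """Convert a timezone-aware datetime to milliseconds since the Unix epoch."""
    return int(dt.timestamp() * 1000)


def build_put_log_events(group, stream, entries):
    """Build a request from (datetime, message) pairs given in any order."""
    events = [{"timestamp": to_epoch_ms(ts), "message": msg} for ts, msg in entries]
    events.sort(key=lambda e: e["timestamp"])  # oldest event first
    return {"logGroupName": group, "logStreamName": stream, "logEvents": events}


request = build_put_log_events(
    "/my-application/logs",
    "stream-1",
    [
        (datetime(2024, 1, 15, 10, 0, 5, tzinfo=timezone.utc), "Listening on port 8080"),
        (datetime(2024, 1, 15, 10, 0, 0, tzinfo=timezone.utc), "Application started"),
    ],
)
for event in request["logEvents"]:
    print(event["timestamp"], event["message"])

# Assuming boto3 and credentials are available:
# boto3.client("logs").put_log_events(**request)
```

The first event printed has the timestamp 1705312800000, which corresponds to 2024-01-15T10:00:00Z.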
Log group (example: `/aws/lambda/my-function`):

- Contains log streams, for example `2024/01/15/[$LATEST]abc123` and `2024/01/15/[$LATEST]def456`
- Configuration:
  - Retention (1 day to 10 years, or never expire)
  - Encryption (KMS)
  - Subscription filters

Log stream:

- Sequence of log events
- Usually one per source (instance, function, etc.)
- Ordered by timestamp

Metric Filters
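To show what a space-delimited metric filter does, here is a rough local simulation: each log line is split into fields (bracketed and quoted segments count as one field), the fields are named, and lines whose `status_code` equals 404 are counted. This is an approximation written for illustration, not CloudWatch's actual matching engine, and the log lines are invented.

```python
import re

# Field names for a space-delimited access log, in order.
FIELDS = ["ip", "user", "timestamp", "request", "status_code", "size"]

# A token is a [bracketed] segment, a "quoted" segment, or a run of non-spaces.
TOKEN = re.compile(r'\[[^\]]*\]|"[^"]*"|\S+')


def parse(line):
    """Map a log line to named fields, or None if the field count differs."""
    tokens = TOKEN.findall(line)
    if len(tokens) != len(FIELDS):
        return None
    return dict(zip(FIELDS, tokens))


def count_matches(lines, field, expected):
    """Count lines where the named field equals the expected value."""
    total = 0
    for line in lines:
        record = parse(line)
        if record is not None and record[field] == expected:
            total += 1
    return total


log_lines = [
    '127.0.0.1 frank [15/Jan/2024:10:00:00 +0000] "GET /index.html HTTP/1.1" 200 1534',
    '127.0.0.1 frank [15/Jan/2024:10:00:02 +0000] "GET /missing.html HTTP/1.1" 404 209',
    '10.0.0.7 alice [15/Jan/2024:10:00:03 +0000] "GET /old-page HTTP/1.1" 404 209',
    'malformed line',
]

# Each match would add 1 to a metric such as 404Errors.
print(count_matches(log_lines, "status_code", "404"))  # 2
```

An alarm on the resulting metric then turns a pattern in the logs into a notification, without any change to the application itself.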
Metric filters create metrics from log data.

- Pattern for a space-delimited log line: `[ip, user, timestamp, request, status_code, size]`
- Filter for 404 responses: `[ip, user, timestamp, request, status_code=404, size]`
- Resulting metric: `404Errors`, with a value of 1 for each matching event

Use cases:

- Count HTTP errors
- Track specific events
- Create alarms from logs

36.6 CLI Commands
```bash
# List metrics
aws cloudwatch list-metrics \
  --namespace AWS/EC2

# Get metric statistics
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-1234567890abcdef0 \
  --statistics Average \
  --period 300 \
  --start-time 2024-01-15T00:00:00Z \
  --end-time 2024-01-15T23:59:59Z

# Put metric data (custom metric)
aws cloudwatch put-metric-data \
  --namespace MyApplication \
  --metric-name RequestCount \
  --value 100 \
  --unit Count

# Put metric data with dimensions
aws cloudwatch put-metric-data \
  --namespace MyApplication \
  --metric-name RequestLatency \
  --dimensions Server=Web-01,Region=us-east-1 \
  --value 50 \
  --unit Milliseconds

# Create alarm
aws cloudwatch put-metric-alarm \
  --alarm-name "HighCPU" \
  --alarm-description "Alarm when CPU exceeds 80%" \
  --metric-name CPUUtilization \
  --namespace AWS/EC2 \
  --statistic Average \
  --period 300 \
  --threshold 80 \
  --comparison-operator GreaterThanThreshold \
  --dimensions Name=InstanceId,Value=i-1234567890abcdef0 \
  --evaluation-periods 2 \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:MyTopic

# Describe alarms
aws cloudwatch describe-alarms

# Set alarm state (for testing)
aws cloudwatch set-alarm-state \
  --alarm-name "HighCPU" \
  --state-value ALARM \
  --state-reason "Testing alarm"

# Create log group
aws logs create-log-group \
  --log-group-name /my-application/logs

# Create log stream
aws logs create-log-stream \
  --log-group-name /my-application/logs \
  --log-stream-name stream-1

# Put log events (timestamp is in milliseconds since the Unix epoch)
aws logs put-log-events \
  --log-group-name /my-application/logs \
  --log-stream-name stream-1 \
  --log-events timestamp=1705312800000,message="Application started"

# Create metric filter
aws logs put-metric-filter \
  --log-group-name /my-application/logs \
  --filter-name ErrorCount \
  --filter-pattern "[timestamp, level=ERROR, ...]" \
  --metric-transformations metricName=ErrorCount,metricNamespace=MyApp,metricValue=1

# Start CloudWatch Logs Insights query
# (--start-time and --end-time are required, in seconds since the Unix epoch;
#  the values below cover 2024-01-15 UTC)
aws logs start-query \
  --log-group-name /aws/lambda/my-function \
  --start-time 1705276800 \
  --end-time 1705363199 \
  --query-string "fields @timestamp, @message | filter @message like /ERROR/ | sort @timestamp desc | limit 20"
```

36.7 Best Practices
CloudWatch Best Practices
1. Use namespaces for organization
   - Group related metrics together
   - Use consistent naming conventions
2. Set appropriate retention periods
   - Balance cost against compliance requirements
   - Use shorter retention for non-critical logs
3. Use composite alarms for complex conditions
   - Reduce false positives
   - Create meaningful alerting logic
4. Enable anomaly detection
   - Detects anomalies automatically
   - Removes the need to set static thresholds
5. Use dashboards for visibility
   - Create operational dashboards
   - Share with stakeholders

36.8 Exam Tips
Key Exam Points

1. Basic monitoring metrics from AWS services are free; custom metrics and detailed monitoring are billed
2. Alarms have 3 states: OK, ALARM, INSUFFICIENT_DATA
3. Metric resolution: standard (1 minute) vs. high (1 second)
4. Logs are stored in log groups, which contain log streams
5. Metric filters create metrics from log patterns
6. Composite alarms combine multiple alarms with logic
7. Anomaly detection uses ML for dynamic thresholds
8. Dashboards can be shared externally
9. The CloudWatch Agent collects system-level metrics such as memory and disk usage
10. Alarms can trigger Auto Scaling, SNS, and EC2 actions

36.9 Summary
- CloudWatch Metrics
  - Standard metrics from AWS services
  - Custom metrics from applications
  - Statistics: avg, min, max, sum, percentiles
- CloudWatch Alarms
  - Threshold-based monitoring
  - Actions: SNS, Auto Scaling, EC2
  - Composite alarms for complex logic
- CloudWatch Logs
  - Log groups and streams
  - Metric filters for log-to-metric conversion
  - Insights for log analysis

Next Chapter: Chapter 37: AWS CloudTrail - API Auditing