Chapter 36: Amazon CloudWatch - Metrics & Alarms

Amazon CloudWatch is a monitoring and observability service that provides data and actionable insights for AWS, hybrid, and on-premises applications.

Amazon CloudWatch Overview
+------------------------------------------------------------------+
| |
| +------------------------+ |
| | Amazon CloudWatch | |
| +------------------------+ |
| | |
| +---------------------+---------------------+ |
| | | | | |
| v v v v |
| +----------+ +----------+ +----------+ +----------+ |
| | Metrics | | Alarms | | Logs | | Events | |
| | | | | | | | | |
| | - System | | - Thresh | | - Stream | | - Rules | |
| | - Custom | | - Actions| | - Query | | - Target | |
| | - Dash | | - Compos | | - Insights| | - Schedule| |
| +----------+ +----------+ +----------+ +----------+ |
| |
+------------------------------------------------------------------+
Feature     | Description
Metrics     | System and custom metrics
Alarms      | Threshold-based notifications
Logs        | Log aggregation and analysis
Events      | Event-driven automation (now Amazon EventBridge)
Dashboards  | Visual monitoring
ServiceLens | End-to-end tracing

CloudWatch Metric Types
+------------------------------------------------------------------+
| |
| Standard Metrics (AWS Services) |
| +------------------------------------------------------------+ |
| | | |
| | EC2 Metrics: | |
| | +------------------------------------------------------+ | |
| | | - CPUUtilization | | |
| | | - NetworkIn/NetworkOut | | |
| | | - DiskReadOps/DiskWriteOps | | |
| | | - StatusCheckFailed | | |
| | +------------------------------------------------------+ | |
| | | |
| | RDS Metrics: | |
| | +------------------------------------------------------+ | |
| | | - CPUUtilization | | |
| | | - FreeStorageSpace | | |
| | | - DatabaseConnections | | |
| | | - ReadIOPS/WriteIOPS | | |
| | +------------------------------------------------------+ | |
| | | |
| | Lambda Metrics: | |
| | +------------------------------------------------------+ | |
| | | - Invocations | | |
| | | - Duration | | |
| | | - Errors | | |
| | | - Throttles | | |
| | +------------------------------------------------------+ | |
| | | |
| +------------------------------------------------------------+ |
| |
| Custom Metrics |
| +------------------------------------------------------------+ |
| | | |
| | Sources: | |
| | +------------------------------------------------------+ | |
| | | - CloudWatch Agent (memory, disk, processes) | | |
| | | - Application code (PutMetricData API) | | |
| | | - CloudWatch Embedded Metric Format (EMF) | | |
| | | - StatsD, collectd | | |
| | +------------------------------------------------------+ | |
| | | |
| +------------------------------------------------------------+ |
| |
+------------------------------------------------------------------+
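Of the custom metric sources listed above, the Embedded Metric Format is the one that is just structured logging: a JSON log line carries an `_aws` metadata block naming which top-level fields are metrics and which are dimensions. The following is a minimal sketch of building such a line in Python. The helper name `build_emf_record` and the sample namespace, dimension, and metric values are illustrative assumptions, not part of any AWS SDK.

```python
import json
import time


def build_emf_record(namespace, dimensions, metrics, timestamp_ms=None):
    """Return one EMF log line (a JSON string).

    dimensions: dict of dimension name -> value
    metrics:    dict of metric name -> (value, unit)
    """
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)  # EMF timestamps are epoch milliseconds
    record = {
        "_aws": {
            "Timestamp": timestamp_ms,
            "CloudWatchMetrics": [
                {
                    "Namespace": namespace,
                    # One dimension set; each name listed here must also
                    # appear as a top-level field of the record.
                    "Dimensions": [sorted(dimensions)],
                    "Metrics": [
                        {"Name": name, "Unit": unit}
                        for name, (_value, unit) in metrics.items()
                    ],
                }
            ],
        }
    }
    record.update(dimensions)
    record.update({name: value for name, (value, _unit) in metrics.items()})
    return json.dumps(record)


# Hypothetical sample values, chosen to match the CLI examples in this chapter.
line = build_emf_record(
    namespace="MyApplication",
    dimensions={"Server": "Web-01"},
    metrics={"RequestLatency": (50, "Milliseconds")},
    timestamp_ms=1705312800000,
)
print(line)
```

Writing this line to a log group that CloudWatch ingests (for example from a Lambda function's standard output) is what turns it into a metric; the sketch only produces the JSON.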
CloudWatch Metric Structure
+------------------------------------------------------------------+
| |
| Metric |
| +------------------------------------------------------------+ |
| | | |
| | Namespace: AWS/EC2 (or custom namespace) | |
| | MetricName: CPUUtilization | |
| | Dimensions: | |
| | - InstanceId: i-1234567890abcdef0 | |
| | - InstanceType: t3.micro | |
| | Timestamp: 2024-01-15T12:00:00Z | |
| | Value: 75.5 | |
| | Unit: Percent | |
| | | |
| +------------------------------------------------------------+ |
| |
| Dimensions |
| +------------------------------------------------------------+ |
| | | |
| | - Key-value pairs for filtering | |
| | - Up to 30 dimensions per metric | |
| | - Examples: InstanceId, ServiceName, FunctionName | |
| | | |
| +------------------------------------------------------------+ |
| |
+------------------------------------------------------------------+
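One consequence of the structure above is easy to miss: a metric is identified by its namespace, name, and the full set of dimensions, so the same metric name published with a different dimension set is a separate metric. The sketch below illustrates that with plain Python data. The function names are hypothetical; the dictionary keys mirror the field names of a PutMetricData request entry.

```python
def metric_identity(namespace, metric_name, dimensions):
    """Return a hashable key identifying one metric (order of dimensions ignored)."""
    return (namespace, metric_name, tuple(sorted(dimensions.items())))


def to_metric_datum(metric_name, dimensions, value, unit, timestamp):
    """Shape one data point using PutMetricData-style field names."""
    return {
        "MetricName": metric_name,
        "Dimensions": [
            {"Name": name, "Value": val} for name, val in sorted(dimensions.items())
        ],
        "Timestamp": timestamp,
        "Value": value,
        "Unit": unit,
    }


instance_only = {"InstanceId": "i-1234567890abcdef0"}
instance_and_type = {"InstanceId": "i-1234567890abcdef0", "InstanceType": "t3.micro"}

key_a = metric_identity("AWS/EC2", "CPUUtilization", instance_only)
key_b = metric_identity("AWS/EC2", "CPUUtilization", instance_and_type)
print(key_a == key_b)  # the two dimension sets name two different metrics

datum = to_metric_datum(
    "CPUUtilization", instance_and_type, 75.5, "Percent", "2024-01-15T12:00:00Z"
)
print(datum)
```

In practice this means a query must supply exactly the dimensions the data was published with, otherwise it addresses a different (possibly empty) metric.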
CloudWatch Metric Statistics
+------------------------------------------------------------------+
| |
| Statistics |
| +------------------------------------------------------------+ |
| | | |
| | Basic Statistics: | |
| | +------------------------------------------------------+ | |
| | | - Average (avg) | | |
| | | - Minimum (min) | | |
| | | - Maximum (max) | | |
| | | - SampleCount | | |
| | | - Sum | | |
| | +------------------------------------------------------+ | |
| | | |
| | Extended Statistics (Percentiles, Trimmed Statistics): | |
| | +------------------------------------------------------+ | |
| | | - p50 (median) | | |
| | | - p90, p95, p99 | | |
| | | - TM (trimmed mean) | | |
| | | - TS (trimmed sum) | | |
| | +------------------------------------------------------+ | |
| | | |
| +------------------------------------------------------------+ |
| |
| Periods |
| +------------------------------------------------------------+ |
| | | |
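| | - 1, 5, 10, or 30 seconds (high-resolution metrics only) | |
| | - 1 minute (standard resolution; EC2 detailed monitoring) | |
| | - 5 minutes (EC2 basic monitoring) | |
| | - Any multiple of 60 seconds (e.g. 1 hour, 1 day) | |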
| | | |
| +------------------------------------------------------------+ |
| |
+------------------------------------------------------------------+
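To see why percentiles are listed separately from the basic statistics, it helps to compute both over the same data points. The sketch below aggregates one period's worth of sample values in plain Python. It uses the nearest-rank percentile definition as an assumption; CloudWatch's own percentile computation may differ in detail, and the sample latencies are made up.

```python
def summarize(values, percentiles=(50, 90, 99)):
    """Aggregate raw samples into CloudWatch-style statistics for one period."""
    ordered = sorted(values)
    count = len(ordered)
    stats = {
        "SampleCount": count,
        "Sum": sum(ordered),
        "Minimum": ordered[0],
        "Maximum": ordered[-1],
        "Average": sum(ordered) / count,
    }
    for p in percentiles:
        # Nearest-rank: smallest value with at least p% of samples at or below it.
        rank = max(1, (p * count + 99) // 100)  # integer ceil(p * count / 100)
        stats[f"p{p}"] = ordered[rank - 1]
    return stats


# Hypothetical request latencies in milliseconds, with one slow outlier.
latencies_ms = [12, 15, 11, 14, 13, 250, 16, 12, 15, 14]
stats = summarize(latencies_ms)
for name, value in stats.items():
    print(f"{name}: {value}")
```

With one outlier in ten samples, the average is pulled well above what a typical request sees, while p50 stays at the typical value and p99 surfaces the outlier. That is the usual reason to alarm on a percentile rather than on the average for latency metrics.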

CloudWatch Alarm States
+------------------------------------------------------------------+
| |
| +------------------------+ |
| | Alarm States | |
| +------------------------+ |
| | |
| +---------------------+---------------------+ |
| | | | |
| v v v |
| +----------+ +----------+ +----------+ |
| | OK | | ALARM | | INSUFFICIENT_DATA | |
| | | | | | | |
| | - Normal | | - Thresh | | - Not | |
| | State | | Exceed | | enough | |
| | | | | | data | |
| +----------+ +----------+ +----------+ |
| |
+------------------------------------------------------------------+
CloudWatch Alarm Configuration
+------------------------------------------------------------------+
| |
| Alarm Components |
| +------------------------------------------------------------+ |
| | | |
| | Metric: | |
| | +------------------------------------------------------+ | |
| | | - Namespace, MetricName, Dimensions | | |
| | +------------------------------------------------------+ | |
| | | |
| | Threshold: | |
| | +------------------------------------------------------+ | |
| | | - ComparisonOperator (>=, <=, >, <, etc.) | | |
| | | - Threshold value | | |
| | | - EvaluationPeriods (number of periods) | | |
| | | - DatapointsToAlarm (breaching datapoints) | | |
| | +------------------------------------------------------+ | |
| | | |
| | Actions: | |
| | +------------------------------------------------------+ | |
| | | - AlarmActions (when ALARM state) | | |
| | | - OKActions (when OK state) | | |
| | | - InsufficientDataActions | | |
| | +------------------------------------------------------+ | |
| | | |
| +------------------------------------------------------------+ |
| |
+------------------------------------------------------------------+
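The interaction between EvaluationPeriods and DatapointsToAlarm ("M out of N") is easier to follow in code. The following is a simplified model, not CloudWatch's actual evaluator: it looks only at the most recent N periods and simply ignores missing data points, whereas a real alarm's handling of missing data depends on its TreatMissingData setting. The function name `evaluate_alarm` is hypothetical; the comparison operator names are the ones the alarm API uses.

```python
import operator

COMPARISONS = {
    "GreaterThanThreshold": operator.gt,
    "GreaterThanOrEqualToThreshold": operator.ge,
    "LessThanThreshold": operator.lt,
    "LessThanOrEqualToThreshold": operator.le,
}


def evaluate_alarm(datapoints, threshold, comparison_operator,
                   evaluation_periods, datapoints_to_alarm=None):
    """Return 'OK', 'ALARM', or 'INSUFFICIENT_DATA' for the latest window.

    datapoints: one aggregated value per period, oldest first; None = missing.
    """
    if datapoints_to_alarm is None:
        # Default: every period in the window must breach.
        datapoints_to_alarm = evaluation_periods
    window = datapoints[-evaluation_periods:]
    present = [value for value in window if value is not None]
    if not present:
        return "INSUFFICIENT_DATA"
    breaches = COMPARISONS[comparison_operator]
    breaching = sum(1 for value in present if breaches(value, threshold))
    return "ALARM" if breaching >= datapoints_to_alarm else "OK"


# Average CPU per 5-minute period, oldest first (made-up values).
print(evaluate_alarm([40, 85, 90], 80, "GreaterThanThreshold", 2))     # 2 of 2 breach
print(evaluate_alarm([40, 85, 70], 80, "GreaterThanThreshold", 2))     # only 1 of 2
print(evaluate_alarm([85, 40, 90], 80, "GreaterThanThreshold", 3, 2))  # 2 of 3 breach
print(evaluate_alarm([None, None], 80, "GreaterThanThreshold", 2))     # no data
```

Setting DatapointsToAlarm lower than EvaluationPeriods lets an alarm tolerate a brief recovery inside the window, at the cost of reacting to shorter spikes.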
CloudWatch Alarm Actions
+------------------------------------------------------------------+
| |
| Notification Actions |
| +------------------------------------------------------------+ |
| | - SNS notifications | |
| | - Email, SMS, HTTP endpoints | |
| +------------------------------------------------------------+ |
| |
| Auto Scaling Actions |
| +------------------------------------------------------------+ |
| | - ScaleOut (add instances) | |
| | - ScaleIn (remove instances) | |
| +------------------------------------------------------------+ |
| |
| EC2 Actions |
| +------------------------------------------------------------+ |
| | - Stop, Terminate, Reboot, Recover instances | |
| +------------------------------------------------------------+ |
| |
| Systems Manager Actions |
| +------------------------------------------------------------+ |
| | - Start Automation document | |
| +------------------------------------------------------------+ |
| |
+------------------------------------------------------------------+
Composite Alarms
+------------------------------------------------------------------+
| |
| Combine multiple alarms with logic |
| +------------------------------------------------------------+ |
| | | |
| | Example: | |
| | +------------------------------------------------------+ | |
| | | ALARM(cpu_high) AND ALARM(memory_high) | | |
| | | | | |
| | | - Trigger only when both conditions are met | | |
| | | - Reduce false positives | | |
| | | - Complex monitoring scenarios | | |
| | +------------------------------------------------------+ | |
| | | |
| | Operators: | |
| | +------------------------------------------------------+ | |
| | | - AND (all conditions must be true) | | |
| | | - OR (at least one condition must be true) | | |
| | | - NOT (negate condition) | | |
| | +------------------------------------------------------+ | |
| | | |
| +------------------------------------------------------------+ |
| |
+------------------------------------------------------------------+
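A composite alarm's rule is a boolean expression over the current states of other alarms. The sketch below evaluates such a rule offline against a dictionary of alarm states, which can be handy for reasoning about a rule before creating it. It is an illustrative evaluator under stated assumptions, not AWS code: `evaluate_rule` is a hypothetical name, and only the state functions ALARM, OK, and INSUFFICIENT_DATA plus AND, OR, NOT, and parentheses are handled.

```python
import re

_STATE_CHECK = re.compile(r'\b(ALARM|OK|INSUFFICIENT_DATA)\(\s*"?([^)"]+?)"?\s*\)')
_SAFE_EXPR = re.compile(r"(?:\s*(?:True|False|and|or|not|\(|\)))*\s*")


def evaluate_rule(rule, alarm_states):
    """Return True if the rule holds for the given {alarm_name: state} mapping."""
    def check(match):
        wanted_state, alarm_name = match.group(1), match.group(2)
        return str(alarm_states[alarm_name] == wanted_state)

    # 1. Replace each state function, e.g. ALARM(cpu_high), with True/False.
    expr = _STATE_CHECK.sub(check, rule)
    # 2. Map the rule's operators onto Python's boolean operators.
    for keyword in ("AND", "OR", "NOT"):
        expr = re.sub(rf"\b{keyword}\b", keyword.lower(), expr)
    # 3. Refuse anything that is not a pure boolean expression before eval.
    if not _SAFE_EXPR.fullmatch(expr):
        raise ValueError(f"unsupported rule syntax: {rule!r}")
    return eval(expr, {"__builtins__": {}}, {})


states = {"cpu_high": "ALARM", "memory_high": "OK"}
print(evaluate_rule("ALARM(cpu_high) AND ALARM(memory_high)", states))
print(evaluate_rule("ALARM(cpu_high) OR ALARM(memory_high)", states))
print(evaluate_rule("ALARM(cpu_high) AND NOT ALARM(memory_high)", states))
```

With only `cpu_high` in ALARM, the AND rule from the example above stays quiet, which is exactly the false-positive reduction the diagram describes.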

CloudWatch Dashboards
+------------------------------------------------------------------+
| |
| Dashboard Features |
| +------------------------------------------------------------+ |
| | | |
| | Widgets: | |
| | +------------------------------------------------------+ | |
| | | - Line charts | | |
| | | - Stacked area charts | | |
| | | - Number widgets | | |
| | | - Gauge widgets | | |
| | | - Text widgets | | |
| | | - Log table widgets | | |
| | | - Alarm status widgets | | |
| | +------------------------------------------------------+ | |
| | | |
| | Sharing: | |
| | +------------------------------------------------------+ | |
| | | - Share dashboards externally | | |
| | | - Embed in external websites | | |
| | | - Cross-account dashboards | | |
| | +------------------------------------------------------+ | |
| | | |
| +------------------------------------------------------------+ |
| |
+------------------------------------------------------------------+

CloudWatch Logs Architecture
+------------------------------------------------------------------+
| |
| +------------------------+ |
| | CloudWatch Logs | |
| +------------------------+ |
| | |
| +---------------------+---------------------+ |
| | | | | |
| v v v v |
| +----------+ +----------+ +----------+ +----------+ |
| | Log | | Log | | Metric | | Insights | |
| | Groups | | Streams | | Filters | | | |
| | | | | | | | | |
| | - Retent | | - Source | | - Pattern| | - Query | |
| | - Encrypt| | - Seq | | - Metric | | - Analyze| |
| +----------+ +----------+ +----------+ +----------+ |
| |
+------------------------------------------------------------------+
CloudWatch Logs Structure
+------------------------------------------------------------------+
| |
| Log Group |
| +------------------------------------------------------------+ |
| | | |
| | /aws/lambda/my-function | |
| | +--------------------------------------------------------+ | |
| | | Log Stream: 2024-01-15/[$LATEST]abc123 | | |
| | | Log Stream: 2024-01-15/[$LATEST]def456 | | |
| | +--------------------------------------------------------+ | |
| | | |
| | Configuration: | |
| | +------------------------------------------------------+ | |
| | | - Retention (1 day to 10 years, or never expire) | | |
| | | - Encryption (KMS) | | |
| | | - Subscription filters | | |
| | +------------------------------------------------------+ | |
| | | |
| +------------------------------------------------------------+ |
| |
| Log Stream |
| +------------------------------------------------------------+ |
| | | |
| | - Sequence of log events | |
| | - Usually one per source (instance, function, etc.) | |
| | - Ordered by timestamp | |
| | | |
| +------------------------------------------------------------+ |
| |
+------------------------------------------------------------------+
CloudWatch Metric Filters
+------------------------------------------------------------------+
| |
| Create metrics from log data |
| +------------------------------------------------------------+ |
| | | |
| | Pattern: | |
| | +------------------------------------------------------+ | |
| | | [ip, user, timestamp, request, status_code, size] | | |
| | | | | |
| | | Filter: [status_code=404] | | |
| | | Metric: 404Errors, Value: 1 | | |
| | +------------------------------------------------------+ | |
| | | |
| | Use Cases: | |
| | +------------------------------------------------------+ | |
| | | - Count HTTP errors | | |
| | | - Track specific events | | |
| | | - Create alarms from logs | | |
| | +------------------------------------------------------+ | |
| | | |
| +------------------------------------------------------------+ |
| |
+------------------------------------------------------------------+
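The space-delimited pattern above assigns a name to each field of a log line and then matches on one of them. A rough local emulation is sketched below, assuming that bracketed and double-quoted spans count as single fields and that a line must have exactly as many fields as the pattern names. The helper names and sample log lines are made up, and CloudWatch's real matcher supports more syntax than this.

```python
import re

FIELD_NAMES = ["ip", "user", "timestamp", "request", "status_code", "size"]

# A field is a [bracketed span], a "quoted span", or a run of non-space characters.
_FIELD = re.compile(r'\[[^\]]*\]|"[^"]*"|\S+')


def parse_event(line):
    """Split one log line into named fields, or return None if it does not fit."""
    fields = _FIELD.findall(line)
    if len(fields) != len(FIELD_NAMES):
        return None
    return dict(zip(FIELD_NAMES, fields))


def count_matches(lines, field, expected):
    """Emulate a metric filter that emits the value 1 for every matching event."""
    total = 0
    for line in lines:
        event = parse_event(line)
        if event is not None and event[field] == expected:
            total += 1
    return total


access_log = [
    '127.0.0.1 frank [15/Jan/2024:12:00:00 +0000] "GET /index.html HTTP/1.1" 200 1534',
    '127.0.0.1 frank [15/Jan/2024:12:00:01 +0000] "GET /missing HTTP/1.1" 404 312',
    '10.0.0.7 - [15/Jan/2024:12:00:02 +0000] "GET /old-page HTTP/1.1" 404 312',
    'malformed line',
]
print(count_matches(access_log, "status_code", "404"))
```

The resulting count corresponds to the Sum statistic of the `404Errors` metric for the period in which those events arrived, which is what an alarm on that metric would evaluate.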

# List metrics
aws cloudwatch list-metrics \
    --namespace AWS/EC2

# Get metric statistics
aws cloudwatch get-metric-statistics \
    --namespace AWS/EC2 \
    --metric-name CPUUtilization \
    --dimensions Name=InstanceId,Value=i-1234567890abcdef0 \
    --statistics Average \
    --period 300 \
    --start-time 2024-01-15T00:00:00Z \
    --end-time 2024-01-15T23:59:59Z

# Put metric data (custom metric)
aws cloudwatch put-metric-data \
    --namespace MyApplication \
    --metric-name RequestCount \
    --value 100 \
    --unit Count

# Put metric data with dimensions
aws cloudwatch put-metric-data \
    --namespace MyApplication \
    --metric-name RequestLatency \
    --dimensions Server=Web-01,Region=us-east-1 \
    --value 50 \
    --unit Milliseconds

# Create alarm
aws cloudwatch put-metric-alarm \
    --alarm-name "HighCPU" \
    --alarm-description "Alarm when CPU exceeds 80%" \
    --metric-name CPUUtilization \
    --namespace AWS/EC2 \
    --statistic Average \
    --period 300 \
    --threshold 80 \
    --comparison-operator GreaterThanThreshold \
    --dimensions Name=InstanceId,Value=i-1234567890abcdef0 \
    --evaluation-periods 2 \
    --alarm-actions arn:aws:sns:us-east-1:123456789012:MyTopic

# Describe alarms
aws cloudwatch describe-alarms

# Set alarm state (for testing)
aws cloudwatch set-alarm-state \
    --alarm-name "HighCPU" \
    --state-value ALARM \
    --state-reason "Testing alarm"

# Create log group
aws logs create-log-group \
    --log-group-name /my-application/logs

# Create log stream
aws logs create-log-stream \
    --log-group-name /my-application/logs \
    --log-stream-name stream-1

# Put log events
aws logs put-log-events \
    --log-group-name /my-application/logs \
    --log-stream-name stream-1 \
    --log-events timestamp=1705312800000,message="Application started"

# Create metric filter
aws logs put-metric-filter \
    --log-group-name /my-application/logs \
    --filter-name ErrorCount \
    --filter-pattern "[timestamp, level=ERROR, ...]" \
    --metric-transformations metricName=ErrorCount,metricNamespace=MyApp,metricValue=1

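# Start a CloudWatch Logs Insights query
# (--start-time and --end-time are required, in seconds since the epoch;
#  the values here cover 2024-01-15 UTC)
aws logs start-query \
    --log-group-name /aws/lambda/my-function \
    --start-time 1705276800 \
    --end-time 1705363199 \
    --query-string "fields @timestamp, @message | filter @message like /ERROR/ | sort @timestamp desc | limit 20"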
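
The commands above mix timestamp formats: `get-metric-statistics` accepts ISO 8601 strings, `put-log-events` expects milliseconds since the epoch, and `start-query` expects seconds since the epoch for its time range. A small conversion helper avoids off-by-1000 mistakes. This is a sketch using only the Python standard library; the function names are hypothetical and the input is assumed to be a UTC timestamp ending in `Z`.

```python
from datetime import datetime, timezone


def to_epoch_seconds(iso_utc):
    """Convert 'YYYY-MM-DDTHH:MM:SSZ' (UTC) to whole seconds since the epoch."""
    parsed = datetime.strptime(iso_utc, "%Y-%m-%dT%H:%M:%SZ")
    return int(parsed.replace(tzinfo=timezone.utc).timestamp())


def to_epoch_millis(iso_utc):
    """Convert 'YYYY-MM-DDTHH:MM:SSZ' (UTC) to milliseconds since the epoch."""
    return to_epoch_seconds(iso_utc) * 1000


# Value in the form used by put-log-events (milliseconds).
print(to_epoch_millis("2024-01-15T10:00:00Z"))

# Values in the form used by start-query (seconds).
print(to_epoch_seconds("2024-01-15T00:00:00Z"))
print(to_epoch_seconds("2024-01-15T23:59:59Z"))
```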

CloudWatch Best Practices
+------------------------------------------------------------------+
| |
| 1. Use namespaces for organization |
| +------------------------------------------------------------+ |
| | - Group related metrics together | |
| | - Use consistent naming conventions | |
| +------------------------------------------------------------+ |
| |
| 2. Set appropriate retention periods |
| +------------------------------------------------------------+ |
| | - Balance cost vs. compliance requirements | |
| | - Use shorter retention for non-critical logs | |
| +------------------------------------------------------------+ |
| |
| 3. Use composite alarms for complex conditions |
| +------------------------------------------------------------+ |
| | - Reduce false positives | |
| | - Create meaningful alerting logic | |
| +------------------------------------------------------------+ |
| |
| 4. Enable anomaly detection |
| +------------------------------------------------------------+ |
| | - Automatic detection of anomalies | |
| | - No need to set static thresholds | |
| +------------------------------------------------------------+ |
| |
| 5. Use dashboards for visibility |
| +------------------------------------------------------------+ |
| | - Create operational dashboards | |
| | - Share with stakeholders | |
| +------------------------------------------------------------+ |
| |
+------------------------------------------------------------------+
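Practice 4 contrasts anomaly detection with static thresholds. The idea of a threshold that follows the data can be illustrated with a much simpler stand-in than what CloudWatch uses: the sketch below derives an expected band from the mean and standard deviation of recent values. This is explicitly not CloudWatch's model, which is trained on the metric's history; the function names, the band width `k`, and the sample values are all assumptions for illustration.

```python
import statistics


def expected_band(history, k=2.0):
    """Return (lower, upper) bounds as mean +/- k population standard deviations."""
    center = statistics.mean(history)
    spread = statistics.pstdev(history)
    return center - k * spread, center + k * spread


def is_anomalous(history, value, k=2.0):
    """True if value falls outside the band derived from recent history."""
    lower, upper = expected_band(history, k)
    return value < lower or value > upper


# Made-up recent CPU averages (percent) for one instance.
recent_cpu = [50, 52, 48, 51, 49, 50, 53, 47]
lower, upper = expected_band(recent_cpu)
print(f"expected band: {lower:.1f} to {upper:.1f}")
print(is_anomalous(recent_cpu, 51))
print(is_anomalous(recent_cpu, 75))
```

A static threshold of 80% would never fire on the value 75 here, while a band derived from the workload's own behaviour flags it. That difference is the motivation for the practice, whatever model produces the band.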

Exam Tip

Key Exam Points
+------------------------------------------------------------------+
| |
| 1. Basic monitoring metrics are free; custom metrics are billed |
| |
| 2. Alarms have 3 states: OK, ALARM, INSUFFICIENT_DATA |
| |
| 3. Metric resolution: Standard (1 min) vs High (1 sec) |
| |
| 4. Logs are stored in log groups, streams within groups |
| |
| 5. Metric filters create metrics from log patterns |
| |
| 6. Composite alarms combine multiple alarms with logic |
| |
| 7. Anomaly detection uses ML for dynamic thresholds |
| |
| 8. Dashboards can be shared externally |
| |
| 9. CloudWatch Agent collects system-level metrics |
| |
| 10. Alarms can trigger Auto Scaling, SNS, EC2 actions |
| |
+------------------------------------------------------------------+

Chapter 36 Summary
+------------------------------------------------------------------+
| |
| CloudWatch Metrics |
| +------------------------------------------------------------+ |
| | - Standard metrics from AWS services | |
| | - Custom metrics from applications | |
| | - Statistics: avg, min, max, sum, percentiles | |
| +------------------------------------------------------------+ |
| |
| CloudWatch Alarms |
| +------------------------------------------------------------+ |
| | - Threshold-based monitoring | |
| | - Actions: SNS, Auto Scaling, EC2 | |
| | - Composite alarms for complex logic | |
| +------------------------------------------------------------+ |
| |
| CloudWatch Logs |
| +------------------------------------------------------------+ |
| | - Log groups and streams | |
| | - Metric filters for log-to-metric conversion | |
| | - Insights for log analysis | |
| +------------------------------------------------------------+ |
| |
+------------------------------------------------------------------+

Next Chapter: Chapter 37: AWS CloudTrail - API Auditing