Chapter 24: Amazon ElastiCache - In-Memory Caching

Amazon ElastiCache is a fully managed in-memory data store service, compatible with Redis and Memcached.

ElastiCache Overview
+------------------------------------------------------------------+
|                                                                  |
|                    +------------------------+                    |
|                    |      ElastiCache       |                    |
|                    +------------------------+                    |
|                                 |                                |
|           +---------------------+---------------------+          |
|           |                     |                     |          |
|           v                     v                     v          |
|   +---------------+     +---------------+     +---------------+  |
|   | Redis         |     | Memcached     |     | Managed       |  |
|   |               |     |               |     | Service       |  |
|   | - Multi-AZ    |     | - Simple      |     |               |  |
|   | - Cluster     |     | - Scale out   |     | - Setup       |  |
|   | - Replication |     | - Cache       |     | - Patch       |  |
|   | - Persistence |     |               |     | - Monitor     |  |
|   +---------------+     +---------------+     +---------------+  |
|                                                                  |
|  Redis: Advanced features, persistence, replication              |
|  Memcached: Simple caching, multi-threaded                       |
|                                                                  |
+------------------------------------------------------------------+

Redis vs Memcached Comparison
+------------------------------------------------------------------+
| |
| Feature          | Redis                | Memcached              |
|------------------|----------------------|------------------------|
| Data Structures  | Strings, Lists, Sets,| Simple key-value       |
|                  | Hashes, etc.         |                        |
|------------------|----------------------|------------------------|
| Persistence      | Yes (AOF, RDB)       | No                     |
|------------------|----------------------|------------------------|
| Replication      | Yes (Primary/Replica)| No                     |
|------------------|----------------------|------------------------|
| Multi-AZ failover| Yes                  | No                     |
|------------------|----------------------|------------------------|
| Clustering       | Yes (sharding)       | Yes (distributed)      |
|------------------|----------------------|------------------------|
| Transactions     | Yes                  | No                     |
|------------------|----------------------|------------------------|
| Pub/Sub          | Yes                  | No                     |
|------------------|----------------------|------------------------|
| Sorting          | Yes                  | No                     |
|------------------|----------------------|------------------------|
| Threading        | Single-threaded      | Multi-threaded         |
|------------------|----------------------|------------------------|
| Use Case         | Complex data,        | Simple caching,        |
|                  | leaderboards,        | session store          |
|                  | session store        |                        |
| |
+------------------------------------------------------------------+

Redis Cluster Mode Disabled
+------------------------------------------------------------------+
| |
| Single Node |
| +----------------------------------------------------------+ |
| | | |
| | +------------------+ | |
| | | Primary Node | | |
| | | (Read/Write) | | |
| | +------------------+ | |
| | | |
| +----------------------------------------------------------+ |
| |
| Primary-Replica (No Cluster) |
| +----------------------------------------------------------+ |
| | | |
| | AZ A AZ B | |
| | +------------------+ +------------------+ | |
| | | Primary Node | | Replica Node | | |
| | | (Read/Write) | | (Read Only) | | |
| | +------------------+ +------------------+ | |
| | | ^ | |
| | +----Replication-------+ | |
| | | |
| | Features: | |
| | - Single shard | |
| | - Up to 5 replicas | |
| | - Automatic failover | |
| | - Data size: Up to node capacity | |
| | | |
| +----------------------------------------------------------+ |
| |
+------------------------------------------------------------------+
Redis Cluster Mode Enabled
+------------------------------------------------------------------+
| |
| Sharded Cluster |
| +----------------------------------------------------------+ |
| | | |
| | Shard 1 Shard 2 Shard 3 | |
| | +----------------+ +----------------+ +--------+ | |
| | | Primary Replica| | Primary Replica| | Primary| | |
| | | Node 1 Node 1 | | Node 2 Node 2 | | Node 3 | | |
| | | (AZ-a) (AZ-b) | | (AZ-a) (AZ-b) | | (AZ-a) | | |
| | +----------------+ +----------------+ +--------+ | |
| | | | | | |
| | v v v | |
| | Slot 0-5460 Slot 5461-10922 Slot 10923-16383 | |
| | | |
| | Features: | |
| | - Up to 500 shards | |
| | - Data partitioned across shards | |
| | - 16384 hash slots | |
| | - Automatic failover per shard | |
| | - Scale out by adding shards | |
| | | |
| +----------------------------------------------------------+ |
| |
+------------------------------------------------------------------+
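
The slot assignment in the diagram above can be sketched in a few lines. This is an illustrative model, not ElastiCache client code: Redis Cluster maps a key to a slot with CRC16 (XMODEM variant) modulo 16384, and hashes only the text inside `{...}` when a key carries a hash tag. The shard names and the `shard_for` helper are assumptions made for this example; the slot ranges are copied from the three-shard diagram.

```python
import binascii

TOTAL_SLOTS = 16384

# Slot ranges copied from the three-shard diagram; shard names are hypothetical.
SHARD_RANGES = {
    "shard-1": range(0, 5461),       # slots 0-5460
    "shard-2": range(5461, 10923),   # slots 5461-10922
    "shard-3": range(10923, 16384),  # slots 10923-16383
}


def key_slot(key: str) -> int:
    """Return the hash slot for a key, honoring {hash tags}."""
    data = key.encode("utf-8")
    start = data.find(b"{")
    if start != -1:
        end = data.find(b"}", start + 1)
        # Only a non-empty tag between the first "{" and the next "}" counts.
        if end != -1 and end != start + 1:
            data = data[start + 1:end]
    # binascii.crc_hqx with an initial value of 0 is the CRC16/XMODEM checksum.
    return binascii.crc_hqx(data, 0) % TOTAL_SLOTS


def shard_for(key: str) -> str:
    """Look up which shard of the diagram owns the key's slot."""
    slot = key_slot(key)
    for name, slots in SHARD_RANGES.items():
        if slot in slots:
            return name
    raise ValueError(f"slot {slot} is not covered by any shard")


for example_key in ("foo", "{user:1}:profile", "{user:1}:cart"):
    print(example_key, key_slot(example_key), shard_for(example_key))
```

Keys that share a hash tag, such as the two `{user:1}` keys, land in the same slot, which is what makes multi-key commands possible in cluster mode.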

Memcached Architecture
+------------------------------------------------------------------+
| |
| Memcached Cluster |
| +----------------------------------------------------------+ |
| | | |
| | Application | |
| | +------------------+ | |
| | | | | |
| | +------------------+ | |
| | | | |
| | v | |
| | Configuration Endpoint | |
| | +------------------+ | |
| | | mycache.cfg.cache| | |
| | | .amazonaws.com | | |
| | +------------------+ | |
| | | | |
| | +----+----+ | |
| | | | | | |
| | v v v | |
| | +----++----++----+ | |
| | |Node||Node||Node| | |
| | | 1 || 2 || 3 | | |
| | +----++----++----+ | |
| | | |
| | Features: | |
| | - No replication | |
| | - Each node independent | |
| | - Client-side sharding | |
| | - Auto-discovery via config endpoint | |
| | | |
| +----------------------------------------------------------+ |
| |
+------------------------------------------------------------------+
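
Because Memcached nodes do not know about each other, the client decides which node holds a key. A minimal sketch of one common approach, consistent hashing, follows. The `HashRing` class, the node names, and the choice of MD5 with 100 virtual points per node are assumptions for illustration; real Memcached client libraries use their own (often ketama-style) schemes.

```python
import bisect
import hashlib


class HashRing:
    """Toy consistent-hash ring mapping keys to independent cache nodes."""

    def __init__(self, nodes, vnodes=100):
        points = []
        for node in nodes:
            # Several virtual points per node spread the keys more evenly.
            for i in range(vnodes):
                points.append((self._hash(f"{node}#{i}"), node))
        points.sort()
        self._hashes = [h for h, _ in points]
        self._nodes = [n for _, n in points]

    @staticmethod
    def _hash(value: str) -> int:
        digest = hashlib.md5(value.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def node_for(self, key: str) -> str:
        # First ring point clockwise from the key's hash, wrapping at the end.
        idx = bisect.bisect(self._hashes, self._hash(key)) % len(self._hashes)
        return self._nodes[idx]


ring = HashRing(["node-1", "node-2", "node-3"])
print(ring.node_for("session:abc"))
```

The design choice worth noting is what happens when a node disappears: only the keys that lived on the removed node move, while a plain `hash(key) % node_count` scheme would remap most keys and empty the cache.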

Lazy Loading Pattern
+------------------------------------------------------------------+
| |
| Read Flow: |
| +----------------------------------------------------------+ |
| | | |
| | Application | |
| | | | |
| | v | |
| | +----------+ | |
| | | Check | | |
| | | Cache | | |
| | +----------+ | |
| | | | |
| | +----+----+ | |
| | | | | |
| | v v | |
| | HIT MISS | |
| | | | | |
| | v v | |
| | Return +----------+ | |
| | Data | Query | | |
| | | Database| | |
| | +----------+ | |
| | | | |
| | v | |
| | +----------+ | |
| | | Write to | | |
| | | Cache | | |
| | +----------+ | |
| | | | |
| | v | |
| | Return Data | |
| | | |
| +----------------------------------------------------------+ |
| |
| Pros: Only requested data cached |
| Cons: Cache miss penalty, stale data possible |
| |
+------------------------------------------------------------------+
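
The read flow above can be sketched as follows. This is a minimal in-process model rather than ElastiCache client code: a dictionary stands in for the cache, `query_database` and `FAKE_DB` are hypothetical stand-ins for the real database, and the TTL is added as one way to bound staleness.

```python
import time


class LazyLoadingCache:
    """Cache-aside reader: check the cache first, load from the DB on a miss."""

    def __init__(self, load_from_db, ttl_seconds=300.0, clock=time.monotonic):
        self._load_from_db = load_from_db
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (value, expires_at)
        self.hits = 0
        self.misses = 0

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[1] > now:
            self.hits += 1          # HIT: return cached data
            return entry[0]
        self.misses += 1            # MISS: query the database ...
        value = self._load_from_db(key)
        self._entries[key] = (value, now + self._ttl)  # ... then write to cache
        return value


# Hypothetical database stand-in that records how often it is queried.
FAKE_DB = {"product:1": "keyboard"}
db_calls = []


def query_database(key):
    db_calls.append(key)
    return FAKE_DB[key]


cache = LazyLoadingCache(query_database, ttl_seconds=60)
first = cache.get("product:1")   # miss: goes to the database
second = cache.get("product:1")  # hit: served from the cache
print(first, second, len(db_calls))
```

The second read never reaches the database, which is the whole point of the pattern; the cost is that the first read pays for both a cache lookup and a database query.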
Write-Through Pattern
+------------------------------------------------------------------+
| |
| Write Flow: |
| +----------------------------------------------------------+ |
| | | |
| | Application | |
| | | | |
| | v | |
| | +----------+ | |
| | | Write to | | |
| | | Cache | | |
| | +----------+ | |
| | | | |
| | v | |
| | +----------+ | |
| | | Write to | | |
| | | Database | | |
| | +----------+ | |
| | | | |
| | v | |
| | Return Success | |
| | | |
| +----------------------------------------------------------+ |
| |
| Pros: Cache always fresh |
| Cons: Write latency, wasted cache for unread data |
| |
+------------------------------------------------------------------+
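
A minimal sketch of the write flow above, using plain dictionaries as stand-ins for the cache and the database. The `WriteThroughStore` class is hypothetical, and the rollback on a failed database write is an assumption of this sketch (the diagram does not say what happens on failure); it is one way to avoid serving a value the database never stored.

```python
_MISSING = object()


class WriteThroughStore:
    """Write to the cache, then synchronously to the database."""

    def __init__(self, cache, database):
        self.cache = cache
        self.database = database

    def write(self, key, value):
        previous = self.cache.get(key, _MISSING)
        self.cache[key] = value
        try:
            self.database[key] = value
        except Exception:
            # Undo the cache write so the cache never holds unpersisted data.
            if previous is _MISSING:
                self.cache.pop(key, None)
            else:
                self.cache[key] = previous
            raise

    def read(self, key):
        # Written keys are always in the cache; fall back to the DB otherwise.
        if key in self.cache:
            return self.cache[key]
        return self.database[key]


store = WriteThroughStore(cache={}, database={})
store.write("session:42", {"user": "john"})
print(store.read("session:42"))
```

Every write pays for two round trips, which is the write latency listed under cons, and keys that are written but never read still occupy cache memory.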
Write-Behind Pattern
+------------------------------------------------------------------+
| |
| Write Flow: |
| +----------------------------------------------------------+ |
| | | |
| | Application | |
| | | | |
| | v | |
| | +----------+ | |
| | | Write to | | |
| | | Cache | | |
| | +----------+ | |
| | | | |
| | v | |
| | Return Success (immediate) | |
| | | | |
| | v (async) | |
| | +----------+ | |
| | | Write to | | |
| | | Database | | |
| | | (async) | | |
| | +----------+ | |
| | | |
| +----------------------------------------------------------+ |
| |
| Pros: Fast writes, reduced database load |
| Cons: Data loss risk, complexity |
| |
+------------------------------------------------------------------+
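
The asynchronous flow above can be modeled without threads by separating the write from an explicit flush step. In this sketch `WriteBehindStore` is a hypothetical class, dictionaries stand in for the cache and database, and `flush()` represents whatever background worker would drain the queue in a real system.

```python
from collections import OrderedDict


class WriteBehindStore:
    """Acknowledge writes from the cache; persist them to the DB later."""

    def __init__(self, database):
        self.cache = {}
        self.database = database
        self._dirty = OrderedDict()  # key -> latest value awaiting flush

    def write(self, key, value):
        self.cache[key] = value
        # Repeated writes to one key collapse into a single pending DB write.
        self._dirty[key] = value
        self._dirty.move_to_end(key)

    def pending(self):
        return len(self._dirty)

    def flush(self):
        """Persist pending writes in order; returns how many DB writes ran."""
        flushed = 0
        while self._dirty:
            key, value = self._dirty.popitem(last=False)
            self.database[key] = value
            flushed += 1
        return flushed


store = WriteBehindStore(database={})
for count in (1, 2, 3):
    store.write("page:views", count)   # returns immediately
print(store.pending(), store.database)  # one pending write, DB still empty
store.flush()
print(store.database)
```

Until `flush()` runs, the only copy of the data is in the cache, so losing the cache node at that point loses the writes. That window is the data loss risk listed above.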

Redis Data Structures
+------------------------------------------------------------------+
| |
| Strings |
| +----------------------------------------------------------+ |
| | SET key value | |
| | GET key | |
| | INCR counter | |
| | Use: Caching, counters, session data | |
| +----------------------------------------------------------+ |
| |
| Hashes |
| +----------------------------------------------------------+ |
| | HSET user:1 name "John" email "john@ex.com" | |
| | HGET user:1 name | |
| | HGETALL user:1 | |
| | Use: User profiles, product info | |
| +----------------------------------------------------------+ |
| |
| Lists |
| +----------------------------------------------------------+ |
| | LPUSH queue task1 | |
| | RPOP queue | |
| | LRANGE queue 0 -1 | |
| | Use: Message queues, activity feeds | |
| +----------------------------------------------------------+ |
| |
| Sets |
| +----------------------------------------------------------+ |
| | SADD tags "redis" "database" | |
| | SMEMBERS tags | |
| | SINTER set1 set2 | |
| | Use: Tags, unique items, social graphs | |
| +----------------------------------------------------------+ |
| |
| Sorted Sets |
| +----------------------------------------------------------+ |
| | ZADD leaderboard 100 "player1" | |
| | ZRANGE leaderboard 0 -1 WITHSCORES | |
| | ZREVRANK leaderboard "player1" | |
| | Use: Leaderboards, rankings, rate limiting | |
| +----------------------------------------------------------+ |
| |
+------------------------------------------------------------------+
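
To make the sorted-set commands above concrete, here is a small pure-Python model of what `ZADD`, `ZRANGE ... WITHSCORES`, and `ZREVRANK` return. It is an emulation for illustration only (the `Leaderboard` class is hypothetical and does not talk to Redis); it follows the documented ordering, where members sort by score and ties are broken lexicographically.

```python
class Leaderboard:
    """Pure-Python stand-in for a Redis sorted set."""

    def __init__(self):
        self._scores = {}  # member -> score

    def zadd(self, member, score):
        self._scores[member] = score

    def zrange_withscores(self):
        # Ascending by score, then by member name for equal scores.
        return sorted(self._scores.items(), key=lambda item: (item[1], item[0]))

    def zrevrank(self, member):
        """0-based rank with the highest score first, or None if absent."""
        if member not in self._scores:
            return None
        descending = [name for name, _ in reversed(self.zrange_withscores())]
        return descending.index(member)


board = Leaderboard()
board.zadd("player1", 100)
board.zadd("player2", 250)
board.zadd("player3", 175)
print(board.zrange_withscores())
print(board.zrevrank("player2"))
```

Redis keeps the set ordered as members are added, so rank lookups stay cheap there; this model re-sorts on every call and is only meant to show the semantics.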
Redis Persistence
+------------------------------------------------------------------+
| |
| RDB (Redis Database) |
| +----------------------------------------------------------+ |
| | | |
| | Features: | |
| | - Point-in-time snapshots | |
| | - Compact file | |
| | - Faster recovery | |
| | | |
| | Configuration: | |
| | save 900 1 # Save after 900s if >= 1 changes | |
| | save 300 10 # Save after 300s if >= 10 changes | |
| | save 60 10000 # Save after 60s if >= 10000 changes | |
| | | |
| +----------------------------------------------------------+ |
| |
| AOF (Append Only File) |
| +----------------------------------------------------------+ |
| | | |
| | Features: | |
| | - Logs every write operation | |
| | - Higher durability | |
| | - Larger file size | |
| | | |
| | Configuration: | |
| | appendonly yes | |
| | appendfsync everysec # Sync every second | |
| | appendfsync always # Sync every write (slowest) | |
| | | |
| +----------------------------------------------------------+ |
| |
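| Recommended (self-managed Redis): Enable both RDB and AOF |
| On ElastiCache: rely on snapshots and replicas (AOF is not |
| supported on Redis 2.8.22 and later) |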
| |
+------------------------------------------------------------------+

# ============================================================
# ElastiCache Subnet Group
# ============================================================
resource "aws_elasticache_subnet_group" "main" {
  name       = "main-cache-subnet"
  subnet_ids = var.private_subnet_ids

  tags = {
    Name = "main-cache-subnet-group"
  }
}

# ============================================================
# ElastiCache Parameter Group
# ============================================================
resource "aws_elasticache_parameter_group" "redis" {
  name   = "redis-params"
  family = "redis7"

  parameter {
    name  = "maxmemory-policy"
    value = "allkeys-lru"
  }

  parameter {
    name  = "timeout"
    value = "300"
  }

  tags = {
    Name = "redis-parameter-group"
  }
}

# ============================================================
# Redis Cluster Mode Disabled
# ============================================================
resource "aws_elasticache_replication_group" "redis" {
  replication_group_id = "main-redis"
  description          = "Main Redis cluster"

  # Engine
  engine               = "redis"
  engine_version       = "7.0"
  parameter_group_name = aws_elasticache_parameter_group.redis.name

  # Node type
  node_type = "cache.r6g.large"

  # Cluster mode disabled
  num_cache_clusters = 2 # Primary + 1 replica

  # Network
  subnet_group_name  = aws_elasticache_subnet_group.main.name
  security_group_ids = [aws_security_group.redis.id]

  # Availability
  multi_az_enabled           = true
  automatic_failover_enabled = true

  # Encryption
  at_rest_encryption_enabled = true
  transit_encryption_enabled = true
  auth_token                 = var.redis_password

  # Snapshot
  snapshot_retention_limit = 7
  snapshot_window          = "03:00-05:00"

  # Maintenance
  maintenance_window = "Mon:05:00-Mon:07:00"

  tags = {
    Name = "main-redis"
  }
}

# ============================================================
# Redis Cluster Mode Enabled
# ============================================================
resource "aws_elasticache_replication_group" "cluster" {
  replication_group_id = "cluster-redis"
  description          = "Redis cluster mode enabled"

  engine         = "redis"
  engine_version = "7.0"
  node_type      = "cache.r6g.large"

  # Cluster mode needs a parameter group with cluster-enabled = yes,
  # such as this AWS-provided default
  parameter_group_name = "default.redis7.cluster.on"

  # Cluster mode enabled
  num_node_groups         = 3 # 3 shards
  replicas_per_node_group = 1

  # Network
  subnet_group_name  = aws_elasticache_subnet_group.main.name
  security_group_ids = [aws_security_group.redis.id]

  # Availability
  automatic_failover_enabled = true
  multi_az_enabled           = true

  # Encryption
  at_rest_encryption_enabled = true
  transit_encryption_enabled = true
  auth_token                 = var.redis_password

  tags = {
    Name = "cluster-redis"
  }
}

# ============================================================
# Memcached Cluster
# ============================================================
resource "aws_elasticache_cluster" "memcached" {
  cluster_id      = "main-memcached"
  engine          = "memcached"
  engine_version  = "1.6.22"
  node_type       = "cache.r6g.large"
  num_cache_nodes = 3

  # Network
  subnet_group_name  = aws_elasticache_subnet_group.main.name
  security_group_ids = [aws_security_group.memcached.id]

  # Parameter group
  parameter_group_name = "default.memcached1.6"

  # Maintenance
  maintenance_window = "Mon:05:00-Mon:07:00"

  tags = {
    Name = "main-memcached"
  }
}

# ============================================================
# Security Groups
# ============================================================
# aws_security_group.app (the application tier) is assumed to be
# defined elsewhere in the configuration.
resource "aws_security_group" "redis" {
  name        = "redis-sg"
  description = "Security group for Redis"
  vpc_id      = var.vpc_id

  ingress {
    description     = "Redis from application"
    from_port       = 6379
    to_port         = 6379
    protocol        = "tcp"
    security_groups = [aws_security_group.app.id]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }

  tags = {
    Name = "redis-sg"
  }
}

resource "aws_security_group" "memcached" {
  name        = "memcached-sg"
  description = "Security group for Memcached"
  vpc_id      = var.vpc_id

  ingress {
    description     = "Memcached from application"
    from_port       = 11211
    to_port         = 11211
    protocol        = "tcp"
    security_groups = [aws_security_group.app.id]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }

  tags = {
    Name = "memcached-sg"
  }
}

# ============================================================
# Global Datastore (Redis)
# ============================================================
# Primary region
resource "aws_elasticache_replication_group" "global_primary" {
  provider = aws.primary

  replication_group_id = "global-redis"
  description          = "Global Redis"

  engine         = "redis"
  engine_version = "7.0"
  node_type      = "cache.r6g.large"

  num_cache_clusters = 2
  subnet_group_name  = aws_elasticache_subnet_group.main.name
  security_group_ids = [aws_security_group.redis.id]

  automatic_failover_enabled = true

  # The primary does not reference the global replication group:
  # the global group is built from this resource, so a reference
  # back to it would create a dependency cycle.
}

# Global datastore, created from the primary replication group
resource "aws_elasticache_global_replication_group" "main" {
  provider = aws.primary

  global_replication_group_id_suffix = "global"
  primary_replication_group_id       = aws_elasticache_replication_group.global_primary.id
}

# Secondary region
# The subnet group and security group referenced here are assumed
# to be defined in the secondary region.
resource "aws_elasticache_replication_group" "global_secondary" {
  provider = aws.secondary

  replication_group_id = "global-redis-secondary"
  description          = "Global Redis Secondary"

  # Engine, version, and node type come from the global datastore
  global_replication_group_id = aws_elasticache_global_replication_group.main.global_replication_group_id

  subnet_group_name  = aws_elasticache_subnet_group.secondary.name
  security_group_ids = [aws_security_group.redis_secondary.id]
}

ElastiCache Best Practices
+------------------------------------------------------------------+
| |
| 1. Instance Selection |
| +----------------------------------------------------------+ |
| | - Use memory-optimized instances (r6g family) | |
| | - Consider network bandwidth | |
| | - Right-size based on data volume | |
| +----------------------------------------------------------+ |
| |
| 2. Eviction Policy |
| +----------------------------------------------------------+ |
| | - allkeys-lru: Evict least recently used | |
| | - volatile-lru: Evict LRU among keys with TTL | |
| | - allkeys-random: Evict random keys | |
| | - noeviction: Return error when memory full | |
| +----------------------------------------------------------+ |
| |
| 3. Connection Management |
| +----------------------------------------------------------+ |
| | - Use connection pooling | |
| | - Set appropriate timeouts | |
| | - Handle connection failures gracefully | |
| +----------------------------------------------------------+ |
| |
| 4. Security |
| +----------------------------------------------------------+ |
| | - Enable encryption in-transit and at-rest | |
| | - Use AUTH password | |
| | - Deploy in private subnets | |
| | - Use security groups | |
| +----------------------------------------------------------+ |
| |
| 5. Monitoring |
| +----------------------------------------------------------+ |
| | - Monitor CPU, memory, connections | |
| | - Set up CloudWatch alarms | |
| | - Monitor cache hit ratio | |
| +----------------------------------------------------------+ |
| |
+------------------------------------------------------------------+
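
The difference between the eviction policies listed above is easiest to see in a toy model. The sketch below is not how Redis implements eviction: Redis limits memory in bytes (`maxmemory`) and approximates LRU by sampling keys, while this hypothetical `TinyCache` limits the number of keys and tracks exact access order. It covers three of the four policies.

```python
from collections import OrderedDict


class TinyCache:
    """Key-count-limited cache illustrating three maxmemory policies."""

    def __init__(self, max_keys, policy="allkeys-lru"):
        self.max_keys = max_keys
        self.policy = policy
        self._data = OrderedDict()  # key -> (value, has_ttl), least recent first

    def get(self, key):
        if key not in self._data:
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key][0]

    def set(self, key, value, has_ttl=False):
        if key not in self._data and len(self._data) >= self.max_keys:
            self._evict_one()
        self._data[key] = (value, has_ttl)
        self._data.move_to_end(key)

    def _evict_one(self):
        if self.policy == "noeviction":
            raise MemoryError("cache full and policy is noeviction")
        for key, (_, has_ttl) in self._data.items():
            # allkeys-lru takes the oldest key; volatile-lru skips keys without a TTL.
            if self.policy == "allkeys-lru" or has_ttl:
                del self._data[key]
                return
        raise MemoryError("cache full and no key with a TTL to evict")


cache = TinyCache(max_keys=2, policy="allkeys-lru")
cache.set("a", 1)
cache.set("b", 2)
cache.get("a")      # "a" is now more recently used than "b"
cache.set("c", 3)   # evicts "b"
print(cache.get("a"), cache.get("b"), cache.get("c"))
```

With `volatile-lru`, keys written without a TTL are never evicted, so a cache that mixes both kinds of keys can still fill up and reject writes.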

Caching is often one of the most effective performance optimizations for read-heavy applications. SREs use ElastiCache to reduce database load, improve response times, and handle session management. Key operational concerns are eviction rate monitoring, cluster scaling, cache invalidation strategies, and Redis failover testing.


# Install tools
sudo pacman -S aws-cli-v2 jq redis

# === Cluster Status Dashboard ===
#!/bin/bash
# ~/bin/cache-status.sh
echo "=== Redis Clusters ==="
aws elasticache describe-replication-groups \
  --query 'ReplicationGroups[*].{Name:ReplicationGroupId,Status:Status,Shards:NodeGroups|length(@),AutoFailover:AutomaticFailover}' \
  --output table
echo ""
echo "=== Node Health ==="
aws elasticache describe-cache-clusters --show-cache-node-info \
  --query 'CacheClusters[*].{Cluster:CacheClusterId,Engine:Engine,Status:CacheClusterStatus,NodeType:CacheNodeType,Nodes:NumCacheNodes}' \
  --output table

# === Connect to Redis directly (from bastion/within VPC) ===
redis-cli -h my-redis.xxxx.cache.amazonaws.com -p 6379 \
  --tls --askpass

# Useful Redis commands for SRE
# (pass the same -h, -p, --tls, and --askpass options as above)
redis-cli INFO memory       # Memory usage
redis-cli INFO stats        # Keyspace hits and misses (for the hit ratio)
redis-cli INFO replication  # Replication status
redis-cli DBSIZE            # Number of keys
redis-cli SLOWLOG GET 10    # Last 10 slow commands
redis-cli CLIENT LIST       # Connected clients

# === Monitor cache hit ratio ===
# Metrics are published per node, so query by CacheClusterId
# (for example main-redis-001, a member node of main-redis)
aws cloudwatch get-metric-statistics \
  --namespace AWS/ElastiCache \
  --metric-name CacheHitRate \
  --dimensions Name=CacheClusterId,Value=main-redis-001 \
  --start-time "$(date -d '1 hour ago' -u +%Y-%m-%dT%H:%M:%S)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%S)" \
  --period 300 --statistics Average --output table
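
The hit ratio can also be derived by hand from the `keyspace_hits` and `keyspace_misses` counters that `INFO stats` reports, as hits / (hits + misses). The sketch below parses captured `INFO` text; the `parse_info` and `hit_ratio` helpers and the sample numbers are made up for illustration.

```python
def parse_info(text: str) -> dict:
    """Parse INFO output ("name:value" lines, "# Section" headers) into a dict."""
    fields = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        name, _, value = line.partition(":")
        fields[name] = value
    return fields


def hit_ratio(info_text: str):
    """Return hits / (hits + misses), or None when there were no lookups."""
    fields = parse_info(info_text)
    hits = int(fields.get("keyspace_hits", 0))
    misses = int(fields.get("keyspace_misses", 0))
    total = hits + misses
    return hits / total if total else None


# Sample text in the shape of `redis-cli INFO stats` output (values invented).
SAMPLE_INFO = "# Stats\r\nkeyspace_hits:900\r\nkeyspace_misses:100\r\nevicted_keys:12\r\n"
ratio = hit_ratio(SAMPLE_INFO)
print(f"hit ratio: {ratio:.2%}")
```

These counters are cumulative since the server started, so for a recent-window ratio, take two samples and compute the ratio from the differences.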

Issue                      | Cause                               | Solution
---------------------------|-------------------------------------|----------------------------------------------------------
High eviction rate         | Maxmemory reached                   | Scale up node type or enable cluster mode for more shards
Low cache hit ratio        | Wrong caching strategy or short TTL | Analyze access patterns, increase TTL, pre-warm cache
Connection refused         | Security group or AUTH token        | Verify SG allows port 6379/11211, check auth token
Replication lag (Redis)    | Large write volume                  | Monitor ReplicationLag, consider scaling up node type
Failover causes app errors | App not handling reconnects         | Implement retry logic, use cluster endpoints
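
The retry logic mentioned in the last row can be sketched as exponential backoff with jitter around any cache call. This is a generic pattern, not an ElastiCache API: `call_with_retry` is a hypothetical helper, the built-in `ConnectionError` stands in for whatever exception your Redis client raises on a dropped connection, and the delay values are arbitrary defaults.

```python
import random
import time


def call_with_retry(operation, attempts=5, base_delay=0.1, max_delay=2.0,
                    sleep=time.sleep):
    """Run operation(), retrying on ConnectionError with capped backoff."""
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Double the ceiling each attempt, then pick a random delay below
            # it so many clients do not reconnect at the same instant.
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, ceiling))


# Simulated failover: the first two calls fail, the third succeeds.
flaky_calls = {"count": 0}


def flaky_ping():
    flaky_calls["count"] += 1
    if flaky_calls["count"] < 3:
        raise ConnectionError("primary unavailable during failover")
    return "PONG"


delays = []
result = call_with_retry(flaky_ping, sleep=delays.append)  # record, don't sleep
print(result, flaky_calls["count"], len(delays))
```

Retries only help if the client reconnects to the new primary, which is why the table pairs them with using the replication group or cluster endpoint rather than an individual node address.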

  1. Q: Lazy loading vs write-through — when to use each?

    • A: Lazy loading: data is cached only on read miss — good when most data is rarely read (avoids caching unused data). Write-through: cache is updated on every write — good when you can’t tolerate stale reads. Best practice: combine both — write-through for critical data (user sessions), lazy loading with TTL for less critical data (product catalog). Add TTL to both strategies to prevent indefinite staleness.
  2. Q: How do you handle a Redis failover with zero data loss?

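    • A: Strictly zero data loss cannot be guaranteed, because Redis replicates asynchronously: writes acknowledged by the primary but not yet replicated (typically <1s of writes) can be lost during failover. To minimize the window: (1) Enable Multi-AZ with automatic failover, (2) Monitor ReplicationLag, which should stay near zero, (3) For critical data, also persist to a durable store (DynamoDB/RDS), (4) On self-managed Redis, use AOF with appendfsync everysec for the best durability/performance balance; ElastiCache does not support AOF on Redis 2.8.22 and later, so rely on replicas and snapshots there.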

Exam Tip

  1. Redis vs Memcached: Redis has persistence, replication, data structures
  2. Cluster Mode Disabled: Single shard, up to 5 replicas
  3. Cluster Mode Enabled: Up to 500 shards, data partitioned
  4. Lazy Loading: Cache on read, stale data possible
  5. Write-Through: Write to cache and DB, always fresh
  6. Eviction Policies: LRU, volatile-LRU, noeviction
  7. Persistence: RDB (snapshots), AOF (append-only)
  8. Multi-AZ: Automatic failover for Redis
  9. Encryption: At-rest and in-transit for Redis
  10. Global Datastore: Cross-region replication for Redis


Last Updated: March 2026