Health

Maintaining node health and uptime is critical for blockchain infrastructure. Health checks ensure your node is functioning correctly, responding to requests, and staying synchronized with the network. This chapter covers comprehensive health monitoring strategies, including built-in endpoints, Prometheus metrics, alerting rules, and automated recovery procedures.


# Method 1: Using eth_blockNumber (basic)
curl -X POST http://localhost:8545 \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}'

# Response:
# {"jsonrpc":"2.0","id":1,"result":"0x10e7a2f"}

# Method 2: Using web3_clientVersion
curl -X POST http://localhost:8545 \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","method":"web3_clientVersion","params":[],"id":1}'

# Method 3: Check syncing status
curl -X POST http://localhost:8545 \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","method":"eth_syncing","params":[],"id":1}'
# Returns false if synced, or a sync progress object
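Block numbers come back hex-encoded. A quick way to read them is a shell conversion to decimal, shown here with the example response from above:

```shell
# eth_blockNumber returns a hex string; printf converts it to decimal.
hex="0x10e7a2f"       # example value from the response above
printf '%d\n' "$hex"  # → 17725999
```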
# Prometheus metrics (must be enabled)
# Start geth with --metrics --metrics.addr 0.0.0.0 --metrics.port 6060
# Access metrics
curl http://localhost:6060/metrics
# Common metrics:
# - go_goroutines (number of goroutines)
# - go_memstats_alloc_bytes (memory usage)
# - eth_block_header (current block)
# - eth_sync_known_states (sync progress)
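The full /metrics dump is large; during debugging, a grep spot-check of the series listed above is usually enough (assumes metrics are enabled on port 6060 as shown):

```shell
# Pull only the metrics of interest from the Prometheus text output.
curl -s http://localhost:6060/metrics \
  | grep -E '^(go_goroutines|go_memstats_alloc_bytes|p2p_peers)'
```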
check_health.sh
#!/bin/bash
RPC_URL="http://localhost:8545"
TIMEOUT=5

# Check RPC responsiveness
response=$(curl -s -m $TIMEOUT -w "%{http_code}" -o /dev/null $RPC_URL \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}')

if [ "$response" = "200" ]; then
  # Check if synced (requires jq)
  sync_status=$(curl -s -m $TIMEOUT $RPC_URL \
    -H "Content-Type: application/json" \
    -d '{"jsonrpc":"2.0","method":"eth_syncing","params":[],"id":1}' \
    | jq -r '.result')
  if [ "$sync_status" = "false" ]; then
    echo "OK: Node is healthy and synced"
    exit 0
  else
    echo "WARNING: Node is syncing"
    exit 1
  fi
else
  echo "CRITICAL: Node not responding"
  exit 2
fi
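To run check_health.sh on a schedule, a cron entry is the simplest option (the log path here is illustrative):

```shell
# Append a health check to the current user's crontab, running every 5 minutes.
( crontab -l 2>/dev/null; \
  echo '*/5 * * * * /usr/local/bin/check_health.sh >> /var/log/node_health.log 2>&1' ) \
  | crontab -
```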

/etc/prometheus/prometheus.yml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'geth'
    static_configs:
      - targets: ['localhost:6060']
    metrics_path: /metrics

  - job_name: 'node_exporter'
    static_configs:
      - targets: ['localhost:9100']
| Metric | Description | Alert Threshold |
|---|---|---|
| eth_block_header | Current block number | < network block - 20 |
| eth_syncing | Sync progress | Always true when syncing |
| go_goroutines | Goroutine count | > 10000 |
| go_memstats_alloc_bytes | Memory allocation | > 80% available |
| p2p_peers | Connected peers | < 5 |
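Once Prometheus is scraping these series, a threshold like the block lag can be checked ad hoc through its HTTP API (assumes Prometheus on its default port 9090; eth_network_block is the network-head series used elsewhere in this chapter):

```shell
# Query the block lag (network head minus local head) from Prometheus.
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=eth_network_block - eth_block_header' \
  | jq '.data.result[0].value[1]'
```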
{
  "dashboard": {
    "title": "Ethereum Node Health",
    "panels": [
      {
        "title": "Block Height",
        "type": "graph",
        "targets": [
          { "expr": "eth_block_header", "legendFormat": "Current" },
          { "expr": "eth_network_block", "legendFormat": "Network" }
        ]
      },
      {
        "title": "Peer Count",
        "type": "graph",
        "targets": [
          { "expr": "p2p_peers", "legendFormat": "Peers" }
        ]
      },
      {
        "title": "Memory Usage",
        "type": "graph",
        "targets": [
          { "expr": "go_memstats_alloc_bytes", "legendFormat": "Allocated" }
        ]
      }
    ]
  }
}
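A dashboard JSON like the one above can also be imported through Grafana's HTTP API instead of the UI. A sketch, assuming Grafana on localhost:3000, a service-account token in $GRAFANA_TOKEN, and the JSON saved as dashboard.json:

```shell
# Import the dashboard via Grafana's /api/dashboards/db endpoint.
curl -s -X POST http://localhost:3000/api/dashboards/db \
  -H "Authorization: Bearer $GRAFANA_TOKEN" \
  -H "Content-Type: application/json" \
  -d @dashboard.json
```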

/etc/prometheus/alerts.yml
groups:
  - name: ethereum
    rules:
      # Node is down
      - alert: NodeDown
        expr: up{job="geth"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Ethereum node is down"
          description: "Node {{ $labels.instance }} has been down for more than 1 minute"

      # Node not synced
      - alert: NodeNotSynced
        expr: (eth_network_block - eth_block_header) > 20
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Node is behind network"
          description: "Node is {{ $value }} blocks behind network"

      # Too few peers
      - alert: FewPeers
        expr: p2p_peers < 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low peer count"
          description: "Node has only {{ $value }} peers"

      # High memory usage (assumes a 16 GB host; adjust the constant)
      - alert: HighMemoryUsage
        expr: (go_memstats_alloc_bytes / (16 * 1024^3)) > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage"
          description: "Memory usage above 90%"

      # Goroutine leak
      - alert: HighGoroutines
        expr: go_goroutines > 10000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High goroutine count"
          description: "Node has {{ $value }} goroutines"

      # Sync stalled (delta, not rate: eth_block_header is a gauge)
      - alert: SyncStalled
        expr: delta(eth_block_header[10m]) == 0
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: "Sync has stalled"
          description: "No new blocks in 15 minutes"
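Prometheus ships a promtool binary that validates rule files; it is worth running after every edit, before reloading:

```shell
# Validate the alerting rules; exits non-zero on syntax errors.
promtool check rules /etc/prometheus/alerts.yml

# Reload Prometheus without a restart (requires --web.enable-lifecycle).
curl -X POST http://localhost:9090/-/reload
```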
/etc/alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'critical-alerts'
      continue: true
    - match:
        severity: warning
      receiver: 'warning-alerts'

receivers:
  - name: 'default'
    webhook_configs:
      - url: 'http://localhost:5001/webhook'
  - name: 'critical-alerts'
    webhook_configs:
      - url: 'http://localhost:5001/critical'
    pagerduty_configs:
      - service_key: 'YOUR_PAGERDUTY_KEY'
  - name: 'warning-alerts'
    webhook_configs:
      - url: 'http://localhost:5001/warning'
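amtool (shipped with Alertmanager) validates the configuration, and posting a synthetic alert to the v2 API is a quick end-to-end test of the routing tree (assumes Alertmanager on its default port 9093):

```shell
# Validate the Alertmanager configuration.
amtool check-config /etc/alertmanager/alertmanager.yml

# Fire a test alert to exercise the 'warning-alerts' route.
curl -s -X POST http://localhost:9093/api/v2/alerts \
  -H "Content-Type: application/json" \
  -d '[{"labels":{"alertname":"TestAlert","severity":"warning"}}]'
```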

| Service | Features | Cost |
|---|---|---|
| UptimeRobot | HTTP checks, 5 min interval | Free tier available |
| Pingdom | Advanced checks, SMS alerts | Paid |
| Datadog | Full monitoring suite | Paid |
| Grafana Cloud | Metrics + logs + alerts | Paid |
| Custom | Prometheus + AlertManager | Self-hosted |
/usr/local/bin/health_check
#!/bin/bash
# Health check for external monitors: exit 0 when the RPC answers with HTTP 200.
response=$(curl -s -o /dev/null -w "%{http_code}" \
  -X POST http://localhost:8545 \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}')

if [ "$response" = "200" ]; then
  exit 0
else
  exit 1
fi
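External monitors from the table above poll HTTP endpoints, not exit codes. One lightweight sketch (assuming socat is installed; port 8080 is arbitrary) maps the script's exit code to 200/503 responses:

```shell
# Serve the health check over HTTP: 200 when healthy, 503 otherwise.
socat TCP-LISTEN:8080,reuseaddr,fork \
  SYSTEM:'if /usr/local/bin/health_check; then printf "HTTP/1.1 200 OK\r\n\r\nOK"; else printf "HTTP/1.1 503 Service Unavailable\r\n\r\nFAIL"; fi'
```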
docker-compose.yml
services:
  geth:
    image: ethereum/client-go:latest
    healthcheck:
      # Note: the official geth image is minimal; make sure curl (or an
      # equivalent) is actually available inside the container.
      test: ["CMD", "curl", "-f", "-X", "POST", "http://localhost:8545",
             "-H", "Content-Type: application/json",
             "-d", '{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}']
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 60s
    deploy:
      restart_policy:
        condition: on-failure
        delay: 5s
        max_attempts: 3
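Docker records the healthcheck result per container, and it can be read back with docker inspect (the container name is assumed to be geth here; Compose may prefix it with the project name):

```shell
# Current health state: starting, healthy, or unhealthy.
docker inspect --format '{{.State.Health.Status}}' geth

# Recent probe results, including the output of failed checks.
docker inspect --format '{{json .State.Health.Log}}' geth | jq .
```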

/etc/systemd/system/geth.service
[Unit]
Description=Ethereum Geth Node
After=network.target

[Service]
Type=simple
User=ethereum
ExecStart=/usr/local/bin/geth --config /etc/geth/config.toml
Restart=always
RestartSec=10
# Restart backoff (requires systemd 254+)
RestartSteps=10
RestartMaxDelaySec=60

# Memory limits
MemoryMax=32G
MemoryHigh=24G

# Watchdog: only enable if the service sends sd_notify keep-alive pings;
# geth does not, so leave this commented out.
#WatchdogSec=300

[Install]
WantedBy=multi-user.target
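After writing or editing the unit file, reload systemd and enable the service:

```shell
sudo systemctl daemon-reload
sudo systemctl enable --now geth

# Confirm the restart and memory settings took effect.
systemctl show geth -p Restart -p RestartSec -p MemoryMax
```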
auto_restart.sh
#!/bin/bash
LOG_FILE="/var/log/node_restart.log"

# Check if node is healthy
if ! curl -s -f -X POST http://localhost:8545 \
    -H "Content-Type: application/json" \
    -d '{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}' \
    > /dev/null 2>&1; then
  echo "$(date): Node unhealthy, attempting restart..." >> "$LOG_FILE"

  # Try to restart
  systemctl restart geth

  # Wait for restart
  sleep 30

  # Check again
  if curl -s -f -X POST http://localhost:8545 \
      -H "Content-Type: application/json" \
      -d '{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}' \
      > /dev/null 2>&1; then
    echo "$(date): Node recovered" >> "$LOG_FILE"
  else
    echo "$(date): Node failed to restart, alerting..." >> "$LOG_FILE"
    # Send alert (configure your alerting system)
    curl -X POST https://hooks.example.com/alert \
      -d '{"message":"Node failed to restart"}'
  fi
fi
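The script needs to run on a schedule. A systemd timer is one option (the unit names below are illustrative):

```ini
# /etc/systemd/system/auto-restart.service
[Unit]
Description=Node auto-restart check

[Service]
Type=oneshot
ExecStart=/usr/local/bin/auto_restart.sh

# /etc/systemd/system/auto-restart.timer
[Unit]
Description=Run node auto-restart check every 5 minutes

[Timer]
OnCalendar=*:0/5
Persistent=true

[Install]
WantedBy=timers.target
```

Enable it with systemctl enable --now auto-restart.timer.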

┌─────────────────────────────────────────────────────────────┐
│                    NODE HEALTH DASHBOARD                    │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌──────────────────┐  ┌──────────────────┐                 │
│  │ Block Height     │  │ Sync Status      │                 │
│  │ ████████████░░   │  │ ✓ Synced         │                 │
│  │ 18,500,000       │  │ 5 min ago        │                 │
│  └──────────────────┘  └──────────────────┘                 │
│                                                             │
│  ┌──────────────────┐  ┌──────────────────┐                 │
│  │ Peer Count       │  │ Memory Usage     │                 │
│  │ ██████████░░░░   │  │ ████████░░░░░░   │                 │
│  │ 25 peers         │  │ 8.2 GB / 16 GB   │                 │
│  └──────────────────┘  └──────────────────┘                 │
│                                                             │
│  ┌──────────────────┐  ┌──────────────────┐                 │
│  │ CPU Usage        │  │ Disk I/O         │                 │
│  │ ██████░░░░░░░    │  │ ████░░░░░░░░░░   │                 │
│  │ 45%              │  │ 45 MB/s          │                 │
│  └──────────────────┘  └──────────────────┘                 │
│                                                             │
│  Recent Alerts:                                             │
│  - ✓ No recent alerts                                       │
│                                                             │
│  Uptime: 99.98% (last 30 days)                              │
└─────────────────────────────────────────────────────────────┘

  • Enable metrics endpoint (--metrics)
  • Configure Prometheus to scrape metrics
  • Set up Grafana dashboard
  • Configure alert rules for all critical metrics
  • Set up external uptime monitoring
  • Configure auto-restart on failure
  • Test recovery procedures
  • Document runbooks for each alert
  • Review logs daily
  • Schedule regular health checks
# Alert: NodeDown
## Symptoms
- Prometheus reports `up{job="geth"} == 0`
- External monitoring shows node unreachable
## Impact
- No RPC service available
- Potential missed attestations (validator)
## Diagnosis
1. Check if process is running: `systemctl status geth`
2. Check logs: `journalctl -u geth -n 100`
3. Check system resources: `top`, `df -h`
## Resolution
1. If process crashed: `systemctl restart geth`
2. If OOM: Increase memory or reduce cache
3. If disk full: Clear old data or expand storage
4. If port conflict: Check for other processes
## Prevention
- Monitor memory and disk usage
- Set appropriate resource limits

  • Health checks are essential for production nodes
  • Use Prometheus + Grafana for comprehensive monitoring
  • Configure alerts for all critical metrics
  • Set up external monitoring for redundancy
  • Implement auto-restart for common failures
  • Create runbooks for each alert type
  • Test recovery procedures regularly
  • Maintain 99%+ uptime with proper monitoring

In Chapter 41: Node Software Upgrades, we’ll cover upgrade procedures.


Last Updated: 2026-02-22