# Chapter 40: Health Checks & Uptime
## Overview

Maintaining node health and uptime is critical for blockchain infrastructure. Health checks ensure your node is functioning correctly, responding to requests, and staying synchronized with the network. This chapter covers comprehensive health monitoring strategies, including built-in endpoints, Prometheus metrics, alerting rules, and automated recovery procedures.
## 40.1 Health Check Endpoints

### Geth Health Endpoints
Section titled “Geth Health Endpoints”# Method 1: Using eth_blockNumber (basic)curl -X POST http://localhost:8545 \ -H "Content-Type: application/json" \ -d '{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}'
# Response:# {"jsonrpc":"2.0","id":1,"result":"0x10e7a2f"}
# Method 2: Using web3_clientVersioncurl -X POST http://localhost:8545 \ -H "Content-Type: application/json" \ -d '{"jsonrpc":"2.0","method":"web3_clientVersion","params":[],"id":1}'
# Method 3: Check syncing statuscurl -X POST http://localhost:8545 \ -H "Content-Type: application/json" \ -d '{"jsonrpc":"2.0","method":"eth_syncing","params":[],"id":1}'
# Returns false if synced, or sync progress objectGeth Metrics Endpoint
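The `eth_blockNumber` result is a hex-encoded quantity. A minimal sketch of decoding it in plain bash, with the sample response hard-coded for illustration (in practice it comes from the curl call above):

```shell
#!/bin/bash
# Sample eth_blockNumber response (hard-coded for illustration)
resp='{"jsonrpc":"2.0","id":1,"result":"0x10e7a2f"}'

# Extract the quoted result with parameter expansion; fine for this
# fixed shape, prefer jq for real parsing
hex=${resp#*\"result\":\"}
hex=${hex%\"*}

# Shell arithmetic understands the 0x prefix
echo "Block number: $((hex))"
```

For this sample response the script prints `Block number: 17725999`.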
### Geth Metrics Endpoint

```bash
# Prometheus metrics (must be enabled)
# Start geth with: --metrics --metrics.addr 0.0.0.0 --metrics.port 6060

# Access metrics (geth serves Prometheus-format metrics under
# /debug/metrics/prometheus on the metrics port)
curl http://localhost:6060/debug/metrics/prometheus

# Common metrics:
# - go_goroutines (number of goroutines)
# - go_memstats_alloc_bytes (memory usage)
# - eth_block_header (current block)
# - eth_sync_known_states (sync progress)
```
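Individual series can be pulled out of the scrape text with standard tools. A small sketch using a hard-coded sample of the exposition format; in practice, pipe the real `curl` output into the same filter:

```shell
#!/bin/bash
# Hypothetical scrape output; pipe the metrics endpoint's real
# response into the same awk filter
metrics='# HELP p2p_peers Number of connected peers
# TYPE p2p_peers gauge
p2p_peers 25
go_goroutines 412'

# Grab the value of the p2p_peers series (skips HELP/TYPE comments)
peers=$(echo "$metrics" | awk '/^p2p_peers /{print $2}')
echo "Peers: $peers"
```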
### Health Check Script

```bash
#!/bin/bash
RPC_URL="http://localhost:8545"
TIMEOUT=5

# Check RPC responsiveness
response=$(curl -s -m $TIMEOUT -w "%{http_code}" -o /dev/null $RPC_URL \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}')

if [ "$response" = "200" ]; then
  # Check if synced
  sync_status=$(curl -s -m $TIMEOUT $RPC_URL \
    -H "Content-Type: application/json" \
    -d '{"jsonrpc":"2.0","method":"eth_syncing","params":[],"id":1}' \
    | jq -r '.result')

  if [ "$sync_status" = "false" ]; then
    echo "OK: Node is healthy and synced"
    exit 0
  else
    echo "WARNING: Node is syncing"
    exit 1
  fi
else
  echo "CRITICAL: Node not responding"
  exit 2
fi
```
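The script above treats any syncing node as a warning. To report progress instead, note that while syncing, `eth_syncing` returns an object with hex-encoded `currentBlock` and `highestBlock` fields. A sketch of turning those into a percentage with integer math, values hard-coded for illustration:

```shell
#!/bin/bash
# Hypothetical currentBlock/highestBlock values from eth_syncing
current=0x10e0000
highest=0x10e7a2f

# Integer percentage; shell arithmetic handles the 0x prefix
pct=$(( current * 100 / highest ))
echo "Sync progress: ${pct}%"
```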
## 40.2 Monitoring Tools

### Prometheus Configuration

```yaml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'geth'
    static_configs:
      - targets: ['localhost:6060']
    metrics_path: /debug/metrics/prometheus

  - job_name: 'node_exporter'
    static_configs:
      - targets: ['localhost:9100']
```
### Key Metrics to Monitor

| Metric | Description | Alert Threshold |
|---|---|---|
| `eth_block_header` | Current block number | More than 20 blocks behind network |
| `eth_syncing` | Sync progress | Still true long after initial sync |
| `go_goroutines` | Goroutine count | > 10,000 |
| `go_memstats_alloc_bytes` | Memory allocation | > 80% of available |
| `p2p_peers` | Connected peers | < 5 |
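The "blocks behind" threshold from the table can also be checked directly in a script. A minimal sketch with hard-coded example heights; in practice, query your node and a trusted reference endpoint for the two values:

```shell
#!/bin/bash
# Example heights; replace with live queries in practice
local_block=18500000
network_block=18500035
max_lag=20

# Alert when the node trails the network head by more than max_lag
lag=$((network_block - local_block))
if [ "$lag" -gt "$max_lag" ]; then
  echo "ALERT: node is $lag blocks behind"
else
  echo "OK: lag is $lag blocks"
fi
```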
### Grafana Dashboard JSON

```json
{
  "dashboard": {
    "title": "Ethereum Node Health",
    "panels": [
      {
        "title": "Block Height",
        "type": "graph",
        "targets": [
          { "expr": "eth_block_header", "legendFormat": "Current" },
          { "expr": "eth_network_block", "legendFormat": "Network" }
        ]
      },
      {
        "title": "Peer Count",
        "type": "graph",
        "targets": [
          { "expr": "p2p_peers", "legendFormat": "Peers" }
        ]
      },
      {
        "title": "Memory Usage",
        "type": "graph",
        "targets": [
          { "expr": "go_memstats_alloc_bytes", "legendFormat": "Allocated" }
        ]
      }
    ]
  }
}
```
## 40.3 Alert Examples

### Prometheus Alert Rules

```yaml
groups:
  - name: ethereum
    rules:
      # Node is down
      - alert: NodeDown
        expr: up{job="geth"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Ethereum node is down"
          description: "Node {{ $labels.instance }} has been down for more than 1 minute"

      # Node not synced
      - alert: NodeNotSynced
        expr: (eth_network_block - eth_block_header) > 20
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Node is behind network"
          description: "Node is {{ $value }} blocks behind network"

      # Too few peers
      - alert: FewPeers
        expr: p2p_peers < 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low peer count"
          description: "Node has only {{ $value }} peers"

      # High memory usage (16e9 = 16 GB; adjust to your host's RAM)
      - alert: HighMemoryUsage
        expr: (go_memstats_alloc_bytes / 16e9) > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage"
          description: "Memory usage above 90%"

      # Goroutine leak
      - alert: HighGoroutines
        expr: go_goroutines > 10000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High goroutine count"
          description: "Node has {{ $value }} goroutines"

      # Sync stalled
      - alert: SyncStalled
        expr: rate(eth_block_header[10m]) == 0
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: "Sync has stalled"
          description: "No new blocks in 15 minutes"
```
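The lag expression used by `NodeNotSynced` can also be captured as a recording rule, so dashboards and alerts share one definition. A sketch using the same metric names as above (note that `eth_network_block` assumes an exporter that tracks the network head; the rule name `eth:blocks_behind` is an example):

```yaml
groups:
  - name: ethereum-recording
    rules:
      - record: eth:blocks_behind
        expr: eth_network_block - eth_block_header
```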
### AlertManager Configuration

```yaml
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'critical-alerts'
      continue: true
    - match:
        severity: warning
      receiver: 'warning-alerts'

receivers:
  - name: 'default'
    webhook_configs:
      - url: 'http://localhost:5001/webhook'

  - name: 'critical-alerts'
    webhook_configs:
      - url: 'http://localhost:5001/critical'
    pagerduty_configs:
      - service_key: 'YOUR_PAGERDUTY_KEY'

  - name: 'warning-alerts'
    webhook_configs:
      - url: 'http://localhost:5001/warning'
```
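To avoid paging twice for the same incident, AlertManager can suppress warning-level notifications while a matching critical alert is firing. A sketch using the severities defined above:

```yaml
inhibit_rules:
  - source_match:
      severity: critical
    target_match:
      severity: warning
    equal: ['alertname', 'instance']
```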
## 40.4 Uptime Monitoring Services

### External Monitoring Options

| Service | Features | Cost |
|---|---|---|
| UptimeRobot | HTTP checks, 5 min interval | Free tier available |
| Pingdom | Advanced checks, SMS alerts | Paid |
| Datadog | Full monitoring suite | Paid |
| Grafana Cloud | Metrics + logs + alerts | Paid |
| Custom | Prometheus + AlertManager | Self-hosted |
### Setting Up UptimeRobot

```bash
#!/bin/bash
# Health check for the monitor to call: exit 0 when the RPC
# endpoint answers with HTTP 200, non-zero otherwise
response=$(curl -s -o /dev/null -w "%{http_code}" \
  -X POST http://localhost:8545 \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}')

if [ "$response" = "200" ]; then
  exit 0
else
  exit 1
fi
```
### Container Health Checks

```yaml
services:
  geth:
    image: ethereum/client-go:latest
    healthcheck:
      # NOTE: requires curl inside the image; install it or use a wrapper image
      test: ["CMD", "curl", "-f", "-X", "POST", "http://localhost:8545",
             "-H", "Content-Type: application/json",
             "-d", '{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}']
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 60s
    deploy:
      restart_policy:
        condition: on-failure
        delay: 5s
        max_attempts: 3
```
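Other containers can wait on this health check before starting. A sketch with a hypothetical dependent service (`indexer` and its image are placeholder names):

```yaml
services:
  indexer:
    image: example/indexer:latest   # placeholder image
    depends_on:
      geth:
        condition: service_healthy
```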
## 40.5 Automated Recovery

### Systemd Auto-Restart
```ini
[Unit]
Description=Ethereum Geth Node
After=network.target

[Service]
Type=simple
User=ethereum
ExecStart=/usr/local/bin/geth --config /etc/geth/config.toml
Restart=always
RestartSec=10
RestartSteps=10
RestartMaxDelaySec=60

# Memory limits
MemoryMax=32G
MemoryHigh=24G

# Watchdog (requires sd_notify support in the service)
WatchdogSec=300

[Install]
WantedBy=multi-user.target
```
### Auto-Restart Script

```bash
#!/bin/bash
LOG_FILE="/var/log/node_restart.log"

# Check if node is healthy
if ! curl -s -f -X POST http://localhost:8545 \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}' \
  > /dev/null 2>&1; then

  echo "$(date): Node unhealthy, attempting restart..." >> "$LOG_FILE"

  # Try to restart
  systemctl restart geth

  # Wait for restart
  sleep 30

  # Check again
  if curl -s -f -X POST http://localhost:8545 \
    -H "Content-Type: application/json" \
    -d '{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}' \
    > /dev/null 2>&1; then
    echo "$(date): Node recovered" >> "$LOG_FILE"
  else
    echo "$(date): Node failed to restart, alerting..." >> "$LOG_FILE"
    # Send alert (configure your alerting system)
    curl -X POST https://hooks.example.com/alert \
      -d '{"message":"Node failed to restart"}'
  fi
fi
```
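An unconditional restart loop can mask a persistent failure such as a corrupt database. One way to cap it is a timestamp file that limits restarts per time window; a sketch with example paths and limits (`STATE`, `MAX`, and `WINDOW` are illustrative):

```shell
#!/bin/bash
# Guard: allow at most MAX restarts within WINDOW seconds.
STATE=/tmp/geth_restarts
MAX=3
WINDOW=600

now=$(date +%s)
touch "$STATE"

# Drop timestamps older than the window, then count the rest
awk -v now="$now" -v w="$WINDOW" 'now - $1 < w' "$STATE" > "$STATE.tmp"
mv "$STATE.tmp" "$STATE"
count=$(wc -l < "$STATE")

if [ "$count" -ge "$MAX" ]; then
  echo "Too many restarts in the last $WINDOW seconds; escalating"
  exit 1
fi

echo "$now" >> "$STATE"
echo "Restart allowed ($((count + 1)) of $MAX in window)"
```

Run this guard before `systemctl restart geth` in the script above; a non-zero exit means stop restarting and alert a human instead.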
## 40.6 Comprehensive Health Dashboard

### Dashboard Components
```text
NODE HEALTH DASHBOARD

  Block Height: 18,500,000     Sync Status:  ✓ Synced (5 min ago)
  Peer Count:   25 peers       Memory Usage: 8.2 GB / 16 GB
  CPU Usage:    45%            Disk I/O:     45 MB/s

  Recent Alerts: ✓ No recent alerts
  Uptime: 99.98% (last 30 days)
```
## 40.7 Best Practices

### Checklist

- Enable metrics endpoint (`--metrics`)
- Configure Prometheus to scrape metrics
- Set up Grafana dashboard
- Configure alert rules for all critical metrics
- Set up external uptime monitoring
- Configure auto-restart on failure
- Test recovery procedures
- Document runbooks for each alert
- Review logs daily
- Schedule regular health checks
### Runbook Template

```markdown
# Alert: NodeDown

## Symptoms
- Prometheus reports `up{job="geth"} == 0`
- External monitoring shows node unreachable

## Impact
- No RPC service available
- Potential missed attestations (validator)

## Diagnosis
1. Check if process is running: `systemctl status geth`
2. Check logs: `journalctl -u geth -n 100`
3. Check system resources: `top`, `df -h`

## Resolution
1. If process crashed: `systemctl restart geth`
2. If OOM: increase memory or reduce cache
3. If disk full: clear old data or expand storage
4. If port conflict: check for other processes

## Prevention
- Monitor memory and disk usage
- Set appropriate resource limits
```
## Summary

- Health checks are essential for production nodes
- Use Prometheus + Grafana for comprehensive monitoring
- Configure alerts for all critical metrics
- Set up external monitoring for redundancy
- Implement auto-restart for common failures
- Create runbooks for each alert type
- Test recovery procedures regularly
- Maintain 99%+ uptime with proper monitoring
## Next Chapter

In Chapter 41: Node Software Upgrades, we’ll cover upgrade procedures.
Last Updated: 2026-02-22