# Chapter 40: Health Checks & Uptime
## Overview

Maintaining node health and uptime is critical for blockchain infrastructure. Health checks ensure your node is functioning correctly, responding to requests, and staying synchronized with the network. This chapter covers comprehensive health monitoring strategies, including built-in endpoints, Prometheus metrics, alerting rules, and automated recovery procedures.
## 40.1 Health Check Endpoints

### Geth Health Endpoints
Section titled “Geth Health Endpoints”# Method 1: Using eth_blockNumber (basic)curl -X POST http://localhost:8545 \ -H "Content-Type: application/json" \ -d '{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}'
# Response:# {"jsonrpc":"2.0","id":1,"result":"0x10e7a2f"}
# Method 2: Using web3_clientVersioncurl -X POST http://localhost:8545 \ -H "Content-Type: application/json" \ -d '{"jsonrpc":"2.0","method":"web3_clientVersion","params":[],"id":1}'
# Method 3: Check syncing statuscurl -X POST http://localhost:8545 \ -H "Content-Type: application/json" \ -d '{"jsonrpc":"2.0","method":"eth_syncing","params":[],"id":1}'
# Returns false if synced, or sync progress objectGeth Metrics Endpoint
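The `eth_blockNumber` result is a hex-encoded quantity. A minimal sketch of decoding it in plain bash, with the sample response hard-coded for illustration (in practice it comes from the curl call above):

```shell
#!/bin/bash
# Sample eth_blockNumber response (hard-coded for illustration)
resp='{"jsonrpc":"2.0","id":1,"result":"0x10e7a2f"}'

# Extract the quoted result with parameter expansion; fine for this
# fixed shape, prefer jq for real parsing
hex=${resp#*\"result\":\"}
hex=${hex%\"*}

# Shell arithmetic understands the 0x prefix
echo "Block number: $((hex))"
```

For this sample response the script prints `Block number: 17725999`.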
### Geth Metrics Endpoint

```bash
# Prometheus metrics (must be enabled)
# Start geth with: --metrics --metrics.addr 0.0.0.0 --metrics.port 6060

# Access metrics (geth serves Prometheus-format metrics under
# /debug/metrics/prometheus on the metrics port)
curl http://localhost:6060/debug/metrics/prometheus

# Common metrics:
# - go_goroutines (number of goroutines)
# - go_memstats_alloc_bytes (memory usage)
# - eth_block_header (current block)
# - eth_sync_known_states (sync progress)
```
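Individual series can be pulled out of the scrape text with standard tools. A small sketch using a hard-coded sample of the exposition format; in practice, pipe the real `curl` output into the same filter:

```shell
#!/bin/bash
# Hypothetical scrape output; pipe the metrics endpoint's real
# response into the same awk filter
metrics='# HELP p2p_peers Number of connected peers
# TYPE p2p_peers gauge
p2p_peers 25
go_goroutines 412'

# Grab the value of the p2p_peers series (skips HELP/TYPE comments)
peers=$(echo "$metrics" | awk '/^p2p_peers /{print $2}')
echo "Peers: $peers"
```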
### Health Check Script

```bash
#!/bin/bash
RPC_URL="http://localhost:8545"
TIMEOUT=5

# Check RPC responsiveness
response=$(curl -s -m $TIMEOUT -w "%{http_code}" -o /dev/null $RPC_URL \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}')

if [ "$response" = "200" ]; then
  # Check if synced
  sync_status=$(curl -s -m $TIMEOUT $RPC_URL \
    -H "Content-Type: application/json" \
    -d '{"jsonrpc":"2.0","method":"eth_syncing","params":[],"id":1}' \
    | jq -r '.result')

  if [ "$sync_status" = "false" ]; then
    echo "OK: Node is healthy and synced"
    exit 0
  else
    echo "WARNING: Node is syncing"
    exit 1
  fi
else
  echo "CRITICAL: Node not responding"
  exit 2
fi
```
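The script above treats any syncing node as a warning. To report progress instead, note that while syncing, `eth_syncing` returns an object with hex-encoded `currentBlock` and `highestBlock` fields. A sketch of turning those into a percentage with integer math, values hard-coded for illustration:

```shell
#!/bin/bash
# Hypothetical currentBlock/highestBlock values from eth_syncing
current=0x10e0000
highest=0x10e7a2f

# Integer percentage; shell arithmetic handles the 0x prefix
pct=$(( current * 100 / highest ))
echo "Sync progress: ${pct}%"
```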
## 40.2 Monitoring Tools

### Prometheus Configuration

```yaml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'geth'
    static_configs:
      - targets: ['localhost:6060']
    metrics_path: /debug/metrics/prometheus

  - job_name: 'node_exporter'
    static_configs:
      - targets: ['localhost:9100']
```
### Key Metrics to Monitor

| Metric | Description | Alert Threshold |
|---|---|---|
| `eth_block_header` | Current block number | More than 20 blocks behind network |
| `eth_syncing` | Sync progress | Still true long after initial sync |
| `go_goroutines` | Goroutine count | > 10,000 |
| `go_memstats_alloc_bytes` | Memory allocation | > 80% of available |
| `p2p_peers` | Connected peers | < 5 |
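The "blocks behind" threshold from the table can also be checked directly in a script. A minimal sketch with hard-coded example heights; in practice, query your node and a trusted reference endpoint for the two values:

```shell
#!/bin/bash
# Example heights; replace with live queries in practice
local_block=18500000
network_block=18500035
max_lag=20

# Alert when the node trails the network head by more than max_lag
lag=$((network_block - local_block))
if [ "$lag" -gt "$max_lag" ]; then
  echo "ALERT: node is $lag blocks behind"
else
  echo "OK: lag is $lag blocks"
fi
```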
### Grafana Dashboard JSON

```json
{
  "dashboard": {
    "title": "Ethereum Node Health",
    "panels": [
      {
        "title": "Block Height",
        "type": "graph",
        "targets": [
          { "expr": "eth_block_header", "legendFormat": "Current" },
          { "expr": "eth_network_block", "legendFormat": "Network" }
        ]
      },
      {
        "title": "Peer Count",
        "type": "graph",
        "targets": [
          { "expr": "p2p_peers", "legendFormat": "Peers" }
        ]
      },
      {
        "title": "Memory Usage",
        "type": "graph",
        "targets": [
          { "expr": "go_memstats_alloc_bytes", "legendFormat": "Allocated" }
        ]
      }
    ]
  }
}
```
## 40.3 Alert Examples

### Prometheus Alert Rules

```yaml
groups:
  - name: ethereum
    rules:
      # Node is down
      - alert: NodeDown
        expr: up{job="geth"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Ethereum node is down"
          description: "Node {{ $labels.instance }} has been down for more than 1 minute"

      # Node not synced
      - alert: NodeNotSynced
        expr: (eth_network_block - eth_block_header) > 20
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Node is behind network"
          description: "Node is {{ $value }} blocks behind network"

      # Too few peers
      - alert: FewPeers
        expr: p2p_peers < 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low peer count"
          description: "Node has only {{ $value }} peers"

      # High memory usage (16e9 = 16 GB; adjust to your host's RAM)
      - alert: HighMemoryUsage
        expr: (go_memstats_alloc_bytes / 16e9) > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage"
          description: "Memory usage above 90%"

      # Goroutine leak
      - alert: HighGoroutines
        expr: go_goroutines > 10000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High goroutine count"
          description: "Node has {{ $value }} goroutines"

      # Sync stalled
      - alert: SyncStalled
        expr: rate(eth_block_header[10m]) == 0
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: "Sync has stalled"
          description: "No new blocks in 15 minutes"
```
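The lag expression used by `NodeNotSynced` can also be captured as a recording rule, so dashboards and alerts share one definition. A sketch using the same metric names as above (note that `eth_network_block` assumes an exporter that tracks the network head; the rule name `eth:blocks_behind` is an example):

```yaml
groups:
  - name: ethereum-recording
    rules:
      - record: eth:blocks_behind
        expr: eth_network_block - eth_block_header
```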
### AlertManager Configuration

```yaml
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'critical-alerts'
      continue: true
    - match:
        severity: warning
      receiver: 'warning-alerts'

receivers:
  - name: 'default'
    webhook_configs:
      - url: 'http://localhost:5001/webhook'

  - name: 'critical-alerts'
    webhook_configs:
      - url: 'http://localhost:5001/critical'
    pagerduty_configs:
      - service_key: 'YOUR_PAGERDUTY_KEY'

  - name: 'warning-alerts'
    webhook_configs:
      - url: 'http://localhost:5001/warning'
```
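To avoid paging twice for the same incident, AlertManager can suppress warning-level notifications while a matching critical alert is firing. A sketch using the severities defined above:

```yaml
inhibit_rules:
  - source_match:
      severity: critical
    target_match:
      severity: warning
    equal: ['alertname', 'instance']
```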
## 40.4 Uptime Monitoring Services

### External Monitoring Options

| Service | Features | Cost |
|---|---|---|
| UptimeRobot | HTTP checks, 5 min interval | Free tier available |
| Pingdom | Advanced checks, SMS alerts | Paid |
| Datadog | Full monitoring suite | Paid |
| Grafana Cloud | Metrics + logs + alerts | Paid |
| Custom | Prometheus + AlertManager | Self-hosted |
### Setting Up UptimeRobot

```bash
#!/bin/bash
# Health check for the monitor to call: exit 0 when the RPC
# endpoint answers with HTTP 200, non-zero otherwise
response=$(curl -s -o /dev/null -w "%{http_code}" \
  -X POST http://localhost:8545 \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}')

if [ "$response" = "200" ]; then
  exit 0
else
  exit 1
fi
```
### Container Health Checks

```yaml
services:
  geth:
    image: ethereum/client-go:latest
    healthcheck:
      # NOTE: requires curl inside the image; install it or use a wrapper image
      test: ["CMD", "curl", "-f", "-X", "POST", "http://localhost:8545",
             "-H", "Content-Type: application/json",
             "-d", '{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}']
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 60s
    deploy:
      restart_policy:
        condition: on-failure
        delay: 5s
        max_attempts: 3
```
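Other containers can wait on this health check before starting. A sketch with a hypothetical dependent service (`indexer` and its image are placeholder names):

```yaml
services:
  indexer:
    image: example/indexer:latest   # placeholder image
    depends_on:
      geth:
        condition: service_healthy
```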
## 40.5 Automated Recovery

### Systemd Auto-Restart
```ini
[Unit]
Description=Ethereum Geth Node
After=network.target

[Service]
Type=simple
User=ethereum
ExecStart=/usr/local/bin/geth --config /etc/geth/config.toml
Restart=always
RestartSec=10
RestartSteps=10
RestartMaxDelaySec=60

# Memory limits
MemoryMax=32G
MemoryHigh=24G

# Watchdog (requires sd_notify support in the service)
WatchdogSec=300

[Install]
WantedBy=multi-user.target
```
### Auto-Restart Script

```bash
#!/bin/bash
LOG_FILE="/var/log/node_restart.log"

# Check if node is healthy
if ! curl -s -f -X POST http://localhost:8545 \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}' \
  > /dev/null 2>&1; then

  echo "$(date): Node unhealthy, attempting restart..." >> "$LOG_FILE"

  # Try to restart
  systemctl restart geth

  # Wait for restart
  sleep 30

  # Check again
  if curl -s -f -X POST http://localhost:8545 \
    -H "Content-Type: application/json" \
    -d '{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}' \
    > /dev/null 2>&1; then
    echo "$(date): Node recovered" >> "$LOG_FILE"
  else
    echo "$(date): Node failed to restart, alerting..." >> "$LOG_FILE"
    # Send alert (configure your alerting system)
    curl -X POST https://hooks.example.com/alert \
      -d '{"message":"Node failed to restart"}'
  fi
fi
```
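An unconditional restart loop can mask a persistent failure such as a corrupt database. One way to cap it is a timestamp file that limits restarts per time window; a sketch with example paths and limits (`STATE`, `MAX`, and `WINDOW` are illustrative):

```shell
#!/bin/bash
# Guard: allow at most MAX restarts within WINDOW seconds.
STATE=/tmp/geth_restarts
MAX=3
WINDOW=600

now=$(date +%s)
touch "$STATE"

# Drop timestamps older than the window, then count the rest
awk -v now="$now" -v w="$WINDOW" 'now - $1 < w' "$STATE" > "$STATE.tmp"
mv "$STATE.tmp" "$STATE"
count=$(wc -l < "$STATE")

if [ "$count" -ge "$MAX" ]; then
  echo "Too many restarts in the last $WINDOW seconds; escalating"
  exit 1
fi

echo "$now" >> "$STATE"
echo "Restart allowed ($((count + 1)) of $MAX in window)"
```

Run this guard before `systemctl restart geth` in the script above; a non-zero exit means stop restarting and alert a human instead.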
## 40.6 Comprehensive Health Dashboard

### Dashboard Components
```text
NODE HEALTH DASHBOARD

  Block Height: 18,500,000     Sync Status:  ✓ Synced (5 min ago)
  Peer Count:   25 peers       Memory Usage: 8.2 GB / 16 GB
  CPU Usage:    45%            Disk I/O:     45 MB/s

  Recent Alerts: ✓ No recent alerts
  Uptime: 99.98% (last 30 days)
```
## 40.7 Best Practices

### Checklist

- Enable metrics endpoint (`--metrics`)
- Configure Prometheus to scrape metrics
- Set up Grafana dashboard
- Configure alert rules for all critical metrics
- Set up external uptime monitoring
- Configure auto-restart on failure
- Test recovery procedures
- Document runbooks for each alert
- Review logs daily
- Schedule regular health checks
### Runbook Template

```markdown
# Alert: NodeDown

## Symptoms
- Prometheus reports `up{job="geth"} == 0`
- External monitoring shows node unreachable

## Impact
- No RPC service available
- Potential missed attestations (validator)

## Diagnosis
1. Check if process is running: `systemctl status geth`
2. Check logs: `journalctl -u geth -n 100`
3. Check system resources: `top`, `df -h`

## Resolution
1. If process crashed: `systemctl restart geth`
2. If OOM: increase memory or reduce cache
3. If disk full: clear old data or expand storage
4. If port conflict: check for other processes

## Prevention
- Monitor memory and disk usage
- Set appropriate resource limits
```
## Summary

- Health checks are essential for production nodes
- Use Prometheus + Grafana for comprehensive monitoring
- Configure alerts for all critical metrics
- Set up external monitoring for redundancy
- Implement auto-restart for common failures
- Create runbooks for each alert type
- Test recovery procedures regularly
- Maintain 99%+ uptime with proper monitoring
## Next Chapter

In Chapter 41: Node Software Upgrades, we’ll cover upgrade procedures.
Last Updated: 2026-02-22