Chapter 39: Prometheus and Grafana
Modern Monitoring and Visualization
39.1 Prometheus Architecture
Understanding Prometheus
Prometheus is an open-source monitoring system with a dimensional data model, a flexible query language (PromQL), efficient time-series storage, and modern alerting capabilities. It is designed for reliability and has become the de facto standard for cloud-native monitoring.
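The pull model at the heart of this architecture is simple enough to demonstrate with nothing but the Python standard library: a toy exporter serves plain-text metrics in the Prometheus exposition format over HTTP, and a scraper (here `urllib`, standing in for Prometheus) fetches them. The metric name and value are invented for illustration.

```python
import http.server
import threading
import urllib.request

# Plain-text metrics in the Prometheus exposition format.
# Metric name and value are made up for this demo.
METRICS = (
    "# HELP myapp_requests_total Total requests handled\n"
    "# TYPE myapp_requests_total counter\n"
    "myapp_requests_total 1027\n"
)

class MetricsHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/metrics":
            body = METRICS.encode()
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, *args):  # keep the demo quiet
        pass

def scrape(port):
    # Prometheus performs exactly this kind of HTTP GET every scrape_interval.
    with urllib.request.urlopen(f"http://127.0.0.1:{port}/metrics") as resp:
        return resp.read().decode()

if __name__ == "__main__":
    server = http.server.HTTPServer(("127.0.0.1", 0), MetricsHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    print(scrape(server.server_address[1]))
    server.shutdown()
```

Everything in the stack below builds on this handshake: exporters expose `/metrics`, Prometheus scrapes them on a schedule.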
Prometheus Monitoring Stack

```
  +----------+  +----------+  +----------+  +-----------+
  |   Node   |  |  Docker  |  |  Custom  |  | Blackbox  |
  | Exporter |  | Exporter |  | Exporter |  | Exporter  |
  +----+-----+  +----+-----+  +----+-----+  +-----+-----+
       |             |             |              |
       v             v             v              v
  +---------------------------------------------------+
  |                Prometheus Server                  |
  |   - Scrapes targets at intervals                  |
  |   - Stores time-series data                       |
  |   - Evaluates alerting rules                      |
  |   - Provides HTTP API                             |
  +------------+-----------------------+--------------+
               |                       |
               v                       v
  +-------------------+      +------------------+
  |      Grafana      |      |   Alertmanager   |
  |  (Visualization)  |      |     (Alerts)     |
  +-------------------+      +---------+--------+
                                       |
                                       v
                             +------------------+
                             |   Notification   |
                             |  (Email, Slack,  |
                             |    PagerDuty)    |
                             +------------------+
```

Data flow:
1. Exporters expose metrics on HTTP endpoints
2. Prometheus scrapes those endpoints at configured intervals
3. Metrics are stored in the time-series database
4. Alertmanager handles alert routing and notification
5. Grafana visualizes the data

Installation
```shell
# Option 1: Binary download
wget https://github.com/prometheus/prometheus/releases/download/v2.47.0/prometheus-2.47.0.linux-amd64.tar.gz
tar xzf prometheus-*.tar.gz
cd prometheus-*

# Option 2: Distribution package
# Debian/Ubuntu (prometheus is available in the default repositories)
sudo apt update && sudo apt install prometheus

# RHEL/CentOS: there is no official Prometheus RPM repository;
# use the binary tarball above, or a community/EPEL package if
# your organization provides one
```

Creating Prometheus User and Directories
```shell
# Create a system user
sudo useradd --no-create-home --shell /usr/sbin/nologin prometheus

# Create directories
sudo mkdir -p /etc/prometheus /var/lib/prometheus
sudo chown prometheus:prometheus /var/lib/prometheus

# Copy binaries
sudo cp prometheus promtool /usr/local/bin/
sudo chown prometheus:prometheus /usr/local/bin/prometheus /usr/local/bin/promtool
```

39.2 Prometheus Configuration
Main Configuration File
```yaml
# /etc/prometheus/prometheus.yml
global:
  scrape_interval: 15s        # How often to scrape targets
  evaluation_interval: 15s    # How often to evaluate rules
  external_labels:            # Labels added to all metrics
    cluster: 'production'
    environment: 'us-east-1'

# Alerting configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

# Rule files
rule_files:
  - "/etc/prometheus/rules/*.yml"

# Scrape configurations
scrape_configs:
  # Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
        labels:
          service: 'prometheus'

  # Node Exporter (system metrics)
  - job_name: 'node'
    scrape_interval: 10s
    static_configs:
      - targets: ['localhost:9100']
        labels:
          service: 'node-exporter'

  # Custom application metrics
  - job_name: 'custom-app'
    metrics_path: '/metrics'
    scrape_interval: 30s
    static_configs:
      - targets: ['app-server:8080']
        labels:
          app: 'myapplication'
          env: 'production'

  # Blackbox exporter
  - job_name: 'blackbox'
    metrics_path: '/probe'
    params:
      module: [http_2xx]
    static_configs:
      - targets: ['https://example.com', 'https://api.example.com']
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - source_labels: [__address__]
        target_label: __address__
        replacement: 'blackbox-exporter:9115'
```

Service File
```ini
# /etc/systemd/system/prometheus.service
[Unit]
Description=Prometheus Monitoring System
Documentation=https://prometheus.io/docs/
After=network.target

[Service]
Type=simple
User=prometheus
Group=prometheus
ExecStart=/usr/local/bin/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus \
  --storage.tsdb.retention.time=15d \
  --web.console.libraries=/usr/local/share/prometheus/console_libraries \
  --web.console.templates=/usr/local/share/prometheus/consoles \
  --web.enable-lifecycle
Restart=on-failure
RestartSec=5s

[Install]
WantedBy=multi-user.target
```

39.3 Exporters
Node Exporter (System Metrics)
```shell
# Download and install
wget https://github.com/prometheus/node_exporter/releases/download/v1.6.1/node_exporter-1.6.1.linux-amd64.tar.gz
tar xzf node_exporter-*.tar.gz
sudo cp node_exporter-*/node_exporter /usr/local/bin/
```
```ini
# /etc/systemd/system/node_exporter.service
[Unit]
Description=Node Exporter
After=network.target

[Service]
Type=simple
User=prometheus
ExecStart=/usr/local/bin/node_exporter
Restart=always
RestartSec=5s

[Install]
WantedBy=multi-user.target
```
```shell
# Enable specific collectors (cpu, meminfo, and diskstats are on by default)
/usr/local/bin/node_exporter --collector.cpu --collector.meminfo --collector.diskstats

# Disable a collector
/usr/local/bin/node_exporter --no-collector.arp
```

Common Exporters
Commonly used exporters and their default ports:

```
Exporter                 Port   Metrics
----------------------   ----   ---------------------------
node_exporter            9100   CPU, memory, disk, network
blackbox_exporter        9115   HTTP, DNS, TCP, ICMP probes
cadvisor                 8080   Docker containers
postgres_exporter        9187   PostgreSQL
mysqld_exporter          9104   MySQL
redis_exporter           9121   Redis
nginx_exporter           9113   Nginx
elasticsearch_exporter   9114   Elasticsearch
jmx_exporter             9404   Java/JVM
haproxy_exporter         9101   HAProxy
windows_exporter         9182   Windows hosts
snmp_exporter            9116   SNMP devices
```

Custom Exporter Example
```python
#!/usr/bin/env python3
from prometheus_client import start_http_server, Counter, Gauge
import random
import time

# Define metrics: a counter for a monotonically increasing total,
# gauges for values that move up and down
requests_total = Counter('myapp_requests', 'Total requests')  # exposed as myapp_requests_total
response_time = Gauge('myapp_response_time_ms', 'Response time in ms')
active_users = Gauge('myapp_active_users', 'Active users')

def collect_metrics():
    # Simulate metric collection
    requests_total.inc(random.randint(1, 10))
    response_time.set(random.randint(10, 500))
    active_users.set(random.randint(100, 1000))

if __name__ == '__main__':
    # Serve metrics on http://localhost:8000/metrics
    start_http_server(8000)
    print("Exporter running on port 8000")

    while True:
        collect_metrics()
        time.sleep(15)
```

39.4 PromQL (Prometheus Query Language)
Basic Queries
```promql
# Per-target health (1 = up, 0 = down)
up

# Label matching
up{job="node"}
up{job="node", instance="localhost:9100"}

# CPU idle time (from node_exporter)
node_cpu_seconds_total{mode="idle"}

# CPU usage percentage
100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory usage percentage
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100

# Disk usage percentage
100 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} * 100)

# Network traffic (bytes per second)
rate(node_network_receive_bytes_total[5m])
rate(node_network_transmit_bytes_total[5m])
```

Functions
```promql
# rate() - per-second average rate over the window
rate(http_requests_total[5m])

# increase() - total increase over the window
increase(http_requests_total[1h])
```
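The behavior of `rate()` on counters, including its handling of counter resets, can be sketched in a few lines of Python. This is a simplified model: real PromQL also extrapolates to the edges of the window, which this toy version skips.

```python
def simple_rate(samples):
    """Per-second rate over a list of (timestamp, counter_value) samples.

    Toy version of PromQL's rate(): a counter reset (value going down,
    e.g. after a process restart) is treated as a restart from zero.
    """
    if len(samples) < 2:
        return 0.0
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # On reset, the increase since the last sample is the new value itself
        total += cur - prev if cur >= prev else cur
    duration = samples[-1][0] - samples[0][0]
    return total / duration

# 60s window: counter climbs, resets (restart), climbs again
samples = [(0, 100), (15, 160), (30, 220), (45, 20), (60, 80)]
print(simple_rate(samples))  # (60 + 60 + 20 + 60) / 60 ≈ 3.33 requests/sec
```

This is also why dashboards graph `rate(http_requests_total[5m])` rather than the raw counter: the counter's absolute value is meaningless across restarts, but its rate is not.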
```promql
# count - number of series
count(up)

# avg/max/min
avg(cpu_usage)
max(cpu_usage)
min(cpu_usage)

# sum
sum(rate(http_requests_total[5m]))

# topk/bottomk
topk(5, http_requests_total)
bottomk(5, response_time_ms)

# histogram_quantile - percentiles from histogram buckets
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
```
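`histogram_quantile()` estimates percentiles by linear interpolation inside cumulative buckets. A toy reimplementation makes the mechanics concrete (the bucket bounds and counts below are invented):

```python
def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count), sorted by bound and
    ending with float('inf') -- like Prometheus's le="..." buckets.
    Uses linear interpolation within the target bucket, as PromQL does.
    """
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                return prev_bound  # cannot interpolate into the +Inf bucket
            width = bound - prev_bound
            in_bucket = count - prev_count
            return prev_bound + width * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count

# 100 requests: 50 under 0.1s, 90 under 0.5s, 99 under 1s
buckets = [(0.1, 50), (0.5, 90), (1.0, 99), (float("inf"), 100)]
print(histogram_quantile(0.95, buckets))  # falls in the (0.5, 1.0] bucket, ≈ 0.778
```

The interpolation is why bucket boundaries matter: a 95th percentile landing in a wide bucket is only a rough estimate.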
```promql
# Aggregate by label
avg by (job) (cpu_usage)
sum by (service) (requests_total)
```
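A `by` clause is just a group-by over label values. A minimal Python analogue of `sum by (job)` (sample data invented) shows the idea:

```python
from collections import defaultdict

# Labeled samples, like an instant-vector result: ({labels}, value)
series = [
    ({"job": "api", "instance": "a:8080"}, 3.0),
    ({"job": "api", "instance": "b:8080"}, 5.0),
    ({"job": "web", "instance": "c:8080"}, 2.0),
]

def sum_by(label, series):
    """PromQL's `sum by (label) (...)` is a plain group-by-and-sum."""
    out = defaultdict(float)
    for labels, value in series:
        out[labels[label]] += value
    return dict(out)

print(sum_by("job", series))  # {'api': 8.0, 'web': 2.0}
```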
```promql
# Aggregate over all labels except the listed ones
avg without (instance) (cpu_usage)
```

Aggregation Operators
```promql
# Sum
sum(requests_total)

# Count
count(up)

# Min/Max
min(node_memory_MemAvailable_bytes)
max(node_memory_MemTotal_bytes)

# Avg
avg(cpu_usage)

# Group
group(up{job="prometheus"})

# Standard deviation / variance
stddev(response_time)
stdvar(response_time)

# TopK/BottomK
topk(10, cpu_usage)
bottomk(10, cpu_usage)

# Count series per value of a label
count_values("version", build_version)
```

39.5 Alerting
Alert Rules
```yaml
# /etc/prometheus/rules/node_alerts.yml
groups:
  - name: node_alerts
    interval: 30s
    rules:
      # High CPU alert
      - alert: HighCPU
        expr: 100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is above 80% for more than 5 minutes"

      # High memory alert
      - alert: HighMemory
        expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is above 85%"

      # Disk space alert
      - alert: DiskSpace
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} * 100) < 15
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Disk space is below 15%"

      # Instance down
      - alert: InstanceDown
        expr: up == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
          description: "{{ $labels.job }} target has been down for more than 2 minutes"

      # High request rate
      - alert: HighRequestRate
        expr: rate(http_requests_total[5m]) > 1000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High request rate on {{ $labels.instance }}"
```

Alertmanager Configuration
```yaml
# /etc/alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alerts@example.com'
  smtp_auth_username: 'alerts'
  smtp_auth_password: 'password'

route:
  group_by: ['alertname', 'cluster']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'team-notifications'
  routes:
    - match:
        severity: critical
      receiver: 'critical-alerts'
      continue: true
    - match:
        service: database
      receiver: 'dba-team'

receivers:
  - name: 'team-notifications'
    email_configs:
      - to: 'team@example.com'
        send_resolved: true
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/xxx'
        channel: '#alerts'
        send_resolved: true
    webhook_configs:
      - url: 'https://webhook.example.com/alerts'

  - name: 'critical-alerts'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/xxx'
        channel: '#critical'
        send_resolved: true

  - name: 'dba-team'
    email_configs:
      - to: 'dba@example.com'
```

39.6 Grafana Installation and Configuration
Installation
```shell
# Debian/Ubuntu
wget -q -O /usr/share/keyrings/grafana.asc https://packages.grafana.com/gpg.key
echo "deb [signed-by=/usr/share/keyrings/grafana.asc] https://packages.grafana.com/oss/deb stable main" | sudo tee /etc/apt/sources.list.d/grafana.list
sudo apt update
sudo apt install grafana

# RHEL/CentOS
sudo tee /etc/yum.repos.d/grafana.repo <<EOF
[grafana]
name=Grafana
baseurl=https://packages.grafana.com/oss/rpm
repo_gpgcheck=1
enabled=1
gpgcheck=1
gpgkey=https://packages.grafana.com/gpg.key
EOF
sudo dnf install grafana

# Start the service
sudo systemctl enable --now grafana-server
sudo systemctl status grafana-server
```

Configuration
```ini
# /etc/grafana/grafana.ini
[server]
protocol = http
http_addr = 0.0.0.0
http_port = 3000
domain = grafana.example.com
root_url = %(protocol)s://%(domain)s/grafana

[database]
type = sqlite3
path = /var/lib/grafana/grafana.db

[security]
admin_user = admin
admin_password = changeme    ; change this before exposing Grafana
secret_key = SW2YcwTIb9zpOOhoPsMm

[users]
allow_sign_up = false
allow_org_create = false
default_role = viewer

[auth.anonymous]
enabled = false

[log]
mode = console
level = info

[log.console]
level = info
format = text
```

39.7 Grafana Dashboards
Creating Dashboards
```json
{
  "dashboard": {
    "title": "System Overview",
    "tags": ["system", "overview"],
    "timezone": "browser",
    "panels": [
      {
        "title": "CPU Usage",
        "type": "graph",
        "targets": [
          {
            "expr": "100 - (avg by (instance) (irate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
            "legendFormat": "{{instance}}"
          }
        ],
        "gridPos": {"x": 0, "y": 0, "w": 12, "h": 8}
      },
      {
        "title": "Memory Usage",
        "type": "graph",
        "targets": [
          {
            "expr": "(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100",
            "legendFormat": "{{instance}}"
          }
        ],
        "gridPos": {"x": 12, "y": 0, "w": 12, "h": 8}
      },
      {
        "title": "Disk Usage",
        "type": "gauge",
        "targets": [
          {
            "expr": "100 - (node_filesystem_avail_bytes{mountpoint=\"/\"} / node_filesystem_size_bytes{mountpoint=\"/\"} * 100)"
          }
        ],
        "gridPos": {"x": 0, "y": 8, "w": 8, "h": 8}
      },
      {
        "title": "Network Traffic",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(node_network_receive_bytes_total{device!=\"lo\"}[5m])",
            "legendFormat": "RX {{instance}}"
          },
          {
            "expr": "rate(node_network_transmit_bytes_total{device!=\"lo\"}[5m])",
            "legendFormat": "TX {{instance}}"
          }
        ],
        "gridPos": {"x": 8, "y": 8, "w": 16, "h": 8}
      }
    ]
  }
}
```

Variables
```promql
# Dashboard variables
# Define in Dashboard settings > Variables

# Label values
label_values(node_exporter_build_info, job)
label_values(node_cpu_seconds_total{job="$job"}, instance)

# All label names
label_names()

# Query result (raw samples from an instant query)
query_result(up{job="node"})
```
```promql
# Variable formatting in queries
${var:regex}
```

39.8 Useful PromQL Queries
System Metrics Queries
```promql
# CPU time share by mode
avg by (mode) (irate(node_cpu_seconds_total[5m])) * 100

# Memory breakdown
node_memory_MemTotal_bytes
node_memory_MemAvailable_bytes
node_memory_MemFree_bytes
node_memory_Buffers_bytes
node_memory_Cached_bytes
```
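Note that MemAvailable, not MemFree, is the right basis for usage calculations, since it includes reclaimable cache. Applying the memory-usage formula from 39.4 to sample values (the byte counts here are invented):

```python
# Sample values (bytes) standing in for the node_exporter gauges above
mem_total = 16 * 1024**3       # node_memory_MemTotal_bytes
mem_available = 4 * 1024**3    # node_memory_MemAvailable_bytes

# Same arithmetic as the PromQL memory-usage expression
usage_pct = (mem_total - mem_available) / mem_total * 100
print(f"{usage_pct:.1f}% used")  # 75.0% used
```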
```promql
# Disk I/O
rate(node_disk_reads_completed_total[5m])
rate(node_disk_writes_completed_total[5m])
rate(node_disk_read_bytes_total[5m])
rate(node_disk_written_bytes_total[5m])

# Network connections
node_netstat_Tcp_CurrEstab
node_netstat_Tcp_EstabResets
node_netstat_Tcp_PassiveOpens
node_netstat_Tcp_ActiveOpens

# Load average
node_load1
node_load5
node_load15

# Filesystem by mount point
node_filesystem_size_bytes{mountpoint="/"}
node_filesystem_avail_bytes{mountpoint="/"}
```

Application Metrics
```promql
# Request rate
rate(http_requests_total[5m])

# 5xx error rate
rate(http_requests_total{status=~"5.."}[5m])
```
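A raw 5xx rate is hard to set thresholds for on its own; in practice the error *ratio* is usually more useful (same metric names as above):

```promql
# Errors as a percentage of all requests
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m])) * 100
```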
```promql
# Response time percentiles (from a histogram)
histogram_quantile(0.50, rate(http_request_duration_seconds_bucket[5m]))
histogram_quantile(0.90, rate(http_request_duration_seconds_bucket[5m]))
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))

# Active connections
sum(http_connections{state="active"})
sum(http_connections{state="idle"})

# Queue depth
application_queue_length
```

39.9 Interview Questions
Basic Questions
Section titled “Basic Questions”-
What is Prometheus?
- Open-source monitoring and alerting system with time-series database
-
How does Prometheus differ from traditional monitoring?
- Pull-based model, dimensional data model, PromQL, no reliance on distributed storage
-
What is an exporter in Prometheus?
- Tool that exposes existing metrics in Prometheus format
-
What is PromQL?
- Prometheus Query Language for querying time-series data
-
What is the difference between push and pull monitoring?
- Push: targets send metrics to collector; Pull: collector fetches from targets
Intermediate Questions
1. Explain the Prometheus data model
   - Each sample has a metric name, labels (key-value pairs), a timestamp, and a value

2. What is the difference between a gauge and a counter?
   - A gauge can go up and down; a counter only increases (and resets to zero on restart)

3. How do you create an alert in Prometheus?
   - Define alert rules in files listed under rule_files, then configure Alertmanager for routing and notification

4. What are recording rules?
   - Pre-computed queries stored as new time series, so dashboards and alerts can query them faster

5. How does service discovery work in Prometheus?
   - Prometheus discovers scrape targets automatically via DNS, Consul, Kubernetes, EC2, and other mechanisms
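The recording-rules answer above can be made concrete with a minimal rules file. The rule name follows the common level:metric:operations convention and is illustrative:

```yaml
# /etc/prometheus/rules/recording.yml
groups:
  - name: cpu_recording
    interval: 30s
    rules:
      - record: instance:cpu_usage:percent
        expr: 100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
```

Dashboards can then query `instance:cpu_usage:percent` directly instead of re-evaluating the full expression on every refresh.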
Advanced Questions
1. What are the best practices for metric naming?
   - Prefix with the application name, use base units (seconds, bytes), and keep label cardinality low

2. How do you scale Prometheus?
   - Federation, or Thanos/Cortex for long-term storage and horizontal scaling

3. Explain Prometheus high availability
   - Run identical replicas with the same scrape configuration, distinguish them with external labels, and cluster Alertmanager to deduplicate notifications

4. What is the difference between a histogram and a summary?
   - Histogram: predefined buckets, with quantiles computed server-side via histogram_quantile(); summary: quantiles computed client-side, which cannot be aggregated across instances

5. How do you monitor applications that don't expose Prometheus metrics?
   - Use the Pushgateway for short-lived batch jobs, or an appropriate exporter
Summary
Quick Reference

```
Prometheus:
  Port 9090              Web UI, API
  scrape_interval        Default 15s
  promtool               CLI tool for validation

Common exporters:
  node_exporter:9100     System metrics
  blackbox:9115          Blackbox probing
  cadvisor:8080          Container metrics
  postgres:9187          PostgreSQL metrics

Key PromQL functions:
  rate()                 Per-second rate
  increase()             Total increase
  histogram_quantile()   Percentiles
  sum/avg/count          Aggregation
  topk/bottomk           Top/bottom values

Grafana:
  Port 3000              Default web interface
  Data sources           Prometheus, Graphite, InfluxDB
  Dashboards             JSON-based, shareable
```