
Prometheus & Grafana


Prometheus is an open-source monitoring system with a dimensional data model, a flexible query language (PromQL), efficient time-series storage, and modern alerting capabilities. It is designed for reliability and has become the de facto standard for cloud-native monitoring.

Prometheus Architecture
Prometheus Monitoring Stack

  Data Sources
  +----------+   +----------+   +----------+   +----------+
  |   Node   |   |  Docker  |   |  Custom  |   | Blackbox |
  | Exporter |   | Exporter |   | Exporter |   | Exporter |
  +----+-----+   +----+-----+   +----+-----+   +----+-----+
       |              |              |              |
       v              v              v              v
  +--------------------------------------------------------+
  |                   Prometheus Server                    |
  |  - Scrapes targets at intervals                        |
  |  - Stores time-series data                             |
  |  - Evaluates alerting rules                            |
  |  - Provides HTTP API                                   |
  +---------+---------------------------+------------------+
            |                           |
            v                           v
  +-------------------+        +------------------+
  |      Grafana      |        |   Alertmanager   |
  |  (Visualization)  |        |     (Alerts)     |
  +-------------------+        +--------+---------+
                                        |
                                        v
                               +------------------+
                               |   Notification   |
                               |  (Email, Slack,  |
                               |    PagerDuty)    |
                               +------------------+

Data Flow:
  1. Exporters expose metrics on HTTP endpoints
  2. Prometheus scrapes metrics at configured intervals
  3. Metrics are stored in the time-series database
  4. Alertmanager handles alerts and routes notifications
  5. Grafana queries and visualizes the data
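Step 1 of the data flow — an exporter exposing metrics over an HTTP endpoint — can be sketched with the Python standard library alone. The metric name (demo_requests_total) and the port are illustrative, not part of any real exporter:

```python
# Minimal "exporter" sketch: serve the Prometheus text exposition format on
# /metrics using only the standard library (illustrative, not production code).
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

REQUEST_COUNT = 0  # stand-in for real application state


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = (
            "# HELP demo_requests_total Total demo requests.\n"
            "# TYPE demo_requests_total counter\n"
            f"demo_requests_total {REQUEST_COUNT}\n"
        )
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.end_headers()
        self.wfile.write(body.encode())

    def log_message(self, *args):
        pass  # keep the demo quiet


def serve(port=0):
    """Start the metrics endpoint in a background thread; port 0 = ephemeral."""
    server = HTTPServer(("127.0.0.1", port), MetricsHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Calling `serve(8000)` and pointing a scrape job at localhost:8000 would let Prometheus collect demo_requests_total; a real exporter would normally use the official prometheus_client library instead.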
Installing Prometheus
# Option 1: Binary download
wget https://github.com/prometheus/prometheus/releases/download/v2.47.0/prometheus-2.47.0.linux-amd64.tar.gz
tar xzf prometheus-*.tar.gz
cd prometheus-*
# Option 2: From the distribution's package manager
# Debian/Ubuntu (prometheus is in the default repositories)
sudo apt update && sudo apt install prometheus
# RHEL/CentOS: there is no official upstream repository; use the
# binary tarball above or a community RPM repository, then:
sudo dnf install prometheus
Setting Up the Prometheus User and Directories
# Create user
sudo useradd --no-create-home --shell /usr/sbin/nologin prometheus
# Create directories
sudo mkdir -p /etc/prometheus /var/lib/prometheus
sudo chown prometheus:prometheus /var/lib/prometheus
# Copy binaries
sudo cp prometheus /usr/local/bin/
sudo cp promtool /usr/local/bin/
sudo chown prometheus:prometheus /usr/local/bin/prometheus /usr/local/bin/promtool

/etc/prometheus/prometheus.yml
global:
  scrape_interval: 15s       # How often to scrape targets
  evaluation_interval: 15s   # How often to evaluate rules
  external_labels:           # Labels added to all metrics
    cluster: 'production'
    environment: 'us-east-1'

# Alerting configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

# Rule files
rule_files:
  - "/etc/prometheus/rules/*.yml"

# Scrape configurations
scrape_configs:
  # Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
        labels:
          service: 'prometheus'

  # Node Exporter (system metrics)
  - job_name: 'node'
    scrape_interval: 10s
    static_configs:
      - targets: ['localhost:9100']
        labels:
          service: 'node-exporter'

  # Custom metrics
  - job_name: 'custom-app'
    metrics_path: '/metrics'
    scrape_interval: 30s
    static_configs:
      - targets: ['app-server:8080']
        labels:
          app: 'myapplication'
          env: 'production'

  # Blackbox exporter
  - job_name: 'blackbox'
    metrics_path: '/probe'
    params:
      module: [http_2xx]
    static_configs:
      - targets: ['https://example.com', 'https://api.example.com']
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - source_labels: [__address__]
        target_label: __address__
        replacement: 'blackbox-exporter:9115'
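The blackbox relabel_configs are worth tracing by hand. A rough Python simulation of what those three rules do to a single target (illustrative only; this is not Prometheus's actual relabeling engine):

```python
def blackbox_relabel(labels):
    """Simulate the three blackbox relabel rules on one target's label set."""
    # Rule 1: copy the scrape target URL into __param_target,
    # which becomes the ?target= query parameter of /probe
    labels["__param_target"] = labels["__address__"]
    # Rule 2: keep the probed URL as the human-readable instance label
    labels["instance"] = labels["__param_target"]
    # Rule 3: point the actual scrape at the blackbox exporter itself
    labels["__address__"] = "blackbox-exporter:9115"
    return labels
```

So `blackbox_relabel({"__address__": "https://example.com"})` ends with instance set to the probed URL while the scrape itself goes to blackbox-exporter:9115.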
/etc/systemd/system/prometheus.service
[Unit]
Description=Prometheus Monitoring System
Documentation=https://prometheus.io/docs/
After=network.target

[Service]
Type=simple
User=prometheus
Group=prometheus
ExecStart=/usr/local/bin/prometheus \
    --config.file=/etc/prometheus/prometheus.yml \
    --storage.tsdb.path=/var/lib/prometheus \
    --storage.tsdb.retention.time=15d \
    --web.console.libraries=/usr/local/share/prometheus/console_libraries \
    --web.console.templates=/usr/local/share/prometheus/consoles \
    --web.enable-lifecycle
Restart=on-failure
RestartSec=5s

[Install]
WantedBy=multi-user.target

Installing Node Exporter
# Download
wget https://github.com/prometheus/node_exporter/releases/download/v1.6.1/node_exporter-1.6.1.linux-amd64.tar.gz
tar xzf node_exporter-*.tar.gz
sudo cp node_exporter-*/node_exporter /usr/local/bin/
# Service file
# /etc/systemd/system/node_exporter.service
[Unit]
Description=Node Exporter
After=network.target
[Service]
Type=simple
User=prometheus
ExecStart=/usr/local/bin/node_exporter
Restart=always
RestartSec=5s
[Install]
WantedBy=multi-user.target
# Run with specific collectors enabled
/usr/local/bin/node_exporter --collector.cpu --collector.meminfo --collector.diskstats
# Disable collectors
/usr/local/bin/node_exporter --no-collector.arp
Prometheus Exporters
Common exporters (ports are the conventional defaults):

  +------------------------+-------+-------------------------------+
  | Exporter               | Port  | Metrics                       |
  |------------------------+-------+-------------------------------|
  | node_exporter          | 9100  | CPU, memory, disk, network    |
  | blackbox_exporter      | 9115  | HTTP, DNS, TCP, ICMP          |
  | cadvisor               | 8080  | Docker containers             |
  | postgres_exporter      | 9187  | PostgreSQL metrics            |
  | mysqld_exporter        | 9104  | MySQL metrics                 |
  | redis_exporter         | 9121  | Redis metrics                 |
  | nginx_exporter         | 9113  | Nginx metrics                 |
  | elasticsearch_exporter | 9114  | Elasticsearch metrics         |
  | jmx_exporter           | 9404  | Java/JVM metrics              |
  | haproxy_exporter       | 9101  | HAProxy metrics               |
  | windows_exporter       | 9182  | Windows metrics               |
  | snmp_exporter          | 9116  | SNMP metrics                  |
  +------------------------+-------+-------------------------------+
my_exporter.py
#!/usr/bin/env python3
from prometheus_client import start_http_server, Counter, Gauge
import random
import time

# Define metrics: a Counter for monotonically increasing totals,
# Gauges for values that can go up and down
requests_total = Counter('myapp_requests_total', 'Total requests')
response_time = Gauge('myapp_response_time_ms', 'Response time in ms')
active_users = Gauge('myapp_active_users', 'Active users')

def collect_metrics():
    # Simulate metric collection
    requests_total.inc(random.randint(1, 10))
    response_time.set(random.randint(10, 500))
    active_users.set(random.randint(100, 1000))

if __name__ == '__main__':
    # Start the metrics HTTP server on port 8000
    start_http_server(8000)
    print("Exporter running on port 8000")
    while True:
        collect_metrics()
        time.sleep(15)

PromQL Query Examples

# All metrics
up
# Metrics with label matching
up{job="node"}
up{job="node", instance="localhost:9100"}
# CPU idle (from node_exporter)
node_cpu_seconds_total{mode="idle"}
# Calculate CPU usage
100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory usage
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100
# Disk usage
100 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} * 100)
# Network traffic (bytes per second)
rate(node_network_receive_bytes_total[5m])
rate(node_network_transmit_bytes_total[5m])
# Rate - calculate per-second rate
rate(http_requests_total[5m])
# Increase - total increase over time
increase(http_requests_total[1h])
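For intuition, rate() and increase() can be approximated in a few lines: sum the positive deltas between successive samples (a drop means the counter reset), then divide by the window for rate(). This is a simplification — real Prometheus also extrapolates to the window boundaries:

```python
def increase(samples):
    """Approximate increase(): samples is a list of (timestamp_s, counter_value)."""
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # a decrease means the counter reset to 0 and climbed back to `cur`
        total += cur - prev if cur >= prev else cur
    return total

def rate(samples):
    """Approximate rate(): per-second increase over the sampled window."""
    window = samples[-1][0] - samples[0][0]
    return increase(samples) / window

# counter resets between t=15 and t=30: increase is 30 + 10 = 40
samples = [(0, 100), (15, 130), (30, 10)]
```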
# Count - number of series
count(up)
# Avg/Max/Min
avg(cpu_usage)
max(cpu_usage)
min(cpu_usage)
# Sum
sum(rate(http_requests_total[5m]))
# TopK/BottomK
topk(5, http_requests_total)
bottomk(5, response_time_ms)
# Histogram quantiles
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
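histogram_quantile() estimates a quantile from cumulative bucket counts by linear interpolation inside the bucket that crosses the target rank. A simplified model of the idea (Prometheus's real implementation has additional edge-case handling):

```python
def histogram_quantile(q, buckets):
    """buckets: sorted (upper_bound, cumulative_count) pairs, last bound = inf."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                return prev_bound  # quantile falls in the open-ended bucket
            if count == prev_count:
                return bound
            # linear interpolation within the crossing bucket
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count

# 95th percentile: rank 95 falls in the (0.5, 1.0] bucket
buckets = [(0.1, 50), (0.5, 90), (1.0, 100), (float("inf"), 100)]
```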
# By label
avg by (job) (cpu_usage)
sum by (service) (requests_total)
# Without label
avg without (instance) (cpu_usage)
# Sum
sum(requests_total)
# Count
count(up)
# Min/Max
min(node_memory_MemAvailable_bytes)
max(node_memory_MemTotal_bytes)
# Avg
avg(cpu_usage)
# Group
group(up{job="prometheus"})
# Stddev/Stdvar (standard deviation/variance)
stddev(response_time)
stdvar(response_time)
# TopK/BottomK
topk(10, cpu_usage)
bottomk(10, cpu_usage)
# Count values
count_values("version", build_version)

Alerting Rules

/etc/prometheus/rules/alerts.yml
groups:
  - name: node_alerts
    interval: 30s
    rules:
      # High CPU alert
      - alert: HighCPU
        expr: 100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is above 80% for more than 5 minutes"

      # High memory alert
      - alert: HighMemory
        expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is above 85%"

      # Disk space alert
      - alert: DiskSpace
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} * 100) < 15
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Disk space is below 15%"

      # Instance down
      - alert: InstanceDown
        expr: up == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
          description: "{{ $labels.job }} target has been down for more than 2 minutes"

      # High request rate
      - alert: HighRequestRate
        expr: rate(http_requests_total[5m]) > 1000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High request rate on {{ $labels.instance }}"
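The `for:` clause in these rules means an alert first goes pending and only fires once its expression has held continuously for the configured duration. A toy state machine for a single alert (timestamps in seconds, illustrative only):

```python
def alert_state(evals, for_seconds):
    """evals: list of (timestamp_s, expr_true) per evaluation. Returns final state."""
    pending_since = None
    state = "inactive"
    for ts, expr_true in evals:
        if not expr_true:
            pending_since, state = None, "inactive"  # condition cleared: reset
            continue
        if pending_since is None:
            pending_since = ts  # condition first observed: start the clock
        state = "firing" if ts - pending_since >= for_seconds else "pending"
    return state
```

With for_seconds=300 (the 5m used above), two true evaluations 30s apart leave the alert pending; only after 300s of continuous truth does it fire, and any false evaluation resets the clock.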
Alertmanager Configuration

/etc/alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alerts@example.com'
  smtp_auth_username: 'alerts'
  smtp_auth_password: 'password'

route:
  group_by: ['alertname', 'cluster']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'team-notifications'
  routes:
    - match:
        severity: critical
      receiver: 'critical-alerts'
      continue: true
    - match:
        service: database
      receiver: 'dba-team'

receivers:
  - name: 'team-notifications'
    email_configs:
      - to: 'team@example.com'
        send_resolved: true
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/xxx'
        channel: '#alerts'
        send_resolved: true
    webhook_configs:
      - url: 'https://webhook.example.com/alerts'
  - name: 'critical-alerts'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/xxx'
        channel: '#critical'
        send_resolved: true
  - name: 'dba-team'
    email_configs:
      - to: 'dba@example.com'
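The routing tree is easier to reason about as code. A hypothetical simulation of Alertmanager's route walk with `continue` (receiver names are taken from the example config; this is not Alertmanager's real implementation):

```python
# Ordered child routes mirroring the example config; `continue: true`
# lets an alert keep matching later sibling routes.
ROUTES = [
    {"match": {"severity": "critical"}, "receiver": "critical-alerts", "continue": True},
    {"match": {"service": "database"}, "receiver": "dba-team", "continue": False},
]

def route_alert(labels, routes=ROUTES, root_receiver="team-notifications"):
    """Return the list of receivers an alert's labels would be delivered to."""
    receivers = []
    for route in routes:
        if all(labels.get(k) == v for k, v in route["match"].items()):
            receivers.append(route["receiver"])
            if not route["continue"]:
                break  # without continue: true, matching stops at this route
    return receivers or [root_receiver]  # unmatched alerts fall back to the root
```

A critical database alert hits both child routes (because of `continue: true`), while an unmatched warning falls through to the root team-notifications receiver.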

39.6 Grafana Installation and Configuration

# Debian/Ubuntu
wget -q -O /usr/share/keyrings/grafana.asc https://packages.grafana.com/gpg.key
echo "deb [signed-by=/usr/share/keyrings/grafana.asc] https://packages.grafana.com/oss/deb stable main" | sudo tee /etc/apt/sources.list.d/grafana.list
sudo apt update
sudo apt install grafana
# RHEL/CentOS
sudo tee /etc/yum.repos.d/grafana.repo <<EOF
[grafana]
name=Grafana
baseurl=https://packages.grafana.com/oss/rpm
enabled=1
gpgcheck=1
gpgkey=https://packages.grafana.com/gpg.key
EOF
sudo dnf install grafana
# Start service
sudo systemctl enable --now grafana-server
sudo systemctl status grafana-server
/etc/grafana/grafana.ini
[server]
protocol = http
http_addr = 0.0.0.0
http_port = 3000
domain = grafana.example.com
root_url = %(protocol)s://%(domain)s/grafana
[database]
type = sqlite3
path = /var/lib/grafana/grafana.db
[security]
admin_user = admin
admin_password = changeme
secret_key = SW2YcwTIb9zpOOhoPsMm
[users]
allow_sign_up = false
allow_org_create = false
default_role = viewer
[auth.anonymous]
enabled = false
[log]
mode = console
level = info
[log.console]
level = info
format = text

{
  "dashboard": {
    "title": "System Overview",
    "tags": ["system", "overview"],
    "timezone": "browser",
    "panels": [
      {
        "title": "CPU Usage",
        "type": "graph",
        "targets": [
          {
            "expr": "100 - (avg by (instance) (irate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
            "legendFormat": "{{instance}}"
          }
        ],
        "gridPos": {"x": 0, "y": 0, "w": 12, "h": 8}
      },
      {
        "title": "Memory Usage",
        "type": "graph",
        "targets": [
          {
            "expr": "(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100",
            "legendFormat": "{{instance}}"
          }
        ],
        "gridPos": {"x": 12, "y": 0, "w": 12, "h": 8}
      },
      {
        "title": "Disk Usage",
        "type": "gauge",
        "targets": [
          {
            "expr": "100 - (node_filesystem_avail_bytes{mountpoint=\"/\"} / node_filesystem_size_bytes{mountpoint=\"/\"} * 100)"
          }
        ],
        "gridPos": {"x": 0, "y": 8, "w": 8, "h": 8}
      },
      {
        "title": "Network Traffic",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(node_network_receive_bytes_total{device!=\"lo\"}[5m])",
            "legendFormat": "RX {{instance}}"
          },
          {
            "expr": "rate(node_network_transmit_bytes_total{device!=\"lo\"}[5m])",
            "legendFormat": "TX {{instance}}"
          }
        ],
        "gridPos": {"x": 8, "y": 8, "w": 16, "h": 8}
      }
    ]
  }
}
# Dashboard variables
# Define in dashboard settings > Variables
# Label values
label_values(node_exporter_build_info, job)
label_values(node_cpu_seconds_total{job="$job"}, instance)
# Query result
query_result(topk(5, http_requests_total))
# Variable format options (e.g. regex-escape the value)
${var:regex}

Common Dashboard Queries

# CPU by mode
avg by (mode) (irate(node_cpu_seconds_total[5m])) * 100
# Memory breakdown
node_memory_MemTotal_bytes
node_memory_MemAvailable_bytes
node_memory_MemFree_bytes
node_memory_Buffers_bytes
node_memory_Cached_bytes
# Disk I/O
rate(node_disk_reads_completed_total[5m])
rate(node_disk_writes_completed_total[5m])
rate(node_disk_read_bytes_total[5m])
rate(node_disk_written_bytes_total[5m])
# Network connections
node_netstat_Tcp_CurrEstab
node_netstat_Tcp_EstabResets
node_netstat_Tcp_PassiveOpens
node_netstat_Tcp_ActiveOpens
# Load average
node_load1
node_load5
node_load15
# Filesystem by mount
node_filesystem_size_bytes{mountpoint="/"}
node_filesystem_avail_bytes{mountpoint="/"}
# Request rate
rate(http_requests_total[5m])
# Error rate
rate(http_requests_total{status=~"5.."}[5m])
# Response time (histogram)
histogram_quantile(0.50, rate(http_request_duration_seconds_bucket[5m]))
histogram_quantile(0.90, rate(http_request_duration_seconds_bucket[5m]))
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
# Active connections
sum(http_connections{state="active"})
sum(http_connections{state="idle"})
# Queue depth
application_queue_length
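The request-rate and error-rate queries above are usually combined into an error ratio, e.g. `sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))`. A toy version over pre-computed per-status rates, for intuition:

```python
def error_ratio(rates_by_status):
    """rates_by_status: dict like {"200": 95.0, "404": 2.0, "500": 3.0} (req/s).

    Returns the fraction of requests answered with a 5xx status.
    """
    total = sum(rates_by_status.values())
    errors = sum(v for s, v in rates_by_status.items() if s.startswith("5"))
    return errors / total if total else 0.0
```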

Interview Questions

Beginner:

  1. What is Prometheus?

    • Open-source monitoring and alerting system with time-series database
  2. How does Prometheus differ from traditional monitoring?

    • Pull-based model, dimensional data model, PromQL, no reliance on distributed storage
  3. What is an exporter in Prometheus?

    • Tool that exposes existing metrics in Prometheus format
  4. What is PromQL?

    • Prometheus Query Language for querying time-series data
  5. What is the difference between push and pull monitoring?

    • Push: targets send metrics to collector; Pull: collector fetches from targets
Intermediate:

  1. Explain the Prometheus data model

    • Metrics with name, labels (key-value), timestamp, value
  2. What is the difference between gauge and counter?

    • Gauge: can go up and down; Counter: only increases
  3. How do you create an alert in Prometheus?

    • Define alert rules in rule_files, configure Alertmanager
  4. What are recording rules?

    • Pre-computed queries stored as new time series for faster queries
  5. How does service discovery work in Prometheus?

    • Automatically finds targets using DNS, Consul, Kubernetes, etc.
Advanced:

  1. What are the best practices for metric naming?

    • Prefix names with the application, use base units (seconds, bytes), add meaningful labels
  2. How do you scale Prometheus?

    • Federation, Thanos, Cortex for long-term storage and horizontal scaling
  3. Explain Prometheus high availability

    • Run multiple replicas, use same external labels, Alertmanager clustering
  4. What is the difference between histogram and summary?

    • Histogram: client fills predefined buckets, quantiles computed server-side; Summary: client computes predefined quantiles directly
  5. How do you monitor applications not exposing Prometheus metrics?

    • Use Pushgateway for batch jobs, or use appropriate exporter

Quick Reference

Prometheus:
  +---------------------+------------------------------------+
  | Port 9090           | Web UI, API                        |
  | scrape_interval     | Default 15s                        |
  | promtool            | CLI tool for validation            |
  +---------------------+------------------------------------+

Common Exporters:
  +---------------------+------------------------------------+
  | node_exporter:9100  | System metrics                     |
  | blackbox:9115       | Blackbox probing                   |
  | cadvisor:8080       | Container metrics                  |
  | postgres:9187       | PostgreSQL metrics                 |
  +---------------------+------------------------------------+

Key PromQL Functions:
  +---------------------+------------------------------------+
  | rate()              | Per-second rate                    |
  | increase()          | Total increase                     |
  | histogram_quantile()| Percentiles                        |
  | sum/avg/count       | Aggregation                        |
  | topk/bottomk        | Top/bottom values                  |
  +---------------------+------------------------------------+

Grafana:
  +---------------------+------------------------------------+
  | Port 3000           | Default web interface              |
  | Data sources        | Prometheus, Graphite, InfluxDB     |
  | Dashboards          | JSON-based, shareable              |
  +---------------------+------------------------------------+