Chapter 39: Prometheus and Grafana
Modern Monitoring and Visualization
39.1 Prometheus Architecture
Understanding Prometheus
Prometheus is an open-source monitoring system with a dimensional data model, a flexible query language (PromQL), efficient time-series storage, and modern alerting capabilities. It is designed for reliability and has become the de facto standard for cloud-native monitoring.
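The pull model at the heart of this architecture is simple enough to demonstrate with nothing but the Python standard library: a toy exporter serves plain-text metrics in the Prometheus exposition format over HTTP, and a scraper (here `urllib`, standing in for Prometheus) fetches them. The metric name and value are invented for illustration.

```python
import http.server
import threading
import urllib.request

# Plain-text metrics in the Prometheus exposition format.
# Metric name and value are made up for this demo.
METRICS = (
    "# HELP myapp_requests_total Total requests handled\n"
    "# TYPE myapp_requests_total counter\n"
    "myapp_requests_total 1027\n"
)

class MetricsHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/metrics":
            body = METRICS.encode()
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, *args):  # keep the demo quiet
        pass

def scrape(port):
    # Prometheus performs exactly this kind of HTTP GET every scrape_interval.
    with urllib.request.urlopen(f"http://127.0.0.1:{port}/metrics") as resp:
        return resp.read().decode()

if __name__ == "__main__":
    server = http.server.HTTPServer(("127.0.0.1", 0), MetricsHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    print(scrape(server.server_address[1]))
    server.shutdown()
```

Everything in the stack below builds on this handshake: exporters expose `/metrics`, Prometheus scrapes them on a schedule.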
Prometheus Monitoring Stack

```
  +----------+  +----------+  +----------+  +-----------+
  |   Node   |  |  Docker  |  |  Custom  |  | Blackbox  |
  | Exporter |  | Exporter |  | Exporter |  | Exporter  |
  +----+-----+  +----+-----+  +----+-----+  +-----+-----+
       |             |             |              |
       v             v             v              v
  +---------------------------------------------------+
  |                Prometheus Server                  |
  |   - Scrapes targets at intervals                  |
  |   - Stores time-series data                       |
  |   - Evaluates alerting rules                      |
  |   - Provides HTTP API                             |
  +------------+-----------------------+--------------+
               |                       |
               v                       v
  +-------------------+      +------------------+
  |      Grafana      |      |   Alertmanager   |
  |  (Visualization)  |      |     (Alerts)     |
  +-------------------+      +---------+--------+
                                       |
                                       v
                             +------------------+
                             |   Notification   |
                             |  (Email, Slack,  |
                             |    PagerDuty)    |
                             +------------------+
```

Data flow:
1. Exporters expose metrics on HTTP endpoints
2. Prometheus scrapes those endpoints at configured intervals
3. Metrics are stored in the time-series database
4. Alertmanager handles alert routing and notification
5. Grafana visualizes the data

Installation
```shell
# Option 1: Binary download
wget https://github.com/prometheus/prometheus/releases/download/v2.47.0/prometheus-2.47.0.linux-amd64.tar.gz
tar xzf prometheus-*.tar.gz
cd prometheus-*

# Option 2: Distribution package
# Debian/Ubuntu (prometheus is available in the default repositories)
sudo apt update && sudo apt install prometheus

# RHEL/CentOS: there is no official Prometheus RPM repository;
# use the binary tarball above, or a community/EPEL package if
# your organization provides one
```

Creating Prometheus User and Directories
```shell
# Create a system user
sudo useradd --no-create-home --shell /usr/sbin/nologin prometheus

# Create directories
sudo mkdir -p /etc/prometheus /var/lib/prometheus
sudo chown prometheus:prometheus /var/lib/prometheus

# Copy binaries
sudo cp prometheus promtool /usr/local/bin/
sudo chown prometheus:prometheus /usr/local/bin/prometheus /usr/local/bin/promtool
```

39.2 Prometheus Configuration
Main Configuration File
```yaml
# /etc/prometheus/prometheus.yml
global:
  scrape_interval: 15s        # How often to scrape targets
  evaluation_interval: 15s    # How often to evaluate rules
  external_labels:            # Labels added to all metrics
    cluster: 'production'
    environment: 'us-east-1'

# Alerting configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

# Rule files
rule_files:
  - "/etc/prometheus/rules/*.yml"

# Scrape configurations
scrape_configs:
  # Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
        labels:
          service: 'prometheus'

  # Node Exporter (system metrics)
  - job_name: 'node'
    scrape_interval: 10s
    static_configs:
      - targets: ['localhost:9100']
        labels:
          service: 'node-exporter'

  # Custom application metrics
  - job_name: 'custom-app'
    metrics_path: '/metrics'
    scrape_interval: 30s
    static_configs:
      - targets: ['app-server:8080']
        labels:
          app: 'myapplication'
          env: 'production'

  # Blackbox exporter
  - job_name: 'blackbox'
    metrics_path: '/probe'
    params:
      module: [http_2xx]
    static_configs:
      - targets: ['https://example.com', 'https://api.example.com']
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - source_labels: [__address__]
        target_label: __address__
        replacement: 'blackbox-exporter:9115'
```

Service File
```ini
# /etc/systemd/system/prometheus.service
[Unit]
Description=Prometheus Monitoring System
Documentation=https://prometheus.io/docs/
After=network.target

[Service]
Type=simple
User=prometheus
Group=prometheus
ExecStart=/usr/local/bin/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus \
  --storage.tsdb.retention.time=15d \
  --web.console.libraries=/usr/local/share/prometheus/console_libraries \
  --web.console.templates=/usr/local/share/prometheus/consoles \
  --web.enable-lifecycle
Restart=on-failure
RestartSec=5s

[Install]
WantedBy=multi-user.target
```

39.3 Exporters
Node Exporter (System Metrics)
```shell
# Download and install
wget https://github.com/prometheus/node_exporter/releases/download/v1.6.1/node_exporter-1.6.1.linux-amd64.tar.gz
tar xzf node_exporter-*.tar.gz
sudo cp node_exporter-*/node_exporter /usr/local/bin/
```
```ini
# /etc/systemd/system/node_exporter.service
[Unit]
Description=Node Exporter
After=network.target

[Service]
Type=simple
User=prometheus
ExecStart=/usr/local/bin/node_exporter
Restart=always
RestartSec=5s

[Install]
WantedBy=multi-user.target
```
```shell
# Enable specific collectors (cpu, meminfo, and diskstats are on by default)
/usr/local/bin/node_exporter --collector.cpu --collector.meminfo --collector.diskstats

# Disable a collector
/usr/local/bin/node_exporter --no-collector.arp
```

Common Exporters
Commonly used exporters and their default ports:

```
Exporter                 Port   Metrics
----------------------   ----   ---------------------------
node_exporter            9100   CPU, memory, disk, network
blackbox_exporter        9115   HTTP, DNS, TCP, ICMP probes
cadvisor                 8080   Docker containers
postgres_exporter        9187   PostgreSQL
mysqld_exporter          9104   MySQL
redis_exporter           9121   Redis
nginx_exporter           9113   Nginx
elasticsearch_exporter   9114   Elasticsearch
jmx_exporter             9404   Java/JVM
haproxy_exporter         9101   HAProxy
windows_exporter         9182   Windows hosts
snmp_exporter            9116   SNMP devices
```

Custom Exporter Example
```python
#!/usr/bin/env python3
from prometheus_client import start_http_server, Counter, Gauge
import random
import time

# Define metrics: a counter for a monotonically increasing total,
# gauges for values that move up and down
requests_total = Counter('myapp_requests', 'Total requests')  # exposed as myapp_requests_total
response_time = Gauge('myapp_response_time_ms', 'Response time in ms')
active_users = Gauge('myapp_active_users', 'Active users')

def collect_metrics():
    # Simulate metric collection
    requests_total.inc(random.randint(1, 10))
    response_time.set(random.randint(10, 500))
    active_users.set(random.randint(100, 1000))

if __name__ == '__main__':
    # Serve metrics on http://localhost:8000/metrics
    start_http_server(8000)
    print("Exporter running on port 8000")

    while True:
        collect_metrics()
        time.sleep(15)
```

39.4 PromQL (Prometheus Query Language)
Basic Queries
```promql
# Per-target health (1 = up, 0 = down)
up

# Label matching
up{job="node"}
up{job="node", instance="localhost:9100"}

# CPU idle time (from node_exporter)
node_cpu_seconds_total{mode="idle"}

# CPU usage percentage
100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory usage percentage
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100

# Disk usage percentage
100 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} * 100)

# Network traffic (bytes per second)
rate(node_network_receive_bytes_total[5m])
rate(node_network_transmit_bytes_total[5m])
```

Functions
```promql
# rate() - per-second average rate over the window
rate(http_requests_total[5m])

# increase() - total increase over the window
increase(http_requests_total[1h])
```
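The behavior of `rate()` on counters, including its handling of counter resets, can be sketched in a few lines of Python. This is a simplified model: real PromQL also extrapolates to the edges of the window, which this toy version skips.

```python
def simple_rate(samples):
    """Per-second rate over a list of (timestamp, counter_value) samples.

    Toy version of PromQL's rate(): a counter reset (value going down,
    e.g. after a process restart) is treated as a restart from zero.
    """
    if len(samples) < 2:
        return 0.0
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # On reset, the increase since the last sample is the new value itself
        total += cur - prev if cur >= prev else cur
    duration = samples[-1][0] - samples[0][0]
    return total / duration

# 60s window: counter climbs, resets (restart), climbs again
samples = [(0, 100), (15, 160), (30, 220), (45, 20), (60, 80)]
print(simple_rate(samples))  # (60 + 60 + 20 + 60) / 60 ≈ 3.33 requests/sec
```

This is also why dashboards graph `rate(http_requests_total[5m])` rather than the raw counter: the counter's absolute value is meaningless across restarts, but its rate is not.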
```promql
# count - number of series
count(up)

# avg/max/min
avg(cpu_usage)
max(cpu_usage)
min(cpu_usage)

# sum
sum(rate(http_requests_total[5m]))

# topk/bottomk
topk(5, http_requests_total)
bottomk(5, response_time_ms)

# histogram_quantile - percentiles from histogram buckets
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
```
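`histogram_quantile()` estimates percentiles by linear interpolation inside cumulative buckets. A toy reimplementation makes the mechanics concrete (the bucket bounds and counts below are invented):

```python
def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count), sorted by bound and
    ending with float('inf') -- like Prometheus's le="..." buckets.
    Uses linear interpolation within the target bucket, as PromQL does.
    """
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                return prev_bound  # cannot interpolate into the +Inf bucket
            width = bound - prev_bound
            in_bucket = count - prev_count
            return prev_bound + width * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count

# 100 requests: 50 under 0.1s, 90 under 0.5s, 99 under 1s
buckets = [(0.1, 50), (0.5, 90), (1.0, 99), (float("inf"), 100)]
print(histogram_quantile(0.95, buckets))  # falls in the (0.5, 1.0] bucket, ≈ 0.778
```

The interpolation is why bucket boundaries matter: a 95th percentile landing in a wide bucket is only a rough estimate.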
```promql
# Aggregate by label
avg by (job) (cpu_usage)
sum by (service) (requests_total)
```
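A `by` clause is just a group-by over label values. A minimal Python analogue of `sum by (job)` (sample data invented) shows the idea:

```python
from collections import defaultdict

# Labeled samples, like an instant-vector result: ({labels}, value)
series = [
    ({"job": "api", "instance": "a:8080"}, 3.0),
    ({"job": "api", "instance": "b:8080"}, 5.0),
    ({"job": "web", "instance": "c:8080"}, 2.0),
]

def sum_by(label, series):
    """PromQL's `sum by (label) (...)` is a plain group-by-and-sum."""
    out = defaultdict(float)
    for labels, value in series:
        out[labels[label]] += value
    return dict(out)

print(sum_by("job", series))  # {'api': 8.0, 'web': 2.0}
```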
```promql
# Aggregate over all labels except the listed ones
avg without (instance) (cpu_usage)
```

Aggregation Operators
```promql
# Sum
sum(requests_total)

# Count
count(up)

# Min/Max
min(node_memory_MemAvailable_bytes)
max(node_memory_MemTotal_bytes)

# Avg
avg(cpu_usage)

# Group
group(up{job="prometheus"})

# Standard deviation / variance
stddev(response_time)
stdvar(response_time)

# TopK/BottomK
topk(10, cpu_usage)
bottomk(10, cpu_usage)

# Count series per value of a label
count_values("version", build_version)
```

39.5 Alerting
Alert Rules
```yaml
# /etc/prometheus/rules/node_alerts.yml
groups:
  - name: node_alerts
    interval: 30s
    rules:
      # High CPU alert
      - alert: HighCPU
        expr: 100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is above 80% for more than 5 minutes"

      # High memory alert
      - alert: HighMemory
        expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is above 85%"

      # Disk space alert
      - alert: DiskSpace
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} * 100) < 15
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Disk space is below 15%"

      # Instance down
      - alert: InstanceDown
        expr: up == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
          description: "{{ $labels.job }} target has been down for more than 2 minutes"

      # High request rate
      - alert: HighRequestRate
        expr: rate(http_requests_total[5m]) > 1000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High request rate on {{ $labels.instance }}"
```

Alertmanager Configuration
```yaml
# /etc/alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alerts@example.com'
  smtp_auth_username: 'alerts'
  smtp_auth_password: 'password'

route:
  group_by: ['alertname', 'cluster']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'team-notifications'
  routes:
    - match:
        severity: critical
      receiver: 'critical-alerts'
      continue: true
    - match:
        service: database
      receiver: 'dba-team'

receivers:
  - name: 'team-notifications'
    email_configs:
      - to: 'team@example.com'
        send_resolved: true
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/xxx'
        channel: '#alerts'
        send_resolved: true
    webhook_configs:
      - url: 'https://webhook.example.com/alerts'

  - name: 'critical-alerts'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/xxx'
        channel: '#critical'
        send_resolved: true

  - name: 'dba-team'
    email_configs:
      - to: 'dba@example.com'
```

39.6 Grafana Installation and Configuration
Installation
```shell
# Debian/Ubuntu
wget -q -O /usr/share/keyrings/grafana.asc https://packages.grafana.com/gpg.key
echo "deb [signed-by=/usr/share/keyrings/grafana.asc] https://packages.grafana.com/oss/deb stable main" | sudo tee /etc/apt/sources.list.d/grafana.list
sudo apt update
sudo apt install grafana

# RHEL/CentOS
sudo tee /etc/yum.repos.d/grafana.repo <<EOF
[grafana]
name=Grafana
baseurl=https://packages.grafana.com/oss/rpm
repo_gpgcheck=1
enabled=1
gpgcheck=1
gpgkey=https://packages.grafana.com/gpg.key
EOF
sudo dnf install grafana

# Start the service
sudo systemctl enable --now grafana-server
sudo systemctl status grafana-server
```

Configuration
```ini
# /etc/grafana/grafana.ini
[server]
protocol = http
http_addr = 0.0.0.0
http_port = 3000
domain = grafana.example.com
root_url = %(protocol)s://%(domain)s/grafana

[database]
type = sqlite3
path = /var/lib/grafana/grafana.db

[security]
admin_user = admin
admin_password = changeme    ; change this before exposing Grafana
secret_key = SW2YcwTIb9zpOOhoPsMm

[users]
allow_sign_up = false
allow_org_create = false
default_role = viewer

[auth.anonymous]
enabled = false

[log]
mode = console
level = info

[log.console]
level = info
format = text
```

39.7 Grafana Dashboards
Creating Dashboards
```json
{
  "dashboard": {
    "title": "System Overview",
    "tags": ["system", "overview"],
    "timezone": "browser",
    "panels": [
      {
        "title": "CPU Usage",
        "type": "graph",
        "targets": [
          {
            "expr": "100 - (avg by (instance) (irate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
            "legendFormat": "{{instance}}"
          }
        ],
        "gridPos": {"x": 0, "y": 0, "w": 12, "h": 8}
      },
      {
        "title": "Memory Usage",
        "type": "graph",
        "targets": [
          {
            "expr": "(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100",
            "legendFormat": "{{instance}}"
          }
        ],
        "gridPos": {"x": 12, "y": 0, "w": 12, "h": 8}
      },
      {
        "title": "Disk Usage",
        "type": "gauge",
        "targets": [
          {
            "expr": "100 - (node_filesystem_avail_bytes{mountpoint=\"/\"} / node_filesystem_size_bytes{mountpoint=\"/\"} * 100)"
          }
        ],
        "gridPos": {"x": 0, "y": 8, "w": 8, "h": 8}
      },
      {
        "title": "Network Traffic",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(node_network_receive_bytes_total{device!=\"lo\"}[5m])",
            "legendFormat": "RX {{instance}}"
          },
          {
            "expr": "rate(node_network_transmit_bytes_total{device!=\"lo\"}[5m])",
            "legendFormat": "TX {{instance}}"
          }
        ],
        "gridPos": {"x": 8, "y": 8, "w": 16, "h": 8}
      }
    ]
  }
}
```

Variables
```promql
# Dashboard variables
# Define in Dashboard settings > Variables

# Label values
label_values(node_exporter_build_info, job)
label_values(node_cpu_seconds_total{job="$job"}, instance)

# All label names
label_names()

# Query result (raw samples from an instant query)
query_result(up{job="node"})
```
```promql
# Variable formatting in queries
${var:regex}
```

39.8 Useful PromQL Queries
System Metrics Queries
```promql
# CPU time share by mode
avg by (mode) (irate(node_cpu_seconds_total[5m])) * 100

# Memory breakdown
node_memory_MemTotal_bytes
node_memory_MemAvailable_bytes
node_memory_MemFree_bytes
node_memory_Buffers_bytes
node_memory_Cached_bytes
```
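Note that MemAvailable, not MemFree, is the right basis for usage calculations, since it includes reclaimable cache. Applying the memory-usage formula from 39.4 to sample values (the byte counts here are invented):

```python
# Sample values (bytes) standing in for the node_exporter gauges above
mem_total = 16 * 1024**3       # node_memory_MemTotal_bytes
mem_available = 4 * 1024**3    # node_memory_MemAvailable_bytes

# Same arithmetic as the PromQL memory-usage expression
usage_pct = (mem_total - mem_available) / mem_total * 100
print(f"{usage_pct:.1f}% used")  # 75.0% used
```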
```promql
# Disk I/O
rate(node_disk_reads_completed_total[5m])
rate(node_disk_writes_completed_total[5m])
rate(node_disk_read_bytes_total[5m])
rate(node_disk_written_bytes_total[5m])

# Network connections
node_netstat_Tcp_CurrEstab
node_netstat_Tcp_EstabResets
node_netstat_Tcp_PassiveOpens
node_netstat_Tcp_ActiveOpens

# Load average
node_load1
node_load5
node_load15

# Filesystem by mount point
node_filesystem_size_bytes{mountpoint="/"}
node_filesystem_avail_bytes{mountpoint="/"}
```

Application Metrics
```promql
# Request rate
rate(http_requests_total[5m])

# 5xx error rate
rate(http_requests_total{status=~"5.."}[5m])
```
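A raw 5xx rate is hard to set thresholds for on its own; in practice the error *ratio* is usually more useful (same metric names as above):

```promql
# Errors as a percentage of all requests
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m])) * 100
```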
```promql
# Response time percentiles (from a histogram)
histogram_quantile(0.50, rate(http_request_duration_seconds_bucket[5m]))
histogram_quantile(0.90, rate(http_request_duration_seconds_bucket[5m]))
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))

# Active connections
sum(http_connections{state="active"})
sum(http_connections{state="idle"})

# Queue depth
application_queue_length
```

39.9 Interview Questions
Basic Questions
Section titled “Basic Questions”-
What is Prometheus?
- Open-source monitoring and alerting system with time-series database
-
How does Prometheus differ from traditional monitoring?
- Pull-based model, dimensional data model, PromQL, no reliance on distributed storage
-
What is an exporter in Prometheus?
- Tool that exposes existing metrics in Prometheus format
-
What is PromQL?
- Prometheus Query Language for querying time-series data
-
What is the difference between push and pull monitoring?
- Push: targets send metrics to collector; Pull: collector fetches from targets
Intermediate Questions
1. Explain the Prometheus data model
   - Each sample has a metric name, labels (key-value pairs), a timestamp, and a value

2. What is the difference between a gauge and a counter?
   - A gauge can go up and down; a counter only increases (and resets to zero on restart)

3. How do you create an alert in Prometheus?
   - Define alert rules in files listed under rule_files, then configure Alertmanager for routing and notification

4. What are recording rules?
   - Pre-computed queries stored as new time series, so dashboards and alerts can query them faster

5. How does service discovery work in Prometheus?
   - Prometheus discovers scrape targets automatically via DNS, Consul, Kubernetes, EC2, and other mechanisms
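The recording-rules answer above can be made concrete with a minimal rules file. The rule name follows the common level:metric:operations convention and is illustrative:

```yaml
# /etc/prometheus/rules/recording.yml
groups:
  - name: cpu_recording
    interval: 30s
    rules:
      - record: instance:cpu_usage:percent
        expr: 100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
```

Dashboards can then query `instance:cpu_usage:percent` directly instead of re-evaluating the full expression on every refresh.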
Advanced Questions
1. What are the best practices for metric naming?
   - Prefix with the application name, use base units (seconds, bytes), and keep label cardinality low

2. How do you scale Prometheus?
   - Federation, or Thanos/Cortex for long-term storage and horizontal scaling

3. Explain Prometheus high availability
   - Run identical replicas with the same scrape configuration, distinguish them with external labels, and cluster Alertmanager to deduplicate notifications

4. What is the difference between a histogram and a summary?
   - Histogram: predefined buckets, with quantiles computed server-side via histogram_quantile(); summary: quantiles computed client-side, which cannot be aggregated across instances

5. How do you monitor applications that don't expose Prometheus metrics?
   - Use the Pushgateway for short-lived batch jobs, or an appropriate exporter
Summary
Quick Reference

```
Prometheus:
  Port 9090              Web UI, API
  scrape_interval        Default 15s
  promtool               CLI tool for validation

Common exporters:
  node_exporter:9100     System metrics
  blackbox:9115          Blackbox probing
  cadvisor:8080          Container metrics
  postgres:9187          PostgreSQL metrics

Key PromQL functions:
  rate()                 Per-second rate
  increase()             Total increase
  histogram_quantile()   Percentiles
  sum/avg/count          Aggregation
  topk/bottomk           Top/bottom values

Grafana:
  Port 3000              Default web interface
  Data sources           Prometheus, Graphite, InfluxDB
  Dashboards             JSON-based, shareable
```