Architecture

Chapter 3: PostgreSQL Architecture

Understanding the Internal Working of PostgreSQL

3.1 Overview of PostgreSQL Architecture

PostgreSQL follows a client-server architecture where multiple processes work together to handle database operations. Understanding this architecture is crucial for effective database administration and performance tuning.

                    PostgreSQL Architecture Overview
    ========================================================================

                                    ┌──────────────────┐
                                    │  Client         │
                                    │  Applications   │
                                    └────────┬─────────┘
                                             │ Network
                                             │ (TCP/IP)
                                             ▼
    ┌──────────────────────────────────────────────────────────────────────┐
    │                       PostgreSQL Server                               │
    │                                                                       │
    │    ┌──────────────────────────────────────────────────────────────┐ │
    │    │                    Postmaster (Parent Process)                │ │
    │    │                    Port: 5432                                  │ │
    │    │                                                               │ │
    │    │  • Listens for incoming connections                          │ │
    │    │  • Forks backend processes for each connection               │ │
    │    │  • Manages shared memory and semaphores                     │ │
    │    │  • Handles crash recovery                                    │ │
    │    └──────────────────────────────────────────────────────────────┘ │
    │                                                                       │
    │    ┌─────────────┐  ┌─────────────┐  ┌─────────────┐                 │
    │    │  Backend    │  │  Backend    │  │  Backend    │                 │
    │    │  Process 1  │  │  Process 2  │  │  Process N  │                 │
    │    │             │  │             │  │             │                 │
    │    │ - Handles   │  │ - Handles   │  │ - Handles   │                 │
    │    │   queries   │  │   queries   │  │   queries   │                 │
    │    │ - Executes  │  │ - Executes  │  │ - Executes  │                 │
    │    │   SQL       │  │   SQL       │  │   SQL       │                 │
    │    └─────────────┘  └─────────────┘  └─────────────┘                 │
    │                                                                       │
    │    ┌──────────────────────────────────────────────────────────────┐  │
    │    │                   Background Processes                        │  │
    │    │                                                               │  │
    │    │  ┌────────────┐ ┌────────────┐ ┌────────────┐ ┌────────────┐│  │
    │    │  │ Logger    │ │ Auto       │ │ WAL       │ │ Checkpoint ││  │
    │    │  │           │ │ vacuum     │ │ Writer    │ │            ││  │
    │    │  └────────────┘ └────────────┘ └────────────┘ └────────────┘│  │
    │    │                                                               │  │
    │    │  ┌────────────┐ ┌────────────┐ ┌────────────┐ ┌────────────┐│  │
    │    │  │ Background│ │ Stats      │ │ Archiver  │ │ Replication││  │
    │    │  │ Writer    │ │ Collector  │ │           │ │ Launcher  ││  │
    │    │  └────────────┘ └────────────┘ └────────────┘ └────────────┘│  │
    │    └──────────────────────────────────────────────────────────────┘  │
    │                                                                       │
    └──────────────────────────────────────────────────────────────────────┘
                                             │
                                             ▼
    ┌──────────────────────────────────────────────────────────────────────┐
    │                       Shared Memory                                   │
    │    ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐          │
    │    │ Shared   │  │  WAL    │  │  Clog    │  │  Lock    │          │
    │    │ Buffers  │  │ Buffers  │  │ Buffer   │  │ Manager  │          │
    │    └──────────┘  └──────────┘  └──────────┘  └──────────┘          │
    └──────────────────────────────────────────────────────────────────────┘
                                             │
                                             ▼
    ┌──────────────────────────────────────────────────────────────────────┐
    │                         Disk Storage                                  │
    │    ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐          │
    │    │  Data   │  │   WAL   │  │  Logs    │  │  Stats   │          │
    │    │  Files  │  │  Files   │  │          │  │  Files   │          │
    │    └──────────┘  └──────────┘  └──────────┘  └──────────┘          │
    └──────────────────────────────────────────────────────────────────────┘
    ========================================================================

3.2 Postmaster Process

The postmaster is the primary process that manages the PostgreSQL database cluster. It acts as a coordinator for all other processes.

                    Postmaster Process Role
    ========================================================================

    Postmaster Responsibilities:
    ============================

    1. Initialization
    ┌─────────────────────────────────────────────────────────────┐
    │                                                             │
    │   - Initializes shared memory structures                   │
    │   - Starts background processes                             │
    │   - Sets up signal handlers                                │
    │   - Listens on port 5432 (default)                         │
    │                                                             │
    └─────────────────────────────────────────────────────────────┘

    2. Connection Management
    ┌─────────────────────────────────────────────────────────────┐
    │                                                             │
    │   - Listens for incoming client connections                 │
    │   - Authenticates clients (via pg_hba.conf)               │
    │   - Forks backend processes for each connection            │
    │   - Manages connection limits                              │
    │                                                             │
    └─────────────────────────────────────────────────────────────┘

    3. Process Management
    ┌─────────────────────────────────────────────────────────────┐
    │                                                             │
    │   - Monitors backend processes                              │
    │   - Handles abnormal terminations                           │
    │   - Performs crash recovery                                 │
    │   - Coordinates checkpoint operations                       │
    │                                                             │
    └─────────────────────────────────────────────────────────────┘

    4. System Lifecycle
    ┌─────────────────────────────────────────────────────────────┐
    │                                                             │
    │   pg_ctl start     → Start postmaster                      │
    │   pg_ctl stop     → Graceful shutdown                      │
    │   pg_ctl restart  → Stop and start postmaster              │
    │   pg_ctl reload   → Signal postmaster to reload config    │
    │                                                             │
    └─────────────────────────────────────────────────────────────┘
    ========================================================================

3.3 Backend Process

Each client connection is handled by a separate backend process. This architecture provides isolation between connections.

                    Backend Process Architecture
    ========================================================================

    Connection Flow:
    =================

    ┌──────────────┐      ┌──────────────┐      ┌──────────────┐
    │   Client     │      │  Postmaster  │      │   Backend   │
    │ Application  │─────▶│              │─────▶│   Process   │
    └──────────────┘      └──────────────┘      └──────────────┘
                                │                     │
                                │                     │
                                │ forks               │
                                │                     │
                                ▼                     ▼
                          ┌──────────────┐      ┌──────────────┐
                          │   listens   │      │  executes    │
                          │   for new   │      │  queries     │
                          │   connections              │
                          └──────────────┘      └──────────────┘


    Backend Process Lifecycle:
    ==========================

    ┌────────────────────────────────────────────────────────────────┐
    │                                                                │
    │  START                                                         │
    │    │                                                          │
    │    ▼                                                          │
    │  ┌──────────────────────┐                                      │
    │  │ Authentication      │ ← Handshake with client             │
    │  │ Phase               │                                      │
    │  └──────────┬───────────┘                                      │
    │             │                                                  │
    │             ▼                                                  │
    │  ┌──────────────────────┐                                      │
    │  │ Session              │ ← Initialize session state         │
    │  │ Initialization       │   - Set search_path                │
    │  │                      │   - Load extensions                │
    │  └──────────┬───────────┘   - Initialize temp tables          │
    │             │                                                  │
    │             ▼                                                  │
    │  ┌──────────────────────┐                                      │
    │  │ Query Processing     │ ← Main loop:                       │
    │  │ Phase                │   - Receive query                   │
    │  │                      │   - Parse & execute                 │
    │  │                      │   - Send results                    │
    │  └──────────┬───────────┘                                      │
    │             │                                                  │
    │             ▼                                                  │
    │  ┌──────────────────────┐                                      │
    │  │ Disconnection       │ ← Client disconnects or             │
    │  │ Phase               │   connection timeout                │
    │  └──────────┬───────────┘                                      │
    │             │                                                  │
    │             ▼                                                  │
    │           END                                                   │
    │                                                                │
    └────────────────────────────────────────────────────────────────┘
    ========================================================================

3.4 Query Processing Flow

Understanding how PostgreSQL processes queries is essential for optimization and debugging.

                    Query Processing Pipeline
    ========================================================================

    SQL Query Processing Stages:
    =============================

    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
    │    SQL      │───▶│   Parser    │───▶│  Optimizer  │───▶│  Executor   │
    │   Input     │    │             │    │             │    │             │
    └─────────────┘    └─────────────┘    └─────────────┘    └─────────────┘
                             │                   │                   │
                             ▼                   ▼                   ▼
                     ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
                     │ Parse Tree  │    │  Plan Tree  │    │   Results   │
                     │             │    │             │    │             │
                     │ - SQL syntax│    │ - Query     │    │ - Row data  │
                     │ - Semantic  │    │   rewrite   │    │ - Error     │
                     │   analysis  │    │ - Plan      │    │   messages  │
                     │             │    │   selection │    │             │
                     └─────────────┘    └─────────────┘    └─────────────┘


    Detailed Query Processing:
    ===========================

    1. PARSER
    ┌─────────────────────────────────────────────────────────────┐
    │ Input:  SELECT * FROM users WHERE id = 1;                 │
    │                                                          │
    │ Parser creates Parse Tree:                                │
    │                                                          │
    │   ┌─────────────────────────────────────────────┐        │
    │   │ SELECTStmt                                   │        │
    │   │   ├── targetList: *                         │        │
    │   │   ├── fromClause: users                     │        │
    │   │   └── whereClause: id = 1                   │        │
    │   └─────────────────────────────────────────────┘        │
    │                                                          │
    │ Checks: SQL syntax, table existence, column existence   │
    └─────────────────────────────────────────────────────────────┘

    2. TRANSFORMER
    ┌─────────────────────────────────────────────────────────────┐
    │ Transforms Parse Tree into Query Tree:                     │
    │                                                          │
    │   ┌─────────────────────────────────────────────┐        │
    │   │ Query                                        │        │
    │   │   ├── resultRelation: users                 │        │
    │   │   ├── jointree: users                       │        │
    │   │   ├── targetList: id, name, email          │        │
    │   │   └── qual: id = 1                         │        │
    │   └─────────────────────────────────────────────┘        │
    │                                                          │
    │ Adds type information, resolves column types             │
    └─────────────────────────────────────────────────────────────┘

    3. REWRITER
    ┌─────────────────────────────────────────────────────────────┐
    │ Applies rules (views, rules):                             │
    │                                                          │
    │ CREATE VIEW user_summary AS                               │
    │ SELECT id, name FROM users;                              │
    │                                                          │
    │ SELECT * FROM user_summary;                              │
    │                                                          │
    │ Rewritten to:                                             │
    │ SELECT id, name FROM users;                             │
    │                                                          │
    └─────────────────────────────────────────────────────────────┘

    4. OPTIMIZER (Planner)
    ┌─────────────────────────────────────────────────────────────┐
    │ Creates optimal execution plan:                          │
    │                                                          │
    │ Consider available indexes:                              │
    │ - users_pkey (B-tree on id)                              │
    │                                                          │
    │ Possible plans:                                          │
    │ 1. Seq Scan on users, Filter: id = 1                     │
    │ 2. Index Scan using users_pkey, WHERE: id = 1           │
    │                                                          │
    │ Selects: Index Scan (cheaper)                            │
    └─────────────────────────────────────────────────────────────┘

    5. EXECUTOR
    ┌─────────────────────────────────────────────────────────────┐
    │ Executes the plan:                                        │
    │                                                          │
    │ ┌──────────────────────────────────────────────────────┐ │
    │ │ Executor Steps (Index Scan on users_pkey):            │ │
    │ │                                                       │ │
    │ │ 1. Initialize index scan                              │ │
    │ │ 2. Search index for id = 1                            │ │
    │ │ 3. Fetch heap tuple for each index entry             │ │
    │ │ 4. Apply visibility check (MVCC)                      │ │
    │ │ 5. Return matching tuples                            │ │
    │ │                                                       │ │
    │ └──────────────────────────────────────────────────────┘ │
    │                                                          │
    │ Returns: (1, 'John', 'john@example.com')                 │
    └─────────────────────────────────────────────────────────────┘
    ========================================================================

3.5 Background Processes

PostgreSQL uses multiple background processes to handle various system tasks efficiently.

                    Background Processes Overview
    ========================================================================

    ┌──────────────────────────────────────────────────────────────────────┐
    │                      Background Processes                            │
    │                                                                       │
    │  ┌──────────────────────────────────────────────────────────────────┐ │
    │  │ 1. Logger Process (logger)                                      │ │
    │  │                                                                   │ │
    │  │ • Captures stderr output from other processes                   │ │
    │  │ • Writes to PostgreSQL log files                                │ │
    │  │ • Manages log rotation                                          │ │
    │  │ • Controlled by logging_collector setting                       │ │
    │  └──────────────────────────────────────────────────────────────────┘ │
    │                                                                       │
    │  ┌──────────────────────────────────────────────────────────────────┐ │
    │  │ 2. Autovacuum Launcher                                          │ │
    │  │                                                                   │ │
    │  │ • Automatically runs VACUUM on bloated tables                   │ │
    │  │ • Runs ANALYZE to update statistics                             │ │
    │  │ • Prevents transaction ID wraparound                            │ │
    │  │ • Configurable via autovacuum_* parameters                     │ │
    │  │                                                                   │ │
    │  │   ┌──────────────────────────────────────────┐                  │ │
    │  │   │ Autovacuum Process                       │                  │ │
    │  │   │                                          │                  │ │
    │  │   │  Table: users                            │                  │ │
    │  │   │  Dead tuples: 15,000                     │                  │ │
    │  │   │  Threshold: 10,000                      │                  │ │
    │  │   │                                          │                  │ │
    │  │   │  Action: VACUUM users;                  │                  │ │
    │  │   │  Action: ANALYZE users;                 │                  │ │
    │  │   │                                          │                  │ │
    │  │   └──────────────────────────────────────────┘                  │ │
    │  └──────────────────────────────────────────────────────────────────┘ │
    │                                                                       │
    │  ┌──────────────────────────────────────────────────────────────────┐ │
    │  │ 3. WAL Writer                                                   │ │
    │  │                                                                   │ │
    │  │ • Writes WAL records to disk periodically                       │ │
    │  │ • Ensures durability of transactions                            │ │
    │  │ • Balances between performance and durability                  │ │
    │  │ • Controlled by wal_writer_delay setting                       │ │
    │  └──────────────────────────────────────────────────────────────────┘ │
    │                                                                       │
    │  ┌──────────────────────────────────────────────────────────────────┐ │
    │  │ 4. Checkpointer                                                  │ │
    │  │                                                                   │ │
    │  │ • Performs checkpoints at regular intervals                     │ │
    │  │ • Writes all dirty buffers to disk                              │ │
    │  │ • Updates pg_control file                                       │ │
    │  │ • Controlled by checkpoint_* parameters                        │ │
    │  └──────────────────────────────────────────────────────────────────┘ │
    │                                                                       │
    │  ┌──────────────────────────────────────────────────────────────────┐ │
    │  │ 5. Background Writer (bgwriter)                                 │ │
    │  │                                                                   │ │
    │  │ • Periodically writes dirty shared buffers                      │ │
    │  │ • Reduces checkpoint load                                       │ │
    │  │ • Controlled by bgwriter_* parameters                          │ │
    │  └──────────────────────────────────────────────────────────────────┘ │
    │                                                                       │
    │  ┌──────────────────────────────────────────────────────────────────┐ │
    │  │ 6. Statistics Collector (stats collector)                       │ │
    │  │                                                                   │ │
    │  │ • Collects statistics about database activity                   │ │
    │  │ • Tracks table access, index usage, row counts                  │ │
    │  │ • Used by pg_stat_* views                                       │ │
    │  └──────────────────────────────────────────────────────────────────┘ │
    │                                                                       │
    │  ┌──────────────────────────────────────────────────────────────────┐ │
    │  │ 7. WAL Archiver (archiver)                                       │ │
    │  │                                                                   │ │
    │  │ • Archives completed WAL segments                               │ │
    │  │ • Used for point-in-time recovery                              │ │
    │  │ • Controlled by archive_* parameters                           │ │
    │  └──────────────────────────────────────────────────────────────────┘ │
    │                                                                       │
    │  ┌──────────────────────────────────────────────────────────────────┐ │
    │  │ 8. Replication Launcher                                         │ │
    │  │                                                                   │ │
    │  │ • Starts WAL sender processes for streaming replication         │ │
    │  │ • Manages replication slots                                    │ │
    │  │ • Controlled by max_wal_senders parameter                      │ │
    │  └──────────────────────────────────────────────────────────────────┘ │
    │                                                                       │
    └──────────────────────────────────────────────────────────────────────┘
    ========================================================================

3.6 Memory Architecture

PostgreSQL uses both shared memory (accessible by all processes) and local memory (per-process).

                    PostgreSQL Memory Architecture
    ========================================================================

    ┌──────────────────────────────────────────────────────────────────────┐
    │                    SYSTEM MEMORY                                     │
    │                                                                       │
    │  ┌────────────────────────────────────────────────────────────────┐  │
    │  │                    SHARED MEMORY                               │  │
    │  │                                                                 │  │
    │  │  ┌────────────────┐  ┌────────────────┐  ┌────────────────┐   │  │
    │  │  │  Shared        │  │    WAL        │  │    Clog       │   │  │
    │  │  │  Buffers       │  │    Buffers     │  │    (XLOG)     │   │  │
    │  │  │                │  │                │  │                │   │  │
    │  │  │ Caches table   │  │  WAL records  │  │  Transaction  │   │  │
    │  │  │ and index data │  │  before write │  │  status bits  │   │  │
    │  │  │                │  │                │  │                │   │  │
    │  │  │ shared_buffers│  │ wal_buffers   │  │ ~2-4 MB        │   │  │
    │  │  │ Default: 128MB │  │ Default: -1   │  │                │   │  │
    │  │  │                │  │ (auto-tuned)  │  │                │   │  │
    │  │  └────────────────┘  └────────────────┘  └────────────────┘   │  │
    │  │                                                                 │  │
    │  │  ┌────────────────┐  ┌────────────────┐  ┌────────────────┐   │  │
    │  │  │    Lock        │  │  ValidF       │  │  Free Space   │   │  │
    │  │  │   Manager      │  │   (VM)         │  │    Map        │   │  │
    │  │  │                │  │                │  │                │   │  │
    │  │  │  Row-level    │  │  Visibility   │  │  Track free   │   │  │
    │  │  │  locks        │  │  map          │  │  space        │   │  │
    │  │  │                │  │                │  │                │   │  │
    │  │  │  ~2-256MB     │  │  ~1% of heap  │  │  ~1% of heap  │   │  │
    │  │  └────────────────┘  └────────────────┘  └────────────────┘   │  │
    │  │                                                                 │  │
    │  └────────────────────────────────────────────────────────────────┘  │
    │                                                                       │
    │  ┌────────────────────────────────────────────────────────────────┐  │
    │  │              LOCAL MEMORY (per backend process)                 │  │
    │  │                                                                 │  │
    │  │  ┌────────────────┐  ┌────────────────┐  ┌────────────────┐   │  │
    │  │  │   Sort/Hash   │  │   Hash Join   │  │   Bitmap      │   │  │
    │  │  │   Memory       │  │   Memory       │  │   Memory      │   │  │
    │  │  │                │  │                │  │                │   │  │
    │  │  │ work_mem: 4MB │  │ work_mem: 4MB │  │ work_mem: 4MB │   │  │
    │  │  │                │  │                │  │                │   │  │
    │  │  │ Used for:      │  │ Used for:      │  │ Used for:     │   │  │
    │  │  │ - ORDER BY    │  │ - Hash Join   │  │ - Bitmap Scan │   │  │
    │  │  │ - DISTINCT    │  │ - Hash Agg    │  │ - Index Only  │   │  │
    │  │  │ - GROUP BY    │  │ - Set Ops     │  │   Scan        │   │  │
    │  │  └────────────────┘  └────────────────┘  └────────────────┘   │  │
    │  │                                                                 │  │
    │  │  ┌────────────────┐  ┌────────────────┐                      │  │
    │  │  │  Maintenance   │  │   Logical      │                      │  │
    │  │  │  Memory        │  │   Decoding     │                      │  │
    │  │  │                │  │   Work Mem     │                      │  │
    │  │  │ maintenance   │  │                │                      │  │
    │  │  │ _work_mem: 64M │  │ ~10MB          │                      │  │
    │  │  │                │  │                │                      │  │
    │  │  │ Used for:      │  │ Used for:      │                      │  │
    │  │  │ - VACUUM      │  │ - Decode       │                      │  │
    │  │  │ - CREATE INDEX│  │   changes      │                      │  │
    │  │  │ - ALTER TABLE │  │   for          │                      │  │
    │  │  └────────────────┘  │   replication  │                      │  │
    │  │                       └────────────────┘                      │  │
    │  └────────────────────────────────────────────────────────────────┘  │
    │                                                                       │
    └──────────────────────────────────────────────────────────────────────┘
    ========================================================================

3.7 Storage Architecture

PostgreSQL stores data in files on disk with a specific organization.

                    PostgreSQL Storage Architecture
    ========================================================================

    Database Cluster Directory Structure:
    =====================================

    $PGDATA/
    ├── base/              -- Tablespace for default tablespace
    │   ├── 1/            -- Database OID 1 (template0)
    │   ├── 16384/        -- Database OID 16384 (postgres)
    │   └── ...
    │
    ├── global/           -- System catalogs (pg_database, etc.)
    │
    ├── pg_wal/           -- Write-Ahead Log (WAL)
    │   ├── 000000010000000000000001
    │   └── ...
    │
    ├── pg_xact/          -- Transaction commit log (Clog)
    │   ├── 0000
    │   └── ...
    │
    ├── pg_notify/        -- LISTEN/NOTIFY state
    │
    ├── pg_serial/        -- Serializable transaction information
    │
    ├── pg_snapshots/    -- Exported snapshots
    │
    ├── pg_subtrans/     -- Subtransaction status
    │
    ├── pg_twophase/     -- Two-phase commit state
    │
    ├── pg_dynshmem/     -- Dynamic shared memory
    │
    ├── pg_replslot/     -- Replication slots
    │
    ├── pg_tblspc/       -- Symbolic links to tablespaces
    │
    ├── pg_stat/         -- Statistics files
    │
    ├── pg_stat_tmp/    -- Statistics temporary files
    │
    ├── pg_log/         -- Log files (if logging_collector is on)
    │
    └── postgresql.conf  -- Configuration file


    Per-Database Directory (base/16384/):
    ======================================

    16384/
    ├── 1255              -- Table: users (relfilenode)
    ├── 1255_vm          -- Visibility map for users
    ├── 1255_fsm         -- Free space map for users
    ├── 1255_init        -- Initialization fork for users
    ├── 1259              -- Index: users_pkey
    ├── 1259_vm
    ├── 1259_fsm
    ├── 16385             -- Table: orders
    ├── 16386             -- Index: orders_pkey
    ├── ...               -- More tables and indexes
    └── PG_VERSION       -- PostgreSQL version file


    Page/File Structure:
    ====================

    ┌─────────────────────────────────────────────────────────────────┐
    │                    8KB Page Layout                              │
    │                                                                  │
    │  ┌──────────────────────────────────────────────────────────┐ │
    │  │ PageHeaderData (24 bytes)                                 │ │
    │  │ ├── pd_lsn: WAL location of this page                    │ │
    │  │ ├── pd_tli: Timeline ID                                   │ │
    │  │ ├── pd_flags: Page type flags                             │ │
    │  │ ├── pd_lower: Offset to free space start                  │ │
    │  │ ├── pd_upper: Offset to free space end                    │ │
    │  │ ├── pd_special: Offset to special space                  │ │
    │  │ └── pd_pagesize_version: Page size and version            │ │
    │  └──────────────────────────────────────────────────────────┘ │
    │                              │                                   │
    │                              ▼                                   │
    │  ┌──────────────────────────────────────────────────────────┐ │
    │  │ ItemIdData (each 4 bytes)                                 │ │
    │  │ ├── lp_off: Offset to tuple start                         │ │
    │  │ ├── lp_len: Length of tuple                              │ │
    │  │ └── lp_flags: Status (used/dead/normal)                   │ │
    │  └──────────────────────────────────────────────────────────┘ │
    │                              │                                   │
    │                              ▼                                   │
    │  ┌──────────────────────────────────────────────────────────┐ │
    │  │ Tuple Data (actual row data)                              │ │
    │  │                                                          │ │
    │  │ ┌──────────────────────────────────────────────────────┐ │ │
    │  │ │ HeapTupleHeaderData                                   │ │ │
    │  │ │ ├── t_xmin: Transaction that inserted this tuple    │ │ │
    │  │ │ ├── t_xmax: Transaction that deleted/updated        │ │ │
    │  │ │ ├── t_cid: Command ID within transaction             │ │ │
    │  │ │ ├── t_ctid: Current tuple ID (for versioning)       │ │ │
    │  │ │ ├── t_hoff: Offset to data                           │ │ │
    │  │ │ ├── t_bits: Null bitmap (if any columns are NULL)   │ │ │
    │  │ │ └── t_oid: Object ID (for tables with OID)          │ │ │
    │  │ └──────────────────────────────────────────────────────┘ │ │
    │  │                              │                              │ │
    │  │                              ▼                              │ │
    │  │  ┌──────────────────────────────────────────────────────┐ │ │
    │  │  │ Column Values                                        │ │ │
    │  │  │ id (4 bytes) | name (variable) | created_at (8 bytes)│ │ │
    │  │  └──────────────────────────────────────────────────────┘ │ │
    │  └──────────────────────────────────────────────────────────┘ │
    │                              │                                   │
    │                              ▼                                   │
    │  ┌──────────────────────────────────────────────────────────┐ │
    │  │ Special Space (for index-specific data)                  │ │
    │  │ For B-tree: includes tree ordering information           │ │
    │  └──────────────────────────────────────────────────────────┘ │
    │                                                                  │
    └─────────────────────────────────────────────────────────────────┘
    ========================================================================

3.8 Write-Ahead Log (WAL)

WAL is fundamental to PostgreSQL’s durability and recovery mechanisms.

                    Write-Ahead Log (WAL) System
    ========================================================================

    WAL Purpose:
    ============

    ┌─────────────────────────────────────────────────────────────────────┐
    │                    Transaction Durability                            │
    │                                                                      │
    │  1. When a transaction commits, changes are written to WAL first  │
    │  2. WAL records are flushed to disk before returning to client   │
    │  3. Data files are modified asynchronously                       │
    │  4. After crash, WAL is replayed to recover changes               │
    │                                                                      │
    └─────────────────────────────────────────────────────────────────────┘

    WAL Write Flow:
    ===============

    ┌──────────────┐      ┌──────────────┐      ┌──────────────┐
    │  Transaction │      │  WAL Buffer  │      │    WAL       │
    │   Commit     │─────▶│  (memory)    │─────▶│    Files     │
    │              │      │              │      │   (disk)     │
    └──────────────┘      └──────────────┘      └──────────────┘
                               │                         │
                               │                         │
                               │                         ▼
                               │                ┌──────────────┐
                               │                │   pg_wal/    │
                               │                │              │
                               │                │ 0000000100   │
                               │                │ 0000000100   │
                               │                │ 0000000100   │
                               │                └──────────────┘
                               │
                               │ Return "COMMIT" to client
                               │ (after WAL flush)
                               ▼
                          Success!


    Checkpoint Process:
    ===================

    ┌─────────────────────────────────────────────────────────────────────┐
    │                        Checkpoint                                   │
    │                                                                      │
    │  Every checkpoint (controlled by checkpoint_timeout):              │
    │                                                                      │
    │  1. Write all dirty buffers to data files                         │
    │  2. Write checkpoint record to WAL                                │
    │  3. Update pg_control file                                        │
    │                                                                      │
    │  pg_control contains:                                             │
    │  - Latest checkpoint location                                     │
    │  - Database cluster state                                         │
    │  - WAL segment location                                           │
    │                                                                      │
    └─────────────────────────────────────────────────────────────────────┘

    Recovery Process:
    =================

    ┌─────────────────────────────────────────────────────────────────────┐
    │                     Crash Recovery Flow                             │
    │                                                                      │
    │  Start: PostgreSQL detects unclean shutdown                       │
    │    │                                                                │
    │    ▼                                                                │
    │  ┌──────────────────────────────────────────────────────────────┐ │
    │  │ 1. Read pg_control file                                      │ │
    │  │    - Find location of last checkpoint                        │ │
    │  └──────────────────────────────────────────────────────────────┘ │
    │    │                                                                │
    │    ▼                                                                │
    │  ┌──────────────────────────────────────────────────────────────┐ │
    │  │ 2. Scan WAL from checkpoint                                   │ │
    │  │    - Redo all transactions marked as committed               │ │
    │  │    - Undo all transactions marked as aborted                 │ │
    │  │                                                               │ │
    │  │  ┌────────────────────────────────────────────────────────┐ │ │
    │  │  │ WAL Record Types:                                        │ │ │
    │  │  │  - XLOG::Transaction::COMMIT                             │ │ │
    │  │  │  - XLOG::Transaction::ABORT                             │ │ │
    │  │  │  - XLOG::Heap::INSERT                                   │ │ │
    │  │  │  - XLOG::Heap::UPDATE                                    │ │ │
    │  │  │  - XLOG::Heap::DELETE                                    │ │ │
    │  │  │  - XLOG::Btree::INSERT                                  │ │ │
    │  │  └────────────────────────────────────────────────────────┘ │ │
    │  └──────────────────────────────────────────────────────────────┘ │
    │    │                                                                │
    │    ▼                                                                │
    │  ┌──────────────────────────────────────────────────────────────┐ │
    │  │ 3. Open database for connections                             │ │
    │  │    - Database is now consistent                             │ │
    │  └──────────────────────────────────────────────────────────────┘ │
    │    │                                                                │
    │    ▼                                                                │
    │  Ready!                                                             │
    │                                                                      │
    └─────────────────────────────────────────────────────────────────────┘
    ========================================================================

3.9 MVCC Implementation

PostgreSQL uses Multi-Version Concurrency Control to allow concurrent transactions without locking.

                    MVCC Implementation in PostgreSQL
    ========================================================================

    How MVCC Works:
    ===============

    ┌─────────────────────────────────────────────────────────────────────┐
    │                    Tuple Version Chain                              │
    │                                                                      │
    │  Each tuple (row) has two transaction IDs:                        │
    │                                                                      │
    │  ┌─────────────────────────────────────────────────────────────┐  │
    │  │  t_xmin: Transaction that INSERTED this tuple              │  │
    │  │  t_xmax: Transaction that DELETED/UPDATED this tuple        │  │
    │  │  t_cid: Command ID within the transaction                   │  │
    │  │  t_ctid: Pointer to newer version of this tuple             │  │
    │  └─────────────────────────────────────────────────────────────┘  │
    │                                                                      │
    │  Example: Update a row                                             │
    │                                                                      │
    │  Before:                                                           │
    │  ┌──────────────────────────────────────────────────────────┐     │
    │  │ users (id=1, name='Alice', t_xmin=100, t_xmax=0)       │     │
    │  └──────────────────────────────────────────────────────────┘     │
    │                                                                      │
    │  After UPDATE:                                                     │
    │  ┌──────────────────────────────────────────────────────────┐     │
    │  │ users (id=1, name='Alice', t_xmin=100, t_xmax=101)      │     │
    │  │          │                             │                  │     │
    │  │          │ t_ctid ────────────────────►│                  │     │
    │  │          │                             │                  │     │
    │  │ users (id=1, name='Alice Smith', t_xmin=101, t_xmax=0) │     │
    │  └──────────────────────────────────────────────────────────┘     │
    │                                                                      │
    └─────────────────────────────────────────────────────────────────────┘

    Visibility Check:
    =================

    ┌─────────────────────────────────────────────────────────────────────┐
    │                 Transaction Visibility Rules                        │
    │                                                                      │
    │  A tuple is visible to transaction T if:                          │
    │                                                                      │
    │  1. t_xmin is COMMITTED and NOT in T's snapshot                   │
    │     AND                                                            │
    │  2. t_xmax is neither COMMITTED nor valid (not updated/deleted)  │
    │     OR t_xmax is in T's snapshot (T updated it itself)           │
    │                                                                      │
    └─────────────────────────────────────────────────────────────────────┘

    Transaction Isolation Levels:
    =============================

    ┌─────────────────────────────────────────────────────────────────────┐
    │     Isolation Level          │   Behavior                          │
    ├──────────────────────────────┼────────────────────────────────────┤
    │ READ UNCOMMITTED (not         │ Can see uncommitted changes       │
    │ actually implemented)        │ (rarely used in PG)              │
    ├──────────────────────────────┼────────────────────────────────────┤
    │ READ COMMITTED (default)     │ See only committed changes        │
    │                              │ from other transactions            │
    ├──────────────────────────────┼────────────────────────────────────┤
    │ REPEATABLE READ              │ See consistent snapshot as of     │
    │                              │ transaction start                 │
    ├──────────────────────────────┼────────────────────────────────────┤
    │ SERIALIZABLE                 │ Full MVCC isolation, ensures      │
    │                              │ equivalent to serial execution    │
    └──────────────────────────────┴────────────────────────────────────┘

    Example: READ COMMITTED vs REPEATABLE READ
    ==========================================

    ┌─────────────────────────────────────────────────────────────────────┐
    │ Transaction A (READ COMMITTED)                                     │
    │ ─────────────────────────────────────                              │
    │ BEGIN;                                                              │
    │ SELECT * FROM users WHERE id = 1;  -- sees Alice                  │
    │                                         ┌─────────────────────────┐ │
    │                                         │ Transaction B          │ │
    │                                         │ ─────────────────────── │ │
    │                                         │ UPDATE users          │ │
    │                                         │ SET name = 'Bob'       │ │
    │                                         │ WHERE id = 1;          │ │
    │                                         │ COMMIT;                │ │
    │                                         └─────────────────────────┘ │
    │ SELECT * FROM users WHERE id = 1;  -- now sees Bob!              │
    └─────────────────────────────────────────────────────────────────────┘

    ┌─────────────────────────────────────────────────────────────────────┐
    │ Transaction A (REPEATABLE READ)                                    │
    │ ─────────────────────────────────────                              │
    │ BEGIN;                                                              │
    │ SELECT * FROM users WHERE id = 1;  -- sees Alice                 │
    │                                         ┌─────────────────────────┐ │
    │                                         │ Transaction B          │ │
    │                                         │ ─────────────────────── │ │
    │                                         │ UPDATE users           │ │
    │                                         │ SET name = 'Bob'       │ │
    │                                         │ WHERE id = 1;          │ │
    │                                         │ COMMIT;                │ │
    │                                         └─────────────────────────┘ │
    │ SELECT * FROM users WHERE id = 1;  -- still sees Alice!         │
    └─────────────────────────────────────────────────────────────────────┘
    ========================================================================

3.10 Best Practices

                    Architecture Best Practices
    ========================================================================

    1. Memory Configuration
       ┌──────────────────────────────────────────────────────────────┐
       │  • Set shared_buffers to 25% of available RAM               │
       │  • Adjust work_mem based on query complexity                 │
       │  • Use maintenance_work_mem for VACUUM and INDEX           │
       └──────────────────────────────────────────────────────────────┘

    2. WAL Configuration
       ┌──────────────────────────────────────────────────────────────┐
       │  • Use dedicated disk for WAL (SSD recommended)             │
       │  • Set appropriate checkpoint intervals                     │
       │  • Enable WAL archiving for production                     │
       └──────────────────────────────────────────────────────────────┘

    3. Process Management
       ┌──────────────────────────────────────────────────────────────┐
       │  • Configure max_connections appropriately                  │
       │  • Use connection pooler for high traffic                   │
       │  • Monitor backend process count                            │
       └──────────────────────────────────────────────────────────────┘

    4. Monitoring
       ┌──────────────────────────────────────────────────────────────┐
       │  • Enable pg_stat_statements extension                      │
       │  • Monitor shared buffer hit ratio                          │
       │  • Track query execution times                              │
       └──────────────────────────────────────────────────────────────┘
    ========================================================================

Next Chapter

Chapter 4: Database Objects & Concepts

Last Updated: February 2026