Storage Engine

GalaxDB's storage engine is a custom LSM-tree implementation in Rust that unifies transactional row storage, columnar analytics, and vector similarity search. Every design decision is backed by a research paper.

Architecture

┌─────────────────────────────────────────────────────┐
│                   Write Path                         │
│  Client → WAL (LZ4 + XXH3-64) → Memtable (SkipMap) │
│           ↓ (>1KB values)                            │
│         Blob Log (content-addressed)                 │
├─────────────────────────────────────────────────────┤
│                   Flush Path                         │
│  Sealed Memtable → PAX Blocks → SST Files → NVMe    │
│  (AES-256-GCM encryption before write)               │
├─────────────────────────────────────────────────────┤
│                   Read Path                          │
│  ART Index → Bloom Filter → Buffer Pool → PAX Block  │
│  (HotSet for OLTP, ScanBuffer for OLAP)              │
├─────────────────────────────────────────────────────┤
│                   Background                         │
│  Lazy Leveling Compaction + MVCC GC                  │
│  RateLimiter (token bucket) + WriteController        │
│  Disk-full handler (32MB reserve file)               │
└─────────────────────────────────────────────────────┘

PAX Block Format

All data in GalaxDB is stored in PAX (Partition Attributes Across) blocks - a columnar format that stores each column's values contiguously within a block. This enables efficient analytical scans (only read the columns you need) while keeping row data co-located for OLTP point lookups.

┌─────────────────────────────────────────┐
│ Header: magic (0x47414C41), block_id,   │
│   row_count, column descriptors,        │
│   zone maps (min/max per column)        │
├─────────────────────────────────────────┤
│ Column 0: FastPFOR (delta + bitpack)    │
│ Column 1: Zstandard L3                  │
│ Column 2: Raw (embeddings)              │
├─────────────────────────────────────────┤
│ Row Offset Table                        │
├─────────────────────────────────────────┤
│ Footer: XXH3-64 checksum               │
└─────────────────────────────────────────┘

Compression by column type:

Fixed-width integers: delta encoding + bit-packing (FastPFOR) - ~4× compression
Variable-width (TEXT, BLOB): Zstandard level 3 - ~3–5× compression
Embedding columns: no compression (quantization handles it at the HNSW layer)
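
The integer path can be illustrated with a minimal sketch of plain delta encoding followed by a bit-width calculation. This is not FastPFOR itself (which adds patched exceptions and SIMD-friendly layouts) and not GalaxDB's code; the function names are hypothetical, and the first value is assumed to be stored separately as a base.

```rust
/// Delta-encode a slice: element 0 is kept as-is (the base), every later
/// element becomes the difference from its predecessor.
/// Wrapping arithmetic keeps the transform lossless even on overflow.
fn delta_encode(values: &[u64]) -> Vec<u64> {
    let mut prev = 0u64;
    values
        .iter()
        .map(|&v| {
            let d = v.wrapping_sub(prev);
            prev = v;
            d
        })
        .collect()
}

/// Inverse of `delta_encode`: a running (wrapping) prefix sum.
fn delta_decode(deltas: &[u64]) -> Vec<u64> {
    let mut acc = 0u64;
    deltas
        .iter()
        .map(|&d| {
            acc = acc.wrapping_add(d);
            acc
        })
        .collect()
}

/// Minimum number of bits that can represent every value (at least 1).
fn bit_width(values: &[u64]) -> u32 {
    let max = values.iter().copied().max().unwrap_or(0);
    (64 - max.leading_zeros()).max(1)
}

/// Bytes needed to bit-pack `n` values at `width` bits each, rounded up.
fn packed_bytes(n: usize, width: u32) -> usize {
    (n * width as usize + 7) / 8
}
```

For the values 1000, 1003, 1007, 1008, 1015, each raw value needs 10 bits, while the four deltas after the base (3, 4, 1, 7) fit in 3 bits each. The actual ratio depends on how close neighboring values are.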

Zone maps (min/max per column) enable predicate pushdown - the query planner skips entire blocks that can't satisfy a WHERE clause.
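
A minimal sketch of block skipping with zone maps, assuming a single integer column and an inclusive range predicate. The types `ZoneMap` and `RangePredicate` are hypothetical names for illustration, not GalaxDB's planner API.

```rust
/// Min/max summary for one column of one block (inclusive bounds).
#[derive(Debug, Clone, Copy)]
struct ZoneMap {
    min: i64,
    max: i64,
}

/// A simple range predicate: `lo <= column <= hi`.
#[derive(Debug, Clone, Copy)]
struct RangePredicate {
    lo: i64,
    hi: i64,
}

impl ZoneMap {
    /// True if some value in [min, max] could satisfy the predicate.
    /// False means the block can be skipped without reading its columns.
    fn may_match(&self, p: &RangePredicate) -> bool {
        self.min <= p.hi && self.max >= p.lo
    }
}

/// Return the indices of the blocks that still have to be read.
fn blocks_to_scan(zones: &[ZoneMap], p: &RangePredicate) -> Vec<usize> {
    zones
        .iter()
        .enumerate()
        .filter(|(_, z)| z.may_match(p))
        .map(|(i, _)| i)
        .collect()
}
```

Zone maps can only rule blocks out. A block that passes `may_match` may still contain no matching rows, so the scan has to evaluate the predicate on the surviving blocks.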

Note

The XXH3-64 checksum is verified on every read. Corrupt blocks are rejected immediately with a typed error - no silent data corruption.

Write-Ahead Log

Every write is first appended to the WAL before being applied to the memtable. The WAL record format is:

[type: u8][seq_no: u64][length: u32][xxh3_checksum: u64][lz4_payload]

Six record types: ROW_PUT, ROW_DELETE, DELTA_INSERT, DELTA_TOMBSTONE, CHECKPOINT, BLOB_REF.

Group commit batches writes over a 10 ms window and issues a single fsync per batch, achieving 250K+ TPS. For financial workloads requiring per-commit durability, set DURABILITY STRICT (~5K TPS with fsync per commit).

Recovery: replay from last checkpoint, verify XXH3-64 per record, stop at first corruption. Partial WAL records are never applied.
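
The record framing and the stop-at-first-corruption rule can be sketched as follows. Several details here are assumptions for illustration: fields are little-endian, the checksum covers only the payload, FNV-1a stands in for XXH3-64, and the payload is left uncompressed instead of LZ4-compressed so the sketch needs no external crates.

```rust
const HEADER_LEN: usize = 1 + 8 + 4 + 8; // type + seq_no + length + checksum

/// Stand-in for XXH3-64 (64-bit FNV-1a), used only to keep this dependency-free.
fn checksum(data: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf29ce484222325;
    for &b in data {
        h ^= b as u64;
        h = h.wrapping_mul(0x100000001b3);
    }
    h
}

fn read_u32(b: &[u8]) -> u32 {
    let mut a = [0u8; 4];
    a.copy_from_slice(&b[..4]);
    u32::from_le_bytes(a)
}

fn read_u64(b: &[u8]) -> u64 {
    let mut a = [0u8; 8];
    a.copy_from_slice(&b[..8]);
    u64::from_le_bytes(a)
}

/// Append one record: [type][seq_no][length][checksum][payload].
fn encode_record(out: &mut Vec<u8>, rec_type: u8, seq_no: u64, payload: &[u8]) {
    out.push(rec_type);
    out.extend_from_slice(&seq_no.to_le_bytes());
    out.extend_from_slice(&(payload.len() as u32).to_le_bytes());
    out.extend_from_slice(&checksum(payload).to_le_bytes());
    out.extend_from_slice(payload);
}

/// Replay records in order and stop at the first truncated or corrupt one.
/// Returns (type, seq_no, payload) for every record that verified.
fn replay(log: &[u8]) -> Vec<(u8, u64, Vec<u8>)> {
    let mut applied = Vec::new();
    let mut pos = 0;
    while log.len() - pos >= HEADER_LEN {
        let rec_type = log[pos];
        let seq_no = read_u64(&log[pos + 1..]);
        let len = read_u32(&log[pos + 9..]) as usize;
        let sum = read_u64(&log[pos + 13..]);
        let start = pos + HEADER_LEN;
        // Partial trailing record: the payload runs past the end of the log.
        if len > log.len() - start {
            break;
        }
        let payload = &log[start..start + len];
        if checksum(payload) != sum {
            break; // corruption: nothing from this point on is applied
        }
        applied.push((rec_type, seq_no, payload.to_vec()));
        pos = start + len;
    }
    applied
}
```

Because the loop breaks instead of skipping ahead, a damaged record also hides every record after it. That matches the rule that recovery stops at the first corruption.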

Memtable

The memtable is a lock-free concurrent skip map (crossbeam-skiplist) with 16 shards and per-key MVCC version chains.

16 shards via xxh3_64(key) % 16 - writes to keys in different shards never contend with each other
Seal at 64 MB - atomically swap to a new empty memtable
Back-pressure at 256 MB - block writers when sealed-but-unflushed bytes exceed limit
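
The shard routing and the two thresholds can be sketched as below. This is an illustration under stated assumptions: `DefaultHasher` from the standard library stands in for xxh3_64, MB is read as MiB, and `WriteAction` and `next_action` are hypothetical names.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

const SHARDS: u64 = 16;
const SEAL_BYTES: usize = 64 * 1024 * 1024; // seal threshold (assumed MiB)
const BACKPRESSURE_BYTES: usize = 256 * 1024 * 1024; // back-pressure limit

/// Pick a shard for a key. DefaultHasher stands in for xxh3_64 here.
fn shard_for(key: &[u8]) -> usize {
    let mut h = DefaultHasher::new();
    key.hash(&mut h);
    (h.finish() % SHARDS) as usize
}

#[derive(Debug, PartialEq)]
enum WriteAction {
    /// Apply the write to the active memtable.
    Proceed,
    /// Swap in a new empty memtable, then apply the write.
    SealActive,
    /// Too many sealed-but-unflushed bytes: the writer must wait.
    BlockWriter,
}

/// Decide what a writer should do given current memory usage.
fn next_action(active_bytes: usize, sealed_unflushed_bytes: usize) -> WriteAction {
    if sealed_unflushed_bytes > BACKPRESSURE_BYTES {
        WriteAction::BlockWriter
    } else if active_bytes >= SEAL_BYTES {
        WriteAction::SealActive
    } else {
        WriteAction::Proceed
    }
}
```

Back-pressure is checked first in this sketch, so a writer is never allowed to seal yet another memtable while the flush queue is already over its limit.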

ART Primary Key Index

The primary key index uses an Adaptive Radix Tree (ART) - a trie-based structure that adapts node sizes (Node4, Node16, Node48, Node256) based on the number of children, with path compression for sparse subtrees.

168 ns/op lookup, measured with 1M keys (AWS c6id.4xlarge, release build)
O(k) lookup where k is key length, independent of dataset size
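
The node-size adaptation can be shown with a small sketch that picks the smallest node type able to hold a given number of children. The enum and function names are hypothetical; a real ART also handles shrinking on delete and path compression, which are omitted here.

```rust
/// The four ART inner-node layouts, named after their child capacity.
#[derive(Debug, PartialEq, Clone, Copy)]
enum NodeKind {
    Node4,
    Node16,
    Node48,
    Node256,
}

impl NodeKind {
    /// Maximum number of children this layout can hold.
    fn capacity(self) -> usize {
        match self {
            NodeKind::Node4 => 4,
            NodeKind::Node16 => 16,
            NodeKind::Node48 => 48,
            NodeKind::Node256 => 256,
        }
    }
}

/// Smallest layout that fits `children` child pointers.
/// A key byte has 256 possible values, so Node256 is always large enough.
fn node_kind_for(children: usize) -> NodeKind {
    match children {
        0..=4 => NodeKind::Node4,
        5..=16 => NodeKind::Node16,
        17..=48 => NodeKind::Node48,
        _ => NodeKind::Node256,
    }
}
```

A node grows to the next layout when an insert would exceed its `capacity`, so sparse regions of the key space stay compact while dense regions get direct 256-way indexing.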

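Bloom filters (Monkey allocation) sit in front of SST files to avoid unnecessary disk reads. The Monkey algorithm allocates filter bits across LSM levels to minimize the total false-positive rate for a fixed memory budget: smaller, shallower levels get more bits per key and the largest level gets the fewest. Compared with a uniform bits-per-key allocation, this reduces false positives by 40–80% at the same memory budget.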

Buffer Pool

The buffer pool is NUMA-aware with two regions and an optional RGABH gradient-driven admission layer (v0.7):

HotSet (70% RAM): LRU eviction for OLTP point lookups
ScanBuffer (30% RAM): Clock-sweep eviction for OLAP sequential scans

RGABH (Reinforcement-Gradient Adaptive Block Heat) - each block carries three exponentially-decaying heat signals (short, long, training). Admission uses W-TinyLFU-style frequency gating: a new block only evicts a resident block if it has proven higher frequency. On a skewed workload this improves HotSet hit rate from 0.639 to 0.803 (+16.4 pp). RGABH is optional: constructing the pool with BufferPool::new leaves it disabled and keeps exact LRU/clock behavior.
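
A minimal sketch of frequency-gated admission with a decaying heat value. It tracks one signal per block instead of the three described above, and `AdmissionFilter` with its methods is a hypothetical name, not the RGABH implementation.

```rust
use std::collections::HashMap;

/// Tracks one decaying heat value per block id.
struct AdmissionFilter {
    heat: HashMap<u64, f64>,
    decay: f64, // multiplier in (0, 1) applied on every aging tick
}

impl AdmissionFilter {
    fn new(decay: f64) -> Self {
        AdmissionFilter { heat: HashMap::new(), decay }
    }

    /// Record one access to a block, resident or not.
    fn record_access(&mut self, block_id: u64) {
        *self.heat.entry(block_id).or_insert(0.0) += 1.0;
    }

    /// Aging tick: scale every heat value down so old accesses fade out.
    /// Calling this periodically gives exponential decay over time.
    fn age(&mut self) {
        let decay = self.decay;
        for h in self.heat.values_mut() {
            *h *= decay;
        }
    }

    fn heat_of(&self, block_id: u64) -> f64 {
        self.heat.get(&block_id).copied().unwrap_or(0.0)
    }

    /// Admit the candidate only if it is strictly hotter than the block
    /// it would evict. Ties keep the resident block in place.
    fn should_admit(&self, candidate: u64, victim: u64) -> bool {
        self.heat_of(candidate) > self.heat_of(victim)
    }
}
```

The strict comparison is what protects a warm working set from one-off reads: a block seen once cannot displace a resident block that has been read repeatedly.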

The key invariant is preserved: ScanBuffer NEVER evicts a HotSet-resident block. This is why OLTP latency does not degrade during concurrent OLAP scans - the chaos test suite verifies zero HotSet evictions and unaffected OLTP p99 during a full table scan (see Benchmarks).

Compaction

GalaxDB uses lazy leveling compaction (Dostoevsky design) - tiered upper levels and a leveled bottom level:

L0: tiered (up to 4 files before compaction trigger)
L1–L3: tiered (multiple sorted runs per level)
L4 (bottom): leveled (single sorted run)

MVCC GC runs during merge: versions not needed by active snapshots or pinned tags are discarded. Write stall prevention uses a token-bucket RateLimiter calibrated to 70% of NVMe write bandwidth, with SILK-style flush pre-emption under back-pressure.
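
The token-bucket idea behind the RateLimiter can be sketched as follows. This is a generic token bucket with caller-supplied elapsed time so it stays deterministic; the struct, its fields, and `compaction_rate` are hypothetical names, and SILK-style pre-emption is not modeled.

```rust
/// Token bucket measured in bytes. Time is passed in by the caller.
struct TokenBucket {
    capacity: u64,       // maximum burst, in bytes
    tokens: u64,         // bytes currently available
    refill_per_sec: u64, // sustained rate, in bytes per second
}

impl TokenBucket {
    /// Start with a full bucket.
    fn new(refill_per_sec: u64, capacity: u64) -> Self {
        TokenBucket { capacity, tokens: capacity, refill_per_sec }
    }

    /// Add tokens for the elapsed time, never exceeding the capacity.
    fn refill(&mut self, elapsed_ms: u64) {
        let added = self.refill_per_sec.saturating_mul(elapsed_ms) / 1000;
        self.tokens = self.tokens.saturating_add(added).min(self.capacity);
    }

    /// Take `bytes` tokens if available. A `false` result means the
    /// compaction write should wait for a later refill.
    fn try_consume(&mut self, bytes: u64) -> bool {
        if bytes <= self.tokens {
            self.tokens -= bytes;
            true
        } else {
            false
        }
    }
}

/// Background write budget: 70% of the measured device write bandwidth.
fn compaction_rate(device_bytes_per_sec: u64) -> u64 {
    device_bytes_per_sec / 100 * 70 + device_bytes_per_sec % 100 * 70 / 100
}
```

Capping background writes below the device limit leaves headroom for foreground flushes, which is what keeps compaction from causing write stalls.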

Encryption at Rest

Every PAX block and WAL record is encrypted with AES-256-GCM before hitting storage. Key management is pluggable via the KeyProvider trait - see Encryption for configuration details.

Counter-based 96-bit nonces (4-byte random prefix + 8-byte atomic counter) prevent nonce reuse. AES-256-GCM decrypts a 1 MB block in 701 µs (1.43 GB/s); see Benchmarks for the full encryption throughput table, including AEGIS-256.
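
The nonce layout can be sketched as below, assuming the 4-byte prefix comes first and the counter is written big-endian; both are assumptions, as is the `NonceGenerator` name. The standard library has no random source, so the prefix is passed in and should come from a CSPRNG in practice.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Produces 96-bit nonces: 4-byte prefix followed by an 8-byte counter.
struct NonceGenerator {
    prefix: [u8; 4],
    counter: AtomicU64,
}

impl NonceGenerator {
    /// `prefix` should be drawn from a cryptographically secure RNG
    /// once per key or per process start.
    fn new(prefix: [u8; 4]) -> Self {
        NonceGenerator { prefix, counter: AtomicU64::new(0) }
    }

    /// Return the next nonce. `fetch_add` hands each caller a distinct
    /// counter value, so concurrent threads never share a nonce until
    /// the 64-bit counter wraps.
    fn next(&self) -> [u8; 12] {
        let n = self.counter.fetch_add(1, Ordering::Relaxed);
        let mut nonce = [0u8; 12];
        nonce[..4].copy_from_slice(&self.prefix);
        nonce[4..].copy_from_slice(&n.to_be_bytes());
        nonce
    }
}
```

Uniqueness only holds for one generator per key. After a restart the counter starts over, so a fresh prefix (or a persisted counter) is needed to avoid repeating a nonce under the same key.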
