Storage Engine
GalaxDB's storage engine is a custom LSM-tree implementation in Rust that unifies transactional row storage, columnar analytics, and vector similarity search. Every design decision is backed by a research paper.
Architecture
┌─────────────────────────────────────────────────────┐
│ Write Path │
│ Client → WAL (LZ4 + XXH3-64) → Memtable (SkipMap) │
│ ↓ (>1KB values) │
│ Blob Log (content-addressed) │
├─────────────────────────────────────────────────────┤
│ Flush Path │
│ Sealed Memtable → PAX Blocks → SST Files → NVMe │
│ (AES-256-GCM encryption before write) │
├─────────────────────────────────────────────────────┤
│ Read Path │
│ ART Index → Bloom Filter → Buffer Pool → PAX Block │
│ (HotSet for OLTP, ScanBuffer for OLAP) │
├─────────────────────────────────────────────────────┤
│ Background │
│ Lazy Leveling Compaction + MVCC GC │
│ RateLimiter (token bucket) + WriteController │
│ Disk-full handler (32MB reserve file) │
└─────────────────────────────────────────────────────┘
PAX Block Format
All data in GalaxDB is stored in PAX (Partition Attributes Across) blocks — a columnar format that stores each column's values contiguously within a block. This enables efficient analytical scans (only read the columns you need) while keeping row data co-located for OLTP point lookups.
┌─────────────────────────────────────────┐
│ Header: magic (0x47414C41), block_id, │
│ row_count, column descriptors, │
│ zone maps (min/max per column) │
├─────────────────────────────────────────┤
│ Column 0: FastPFOR (delta + bitpack) │
│ Column 1: Zstandard L3 │
│ Column 2: Raw (embeddings) │
├─────────────────────────────────────────┤
│ Row Offset Table │
├─────────────────────────────────────────┤
│ Footer: XXH3-64 checksum │
└─────────────────────────────────────────┘
Compression by column type:
- Fixed-width integers: delta encoding + bit-packing (FastPFOR) — ~4× compression
- Variable-width (TEXT, BLOB): Zstandard level 3 — ~3–5× compression
- Embedding columns: no compression (quantization handles it at the HNSW layer)
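To make the integer path concrete, here is a minimal sketch of delta encoding followed by a fixed bit-width estimate. It is not FastPFOR itself (which packs in blocks and handles outliers separately), and the function names are hypothetical; it only illustrates why sorted or slowly changing integer columns shrink well.

```rust
/// Map a signed delta to an unsigned value so small negatives stay small.
fn zigzag(d: i64) -> u64 {
    ((d << 1) ^ (d >> 63)) as u64
}

/// Inverse of `zigzag`.
fn unzigzag(z: u64) -> i64 {
    ((z >> 1) as i64) ^ -((z & 1) as i64)
}

/// Returns the first value plus the zigzagged differences between neighbours.
/// Returns `None` for an empty column.
fn delta_encode(values: &[i64]) -> Option<(i64, Vec<u64>)> {
    let (&first, rest) = values.split_first()?;
    let mut prev = first;
    let deltas = rest
        .iter()
        .map(|&v| {
            let d = v.wrapping_sub(prev);
            prev = v;
            zigzag(d)
        })
        .collect();
    Some((first, deltas))
}

/// Rebuilds the original column from `delta_encode` output.
fn delta_decode(first: i64, deltas: &[u64]) -> Vec<i64> {
    let mut out = Vec::with_capacity(deltas.len() + 1);
    out.push(first);
    let mut prev = first;
    for &z in deltas {
        prev = prev.wrapping_add(unzigzag(z));
        out.push(prev);
    }
    out
}

/// Bits per delta if every delta is stored at the width of the largest one.
fn bit_width(deltas: &[u64]) -> u32 {
    deltas.iter().map(|&d| 64 - d.leading_zeros()).max().unwrap_or(0)
}

/// Size of the bit-packed deltas in bytes, rounded up.
fn packed_bytes(deltas: &[u64]) -> usize {
    (deltas.len() * bit_width(deltas) as usize + 7) / 8
}
```

For the column `[1000, 1003, 1004, 1010]`, the deltas fit in 4 bits each, so the three deltas pack into 2 bytes instead of 24 bytes of raw `i64` values.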
Zone maps (min/max per column) enable predicate pushdown: the query planner skips entire blocks that cannot satisfy a WHERE clause. On typical analytical workloads this yields an 80% zone-map skip rate.
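The skip test itself is an interval-overlap check. A minimal sketch, assuming a single integer column and a closed range predicate (the `ZoneMap`, `RangePredicate`, and `blocks_to_scan` names are illustrative, not GalaxDB's API):

```rust
/// Per-block minimum and maximum for one column.
#[derive(Debug, Clone, Copy)]
struct ZoneMap {
    min: i64,
    max: i64,
}

/// A closed range predicate: `lo <= column <= hi`.
struct RangePredicate {
    lo: i64,
    hi: i64,
}

impl ZoneMap {
    /// True if the block could contain a matching row.
    /// False means the block can be skipped without reading it.
    fn may_match(&self, p: &RangePredicate) -> bool {
        self.max >= p.lo && self.min <= p.hi
    }
}

/// Indices of the blocks that still have to be read for this predicate.
fn blocks_to_scan(zones: &[ZoneMap], p: &RangePredicate) -> Vec<usize> {
    zones
        .iter()
        .enumerate()
        .filter(|(_, z)| z.may_match(p))
        .map(|(i, _)| i)
        .collect()
}
```

With five blocks covering 0–99, 100–199, 200–299, 300–399, and 400–499, the predicate `150 <= col <= 220` touches only blocks 1 and 2; the other three are skipped.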
Write-Ahead Log
Every write is first appended to the WAL before being applied to the memtable. The WAL record format is:
[type: u8][seq_no: u64][length: u32][xxh3_checksum: u64][lz4_payload]
Six record types: ROW_PUT, ROW_DELETE, DELTA_INSERT, DELTA_TOMBSTONE, CHECKPOINT, BLOB_REF.
Group commit batches writes over a 10 ms window and issues a single fsync per batch, achieving 250K+ TPS. For financial workloads requiring per-commit durability, set DURABILITY STRICT (~5K TPS with fsync per commit).
Recovery: replay from last checkpoint, verify XXH3-64 per record, stop at first corruption. Partial WAL records are never applied.
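The record layout and the stop-at-first-corruption rule can be sketched as follows. Several details here are assumptions rather than GalaxDB's actual format: little-endian field encoding, a checksum that covers only the payload, the numeric record-type codes, and FNV-1a standing in for XXH3-64 so the example needs no external crates. The payload is treated as opaque, already-compressed bytes.

```rust
/// type (1) + seq_no (8) + length (4) + checksum (8)
const HEADER_LEN: usize = 1 + 8 + 4 + 8;

// Hypothetical type codes; the source names the types but not their values.
const ROW_PUT: u8 = 0;
const ROW_DELETE: u8 = 1;

/// Stand-in checksum (FNV-1a 64). The real format uses XXH3-64.
fn checksum64(bytes: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf29ce484222325;
    for &b in bytes {
        h ^= b as u64;
        h = h.wrapping_mul(0x100000001b3);
    }
    h
}

#[derive(Debug, PartialEq)]
struct Record {
    kind: u8,
    seq_no: u64,
    payload: Vec<u8>,
}

/// Appends one record to the log buffer (little-endian fields, assumed).
fn encode(rec: &Record, out: &mut Vec<u8>) {
    out.push(rec.kind);
    out.extend_from_slice(&rec.seq_no.to_le_bytes());
    out.extend_from_slice(&(rec.payload.len() as u32).to_le_bytes());
    out.extend_from_slice(&checksum64(&rec.payload).to_le_bytes());
    out.extend_from_slice(&rec.payload);
}

fn read_u64(b: &[u8]) -> u64 {
    let mut a = [0u8; 8];
    a.copy_from_slice(&b[..8]);
    u64::from_le_bytes(a)
}

fn read_u32(b: &[u8]) -> u32 {
    let mut a = [0u8; 4];
    a.copy_from_slice(&b[..4]);
    u32::from_le_bytes(a)
}

/// Decodes records in order and stops at the first truncated or corrupt one,
/// so a partial record is never returned.
fn replay(log: &[u8]) -> Vec<Record> {
    let mut records = Vec::new();
    let mut pos = 0;
    while log.len() - pos >= HEADER_LEN {
        let kind = log[pos];
        let seq_no = read_u64(&log[pos + 1..]);
        let len = read_u32(&log[pos + 9..]) as usize;
        let sum = read_u64(&log[pos + 13..]);
        let start = pos + HEADER_LEN;
        let payload = match log.get(start..start + len) {
            Some(p) => p,
            None => break, // torn write: payload is shorter than the header claims
        };
        if checksum64(payload) != sum {
            break; // corruption: stop here and ignore everything after it
        }
        records.push(Record { kind, seq_no, payload: payload.to_vec() });
        pos = start + len;
    }
    records
}
```

Stopping at the first bad record, rather than skipping it, keeps the replayed prefix consistent with the order in which writes were acknowledged.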
Memtable
The memtable is a lock-free concurrent skip map (crossbeam-skiplist) with 16 shards and per-key MVCC version chains.
- 16 shards via xxh3_64(key) % 16: eliminates cross-shard contention
- Seal at 64 MB: atomically swap to a new empty memtable
- Back-pressure at 256 MB: block writers when sealed-but-unflushed bytes exceed the limit
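A minimal sketch of the shard routing and the two thresholds, under stated assumptions: `DefaultHasher` from the standard library stands in for `xxh3_64`, "MB" is read as MiB, and `shard_for`, `admit`, and `WriteDecision` are hypothetical names.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

const SHARDS: u64 = 16;
const SEAL_BYTES: u64 = 64 * 1024 * 1024; // assuming MB means MiB
const BACKPRESSURE_BYTES: u64 = 256 * 1024 * 1024;

/// Picks a shard from the key hash. DefaultHasher is a stand-in for xxh3_64.
fn shard_for(key: &[u8]) -> usize {
    let mut h = DefaultHasher::new();
    h.write(key);
    (h.finish() % SHARDS) as usize
}

#[derive(Debug, PartialEq)]
enum WriteDecision {
    /// Apply the write to the active memtable.
    Accept,
    /// Swap in a fresh memtable first, then apply the write.
    SealThenAccept,
    /// Too much sealed data is waiting for flush: the writer must wait.
    Block,
}

/// Decides what a writer does given the current memtable sizes.
fn admit(active_bytes: u64, sealed_unflushed_bytes: u64) -> WriteDecision {
    if sealed_unflushed_bytes > BACKPRESSURE_BYTES {
        WriteDecision::Block
    } else if active_bytes >= SEAL_BYTES {
        WriteDecision::SealThenAccept
    } else {
        WriteDecision::Accept
    }
}
```

Checking back-pressure before the seal threshold means a full active memtable is not sealed while the flush queue is already over its limit.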
ART Primary Key Index
The primary key index uses an Adaptive Radix Tree (ART) — a trie-based structure that adapts node sizes (Node4, Node16, Node48, Node256) based on the number of children, with path compression for sparse subtrees.
- 213 ns/lookup (sequential keys, warm cache)
- 752 ns/lookup (random keys, warm cache)
- O(k) lookup where k is key length, independent of dataset size
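Two pieces of ART can be illustrated in a few lines: choosing the smallest node type that fits the current fan-out, and measuring the shared prefix that path compression collapses into a single node. This is a sketch with hypothetical names, not the engine's node layout.

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum NodeKind {
    Node4,
    Node16,
    Node48,
    Node256,
}

/// Smallest node type that can hold `children` child pointers.
/// A node branches on one key byte, so it never has more than 256 children.
fn kind_for(children: usize) -> NodeKind {
    match children {
        0..=4 => NodeKind::Node4,
        5..=16 => NodeKind::Node16,
        17..=48 => NodeKind::Node48,
        _ => NodeKind::Node256,
    }
}

/// Length of the byte prefix two keys share. With path compression, this
/// prefix is stored once in a node instead of as a chain of one-child nodes.
fn common_prefix_len(a: &[u8], b: &[u8]) -> usize {
    a.iter().zip(b.iter()).take_while(|(x, y)| x == y).count()
}
```

For example, the keys `user:1001` and `user:1002` share an 8-byte prefix, so they diverge only at the final byte, and a lookup visits at most one node per key byte regardless of how many keys the tree holds.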
Bloom filters (Monkey allocation) sit in front of SST files to avoid unnecessary disk reads. The Monkey algorithm allocates filter bits across LSM levels to minimize the total false-positive rate: smaller, shallower levels get more bits per key and the largest level gets the fewest, reducing false positives by 40–80% at the same memory budget.
Buffer Pool
The buffer pool is NUMA-aware with two regions:
- HotSet (70% RAM): LRU eviction for OLTP point lookups
- ScanBuffer (30% RAM): Clock-sweep eviction for OLAP sequential scans
The key invariant: ScanBuffer NEVER evicts a HotSet-resident block. This is why OLTP p99 latency doesn't degrade during concurrent OLAP scans — measured at 191 µs OLTP p99 with 0 HotSet evictions during a full table scan.
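The isolation invariant can be sketched with two independently bounded regions. This is a simplification under stated assumptions: capacities are counted in blocks rather than bytes, a FIFO queue stands in for clock-sweep in the scan region, NUMA placement is ignored, and all names are hypothetical.

```rust
use std::collections::VecDeque;

struct BufferPool {
    hot: VecDeque<u64>,  // HotSet, front = least recently used
    scan: VecDeque<u64>, // ScanBuffer, FIFO as a stand-in for clock-sweep
    hot_cap: usize,
    scan_cap: usize,
    hot_evictions: u64,
}

impl BufferPool {
    /// Splits the pool 70/30. Assumes `total_blocks` is large enough that
    /// both regions get at least one slot.
    fn new(total_blocks: usize) -> Self {
        let hot_cap = total_blocks * 70 / 100;
        BufferPool {
            hot: VecDeque::new(),
            scan: VecDeque::new(),
            hot_cap,
            scan_cap: total_blocks - hot_cap,
            hot_evictions: 0,
        }
    }

    /// OLTP point lookup: refresh or admit the block in the HotSet (LRU).
    fn access_oltp(&mut self, block: u64) {
        if let Some(i) = self.hot.iter().position(|&b| b == block) {
            self.hot.remove(i);
        } else if self.hot.len() == self.hot_cap {
            self.hot.pop_front();
            self.hot_evictions += 1;
        }
        self.hot.push_back(block);
    }

    /// OLAP scan: a miss is admitted to the ScanBuffer only, and eviction
    /// only ever removes ScanBuffer blocks. The HotSet is never touched.
    fn access_olap(&mut self, block: u64) {
        if self.hot.contains(&block) || self.scan.contains(&block) {
            return;
        }
        if self.scan.len() == self.scan_cap {
            self.scan.pop_front();
        }
        self.scan.push_back(block);
    }
}
```

Because `access_olap` has no code path that modifies `hot`, a scan of any length leaves `hot_evictions` unchanged, which is the property the measurement above reports.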
Compaction
GalaxDB uses lazy leveling compaction (Dostoevsky design) — tiered upper levels and a leveled bottom level:
- L0: tiered (up to 4 files before compaction trigger)
- L1–L3: tiered (multiple sorted runs per level)
- L4 (bottom): leveled (single sorted run)
MVCC GC runs during merge: versions not needed by active snapshots or pinned tags are discarded. Write stall prevention uses a token-bucket RateLimiter calibrated to 70% of NVMe write bandwidth, with SILK-style flush pre-emption under back-pressure.
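A token bucket of the kind described can be sketched as follows. The clock is passed in explicitly to keep the example deterministic; the struct, its fields, and the burst-size parameter are assumptions, and SILK-style flush pre-emption is not modeled.

```rust
/// Token bucket for background I/O. One token is one byte of write budget.
struct TokenBucket {
    rate: f64,     // refill rate in bytes per second
    capacity: f64, // maximum burst in bytes
    tokens: f64,
    last_ns: u64,
}

impl TokenBucket {
    /// `fraction` is the share of device bandwidth granted to compaction
    /// (0.7 in the text). `burst_secs` is an assumed burst allowance.
    fn new(device_bytes_per_sec: f64, fraction: f64, burst_secs: f64) -> Self {
        let rate = device_bytes_per_sec * fraction;
        let capacity = rate * burst_secs;
        TokenBucket { rate, capacity, tokens: capacity, last_ns: 0 }
    }

    /// Refills for the elapsed time, then takes `bytes` tokens if available.
    /// Returns false, consuming nothing, when the caller should back off.
    fn try_acquire(&mut self, bytes: u64, now_ns: u64) -> bool {
        let elapsed = now_ns.saturating_sub(self.last_ns) as f64 / 1e9;
        self.tokens = (self.tokens + elapsed * self.rate).min(self.capacity);
        self.last_ns = now_ns;
        if self.tokens >= bytes as f64 {
            self.tokens -= bytes as f64;
            true
        } else {
            false
        }
    }
}
```

Capping compaction below the full device bandwidth leaves headroom for WAL appends and memtable flushes, so foreground writes are less likely to stall behind a large merge.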
Encryption at Rest
Every PAX block and WAL record is encrypted with AES-256-GCM before hitting storage. Key management is pluggable via the KeyProvider trait — see Encryption for configuration details.
Counter-based 96-bit nonces (4-byte random prefix + 8-byte atomic counter) prevent nonce reuse. AES-NI acceleration achieves 680 MB/s encrypt, 709 MB/s decrypt.
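The nonce layout can be sketched with the standard library alone. The prefix is supplied by the caller because `std` has no cryptographic random source; the big-endian counter encoding and the `NonceGenerator` name are assumptions.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Produces 96-bit nonces: 4-byte prefix followed by an 8-byte counter.
struct NonceGenerator {
    prefix: [u8; 4],
    counter: AtomicU64,
}

impl NonceGenerator {
    /// `random_prefix` should come from a cryptographically secure RNG and
    /// be regenerated on each start, since the counter restarts at zero.
    fn new(random_prefix: [u8; 4]) -> Self {
        NonceGenerator { prefix: random_prefix, counter: AtomicU64::new(0) }
    }

    /// Returns a nonce that is unique for this generator. `fetch_add` is
    /// atomic, so concurrent callers never observe the same counter value.
    /// A production version would also refuse to wrap past u64::MAX.
    fn next_nonce(&self) -> [u8; 12] {
        let n = self.counter.fetch_add(1, Ordering::Relaxed);
        let mut nonce = [0u8; 12];
        nonce[..4].copy_from_slice(&self.prefix);
        nonce[4..].copy_from_slice(&n.to_be_bytes());
        nonce
    }
}
```

Uniqueness matters more than unpredictability for GCM nonces: reusing a nonce under the same key breaks both confidentiality and authentication, which is why a counter is preferred over drawing all 96 bits at random.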