
Core Concepts

GalaxDB combines several subsystems in a single binary. This section explains each major subsystem: how it works, why it was designed this way, and how to use it effectively.

Storage Engine

The storage engine is a custom LSM-tree implementation in Rust, designed for mixed OLTP/OLAP workloads. It achieves 258K write TPS and 4.49 GB/s scan throughput on AWS c6id.4xlarge by combining group-commit WAL, columnar PAX blocks, zone-map pruning, and a NUMA-aware dual-region buffer pool.
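To illustrate zone-map pruning, one of the techniques listed above, the following Python sketch keeps a min/max summary per block and skips any block whose range cannot overlap the query predicate. It is a simplified model of the general idea, not GalaxDB's Rust implementation; Block, build_blocks, and scan_range are hypothetical names chosen for this example.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """A block of column values plus its zone map (min and max)."""
    min_value: int
    max_value: int
    rows: list


def build_blocks(values, block_size):
    """Split values into fixed-size blocks and record min/max for each."""
    blocks = []
    for start in range(0, len(values), block_size):
        chunk = values[start:start + block_size]
        blocks.append(Block(min(chunk), max(chunk), chunk))
    return blocks


def scan_range(blocks, lo, hi):
    """Return values in [lo, hi] and the number of blocks actually read."""
    matches = []
    blocks_read = 0
    for block in blocks:
        # Zone-map check: skip the block if its range cannot overlap [lo, hi].
        if block.max_value < lo or block.min_value > hi:
            continue
        blocks_read += 1
        matches.extend(v for v in block.rows if lo <= v <= hi)
    return matches, blocks_read


# Hypothetical data: 100 sorted values in blocks of 10.
blocks = build_blocks(list(range(100)), block_size=10)
matches, blocks_read = scan_range(blocks, 42, 47)
```

In this example only one of the ten blocks overlaps the range 42 to 47, so the other nine are skipped without reading their rows. Pruning is most effective when data is sorted or clustered on the filtered column.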

Embeddings

Embedding computation runs in a separate sidecar process (galaxdb-sidecar) that loads a HuggingFace sentence-transformer model. When you insert a row into a table with an EMBEDDING MODEL column, the sidecar automatically computes the embedding vector — no extra code, no API calls.

GalaxDB uses a custom HNSW (Hierarchical Navigable Small World) implementation for approximate nearest neighbor search. With M=16 and ef_construction=200, it achieves recall@10 = 0.990 on SIFT-1M. The SEMANTIC_MATCH(column, query, threshold) function integrates vector search directly into SQL WHERE clauses.
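The recall figure above compares approximate results against an exact search. The Python sketch below shows how such a comparison can be computed with a brute-force baseline. It assumes, for illustration only, that the threshold in SEMANTIC_MATCH is a minimum cosine similarity; the similarity metric is not stated here, and all function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def exact_top_k(vectors, query, k):
    """Brute-force nearest neighbors: the ground truth for recall."""
    ranked = sorted(
        vectors,
        key=lambda row_id: cosine_similarity(vectors[row_id], query),
        reverse=True,
    )
    return ranked[:k]


def semantic_match_reference(vectors, query, threshold):
    """Exact version of a threshold filter (assumed semantics, see lead-in)."""
    return [
        row_id
        for row_id, vec in vectors.items()
        if cosine_similarity(vec, query) >= threshold
    ]


def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact top-k that the approximate search returned."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)


# Hypothetical toy data: row id -> embedding.
vectors = {1: [1.0, 0.0], 2: [0.9, 0.1], 3: [0.0, 1.0]}
query = [1.0, 0.0]

truth = exact_top_k(vectors, query, k=2)
filtered = semantic_match_reference(vectors, query, threshold=0.9)
# Pretend an approximate index returned rows 1 and 3 instead of 1 and 2.
recall = recall_at_k([1, 3], truth)
```

An approximate index such as HNSW trades a small amount of recall for far fewer distance computations than this brute-force baseline.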

Time-Travel

Every CREATE VERSION TAG statement creates an immutable snapshot of the database at that point in time. You can query any snapshot with SELECT ... FROM table AT VERSION 'tag_name'. Snapshots are lightweight because they reference existing data blocks rather than copying them.
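The Python sketch below models why such snapshots are cheap: a tag stores only the list of block ids that were live when it was created, and blocks are never modified in place. This is a conceptual model under those assumptions, not GalaxDB's internal design; VersionedTable and its methods are hypothetical names.

```python
class VersionedTable:
    """Toy table whose version tags share immutable blocks with the live table."""

    def __init__(self):
        self._blocks = {}    # block id -> immutable tuple of rows
        self._live = []      # block ids that make up the current table
        self._tags = {}      # tag name -> tuple of block ids at tag time
        self._next_id = 0

    def append_block(self, rows):
        """Write rows as a new immutable block; existing blocks are untouched."""
        block_id = self._next_id
        self._next_id += 1
        self._blocks[block_id] = tuple(rows)
        self._live.append(block_id)

    def create_version_tag(self, name):
        """Snapshot = copy of the block-id list, not of the data itself."""
        if name in self._tags:
            raise ValueError(f"version tag {name!r} already exists")
        self._tags[name] = tuple(self._live)

    def scan(self, at_version=None):
        """Read the live table, or the table as of a version tag."""
        ids = self._live if at_version is None else self._tags[at_version]
        return [row for block_id in ids for row in self._blocks[block_id]]


table = VersionedTable()
table.append_block(["a", "b"])
table.create_version_tag("v1")
table.append_block(["c"])

rows_at_v1 = table.scan(at_version="v1")
rows_now = table.scan()
```

Rows written after the tag are invisible when scanning at "v1", while both views share the same first block in memory.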

Training Export

Version tags created with FOR TRAINING export the table data as a Lance dataset, a columnar format optimized for ML training. Lance supports zero-copy memory-mapped access, so PyTorch can read training data directly from disk without first loading the whole dataset into RAM; the operating system pages in only the parts that are actually read.
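To show what memory-mapped access means in general, the sketch below uses only Python's standard mmap module on a raw file of 32-bit floats. It does not use Lance or PyTorch and does not reflect the Lance file layout; the file name and the helpers write_column and read_value are hypothetical.

```python
import mmap
import os
import tempfile
from array import array


def write_column(path, values):
    """Write values to disk as raw native-endian 32-bit floats."""
    with open(path, "wb") as f:
        array("f", values).tofile(f)


def read_value(path, index):
    """Read one float through a memory map instead of reading the whole file."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            # Views over the mapping share its memory; no bytes are copied
            # until a single element is converted to a Python float.
            with memoryview(mapped) as raw:
                with raw.cast("f") as floats:
                    return floats[index]


path = os.path.join(tempfile.mkdtemp(), "column.bin")
write_column(path, [0.5, 1.5, 2.5, 3.5])
third = read_value(path, 2)
```

Only the pages that are touched get brought into memory by the operating system, which is what lets a training loop iterate over a dataset larger than RAM.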

Near-Dedup

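The WHERE NOT DUPLICATE clause uses MinHash LSH to identify and filter near-duplicate rows. This is particularly useful for training data quality, because duplicate or near-duplicate examples can cause models to overfit to the repeated content. MinHash LSH runs in roughly O(n) time in the number of rows, since it compares only rows that hash to the same bucket instead of every pair, and it typically reduces dataset size by 15–30%.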
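The following Python sketch shows the general MinHash LSH technique: each row is reduced to a short signature, the signature is split into bands, and rows that share any band are treated as near-duplicates. It is a minimal illustration, not GalaxDB's implementation; the shingle size, the 64 hash functions, the 16 bands, and all function names are assumptions made for this example.

```python
import hashlib


def shingles(text, k=3):
    """Set of k-word shingles, after lowercasing and whitespace normalization."""
    tokens = text.lower().split()
    if len(tokens) < k:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def _hash(seed, item):
    """One member of a family of 64-bit hash functions, selected by seed."""
    digest = hashlib.blake2b(
        item.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(items, num_perm=64):
    """Signature = minimum hash of the set under each hash function."""
    return tuple(min(_hash(seed, it) for it in items) for seed in range(num_perm))


def filter_near_duplicates(rows, num_perm=64, bands=16):
    """Keep the first row of each near-duplicate group (single pass over rows)."""
    rows_per_band = num_perm // bands
    seen_buckets = set()   # (band index, band contents) of rows kept so far
    kept = []
    for text in rows:
        sig = minhash_signature(shingles(text), num_perm)
        keys = [
            (b, sig[b * rows_per_band:(b + 1) * rows_per_band])
            for b in range(bands)
        ]
        # A shared bucket in any band marks the row as a likely near-duplicate.
        if any(key in seen_buckets for key in keys):
            continue
        seen_buckets.update(keys)
        kept.append(text)
    return kept


rows = [
    "The quick brown fox jumps over the lazy dog",
    "the quick  brown fox jumps over the lazy dog",   # same text after normalization
    "Vector databases index embeddings for similarity search",
]
kept = filter_near_duplicates(rows)
```

With 16 bands of 4 rows each, pairs with a Jaccard similarity around 0.5 or higher are likely to collide in at least one band. Because collisions are probabilistic, a production system would usually verify candidate pairs before dropping a row.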