Core Concepts

GalaxDB combines storage, embedding, vector search, versioning, and training-export subsystems in a single binary. This section explains each major subsystem: how it works, why it was designed this way, and how to use it effectively.

Training Export

Export versioned datasets in Lance format with one SQL command. Zero-copy PyTorch integration via memory-mapped files. float32, sq8, and rabitq precision.

Near-Dedup (MinHash LSH)

Remove near-duplicate rows from training data using MinHash LSH. WHERE NOT DUPLICATE in any SELECT. Typically reduces dataset size by 15–30%.

RAG & Vector Indexing

End-to-end retrieval-augmented generation in a single SQL query. HNSW, DiskANN, semantic caching, and built-in two-stage reranking.

Semantic Cache

Cache SEMANTIC_MATCH results by query similarity. CREATE SEMANTIC CACHE FOR TABLE t SIMILARITY f TTL n. Invalidated on write.

Serializable Isolation

Opt-in Serializable Snapshot Isolation on top of the default snapshot isolation. Aborts write-skew with SQLSTATE 40001.

Storage Engine

The storage engine is a custom LSM-tree implementation in Rust, designed for mixed OLTP/OLAP workloads on AWS c6id.4xlarge instances. It combines a group-commit WAL, columnar PAX blocks, zone-map pruning, and a NUMA-aware dual-region buffer pool. See Benchmarks for measured write throughput against PostgreSQL 16.
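
Of the techniques listed above, zone-map pruning is the easiest to show in a few lines. The sketch below is a generic Python illustration of the idea, not GalaxDB's Rust implementation: the `Block`, `make_blocks`, and `scan_range` names and the block size are assumptions made for this example. Each block stores the minimum and maximum of its values, and a range scan skips any block whose bounds cannot overlap the predicate.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """A column block plus its zone map (min and max of the stored values)."""
    values: list
    lo: int
    hi: int


def make_blocks(values, block_size):
    """Split a column into fixed-size blocks and record min/max per block."""
    blocks = []
    for i in range(0, len(values), block_size):
        chunk = values[i:i + block_size]
        blocks.append(Block(chunk, min(chunk), max(chunk)))
    return blocks


def scan_range(blocks, lo, hi):
    """Return values in [lo, hi] and the number of blocks actually read."""
    hits, blocks_read = [], 0
    for block in blocks:
        if block.hi < lo or block.lo > hi:
            continue  # the zone map proves no value in this block can match
        blocks_read += 1
        hits.extend(v for v in block.values if lo <= v <= hi)
    return hits, blocks_read


# Hypothetical column of 1000 sorted integers in 10 blocks of 100 values.
blocks = make_blocks(list(range(1000)), block_size=100)
hits, blocks_read = scan_range(blocks, 250, 260)
```

On sorted or clustered data the scan touches only the blocks that overlap the range. On randomly ordered data most zone maps overlap most predicates, so pruning helps far less.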

Embeddings

Embedding computation runs in a separate sidecar process (galaxdb-sidecar) that loads a HuggingFace sentence-transformer model. When you insert a row into a table with an EMBEDDING MODEL column, the sidecar computes the embedding vector automatically, with no extra code and no API calls.

Vector Search

GalaxDB uses a custom HNSW (Hierarchical Navigable Small World) implementation for approximate nearest neighbor search. With M=16 and ef_construction=200, it achieves recall@10 = 0.990 on SIFT-1M. The SEMANTIC_MATCH(column, query, threshold) function integrates vector search directly into SQL WHERE clauses.
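
To make the recall@10 figure concrete, here is a minimal Python sketch of how recall@k is usually computed: compare the IDs an approximate index returns with the exact top-k from a brute-force scan. The function names, the random data, and the simulated approximate result are all assumptions for illustration; no GalaxDB or HNSW code is involved.

```python
import random


def exact_top_k(query, vectors, k):
    """Brute-force k nearest neighbors by squared Euclidean distance."""
    def dist(i):
        return sum((a - b) ** 2 for a, b in zip(query, vectors[i]))
    return sorted(range(len(vectors)), key=dist)[:k]


def recall_at_k(approx_ids, exact_ids):
    """Fraction of the true top-k that the approximate result contains."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)


rng = random.Random(0)
vectors = [[rng.random() for _ in range(8)] for _ in range(200)]
query = [rng.random() for _ in range(8)]

truth = exact_top_k(query, vectors, k=10)

# Simulate an ANN index that found nine true neighbors and one wrong one.
wrong = next(i for i in range(len(vectors)) if i not in truth)
approx = truth[:9] + [wrong]
recall = recall_at_k(approx, truth)
```

A recall@10 of 0.990 therefore means that, averaged over the benchmark queries, 99% of the true ten nearest neighbors appear in the returned ten.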

Time-Travel

Every CREATE VERSION TAG creates an immutable snapshot of the database at that point in time. You can query any snapshot with SELECT ... FROM table AT VERSION 'tag_name'. Snapshots are lightweight: they reference existing data blocks rather than copying them.
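
The reason snapshots can be cheap is that a tag only needs to remember which immutable blocks were live when it was created. The toy Python model below sketches that idea under stated assumptions: `VersionedTable` and its methods are hypothetical names for this example and do not describe GalaxDB's internal data structures.

```python
class VersionedTable:
    """Toy table whose snapshots share immutable blocks instead of copying them."""

    def __init__(self):
        self.blocks = []   # current list of block references
        self.tags = {}     # tag name -> tuple of block references

    def append_block(self, rows):
        # Blocks are stored as tuples so they cannot change after being written.
        self.blocks = self.blocks + [tuple(rows)]

    def create_version_tag(self, name):
        if name in self.tags:
            raise ValueError(f"tag {name!r} already exists")
        # Only the list of references is captured; no row data is copied.
        self.tags[name] = tuple(self.blocks)

    def scan(self, at_version=None):
        blocks = self.blocks if at_version is None else self.tags[at_version]
        return [row for block in blocks for row in block]


table = VersionedTable()
table.append_block([1, 2])
table.create_version_tag("v1")
table.append_block([3])

old_rows = table.scan(at_version="v1")
new_rows = table.scan()
```

Creating a tag costs one reference per block, and later writes never disturb the rows a tag can see.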

Training Export

Version tags created with FOR TRAINING export the table data as a Lance dataset, a columnar format optimized for ML training. Lance supports zero-copy memory-mapped access, so PyTorch can read training data directly from disk without first loading the whole dataset into RAM.
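
The following sketch shows what memory-mapped, zero-copy access means in practice, using only the Python standard library. It is an illustration of the mechanism and makes several assumptions: the file is a raw array of float32 values rather than a Lance dataset, and the file name is made up. It does not use the Lance or PyTorch APIs.

```python
import array
import mmap
import os
import tempfile

# Write a small float32 column to disk (stand-in for an exported column file).
values = array.array("f", [0.5, 1.5, 2.5, 3.5])
path = os.path.join(tempfile.mkdtemp(), "column.bin")
with open(path, "wb") as f:
    values.tofile(f)

# Map the file and view it as float32 without copying it into a Python list.
with open(path, "rb") as f:
    mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    view = memoryview(mapped).cast("f")
    count = len(view)
    third = view[2]      # the OS pages in only the region that is touched
    total = sum(view)
    view.release()       # release the view before closing the mapping
    mapped.close()
```

The operating system loads pages on demand and can evict them under memory pressure, which is why a mapped dataset can be much larger than available RAM.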

Near-Dedup

The WHERE NOT DUPLICATE clause uses MinHash LSH to identify and filter near-duplicate rows. This is particularly useful for training data quality, because duplicate or near-duplicate examples can cause models to overfit. MinHash LSH runs in roughly O(n) time in the number of rows, since each row is hashed once and compared only against rows in the same buckets, and it typically reduces dataset size by 15–30%.
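
For readers unfamiliar with the technique, here is a compact Python sketch of MinHash with LSH banding. It is a generic textbook version, not GalaxDB's implementation: the signature length, band count, similarity threshold, word-level shingling, and all function names are assumptions chosen for this example.

```python
import hashlib
from collections import defaultdict

NUM_HASHES = 64   # signature length (illustrative choice)
BANDS = 16        # 16 bands of 4 values each
THRESHOLD = 0.7   # estimated Jaccard similarity that counts as a duplicate


def shingles(text, k=3):
    """Set of k-word shingles, case- and whitespace-insensitive."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def seeded_hash(seed, item):
    """64-bit hash of item; each seed acts as a different hash function."""
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "little")


def minhash(text):
    """Signature: for each hash function, the minimum hash over all shingles."""
    items = shingles(text)
    return tuple(min(seeded_hash(seed, s) for s in items) for seed in range(NUM_HASHES))


def dedup(texts):
    """Keep the first row of each near-duplicate group in a single pass."""
    width = NUM_HASHES // BANDS
    buckets = defaultdict(list)   # (band index, band values) -> kept row indices
    signatures = {}
    kept = []
    for idx, text in enumerate(texts):
        sig = minhash(text)
        keys = [(b, sig[b * width:(b + 1) * width]) for b in range(BANDS)]
        # Only rows sharing at least one band bucket are compared.
        candidates = {j for key in keys for j in buckets.get(key, ())}
        is_duplicate = any(
            sum(x == y for x, y in zip(sig, signatures[j])) / NUM_HASHES >= THRESHOLD
            for j in candidates
        )
        if not is_duplicate:
            kept.append(idx)
            signatures[idx] = sig
            for key in keys:
                buckets[key].append(idx)
    return [texts[i] for i in kept]


rows = [
    "The quick brown fox jumps over the lazy dog",
    "the quick  brown fox jumps over the LAZY dog",
    "Galaxies rotate slowly around a dense central bulge",
]
unique = dedup(rows)
```

The second row differs from the first only in case and spacing, so it produces the same shingles and is dropped. Rows that differ by a few words are caught probabilistically: the fraction of matching signature positions estimates the Jaccard similarity of the shingle sets.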

RAG & Vector Indexing

GalaxDB handles the entire retrieval-augmented generation pipeline in a single SQL statement: embed on write, index with HNSW or DiskANN, search with SEMANTIC_MATCH, and cache results with the semantic cache. The built-in two-stage reranker ensures the final top-k is exact rather than approximate, so no separate reranking step is needed. See RAG & Vector Indexing for the full pipeline including DiskANN, asymmetric encoding, and historical vector search.

Semantic Cache

The semantic cache speeds up repeated or near-identical SEMANTIC_MATCH queries by returning cached result sets instead of re-embedding and re-running HNSW. Configure with CREATE SEMANTIC CACHE FOR TABLE t SIMILARITY 0.97 TTL 300. Cache entries invalidate automatically on INSERT, UPDATE, or DELETE to the table. Hits are tracked via galaxdb_semantic_cache_hits_total in /metrics.
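
The behavior described above (match by similarity, expire after a TTL, invalidate on write) can be modeled in a few lines of Python. This is a sketch under assumptions, not GalaxDB's code: the `SemanticCache` class, the linear scan over entries, and the injected clock are all hypothetical, and the cache is keyed directly by query embeddings for simplicity.

```python
import math
import time


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


class SemanticCache:
    def __init__(self, similarity, ttl, clock=time.monotonic):
        self.similarity = similarity   # minimum cosine similarity for a hit
        self.ttl = ttl                 # entry lifetime in seconds
        self.clock = clock
        self.entries = []              # (query embedding, result rows, expiry time)
        self.hits = 0

    def get(self, query_vec):
        now = self.clock()
        self.entries = [e for e in self.entries if e[2] > now]  # drop expired entries
        for vec, rows, _ in self.entries:
            if cosine(vec, query_vec) >= self.similarity:
                self.hits += 1
                return rows
        return None

    def put(self, query_vec, rows):
        self.entries.append((query_vec, rows, self.clock() + self.ttl))

    def invalidate(self):
        """Would be called on any INSERT, UPDATE, or DELETE to the table."""
        self.entries.clear()


# A fake clock keeps the example deterministic.
now = [0.0]
cache = SemanticCache(similarity=0.97, ttl=300, clock=lambda: now[0])
cache.put([1.0, 0.0], ["row-1"])

near = cache.get([0.99, 0.05])   # almost the same direction: served from cache
far = cache.get([0.0, 1.0])      # orthogonal query: miss

now[0] = 301.0
expired = cache.get([1.0, 0.0])  # TTL of 300 seconds has elapsed: miss

cache.put([1.0, 0.0], ["row-1"])
cache.invalidate()               # simulate a write to the table
after_write = cache.get([1.0, 0.0])
```

A higher SIMILARITY value makes hits rarer but safer, while a lower value raises the hit rate at the risk of returning results for a query that only looks similar.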

Serializable Isolation

GalaxDB defaults to snapshot isolation. For workloads that need to prevent write-skew anomalies, v0.7 adds opt-in Serializable Snapshot Isolation via BEGIN ISOLATION LEVEL SERIALIZABLE. A commit-time certifier aborts one of two conflicting transactions with SQLSTATE 40001 when write-skew is detected. Existing workloads using plain BEGIN are unaffected.
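
The classic write-skew example is two on-call doctors who each check that the other is on call and then sign off. The Python model below shows why snapshot isolation lets both commits through and how a commit-time check can abort one of them. It is a deliberately simplified sketch: the certifier here uses plain read-set validation (abort if anything the transaction read was overwritten by a concurrent commit), which is more conservative than real Serializable Snapshot Isolation, and every class and function name is an assumption for this example.

```python
class SerializationFailure(Exception):
    sqlstate = "40001"


class Store:
    def __init__(self, data):
        self.data = dict(data)
        self.commit_log = []   # (commit sequence number, set of written keys)
        self.seq = 0

    def begin(self, serializable=False):
        return Txn(self, serializable)


class Txn:
    def __init__(self, store, serializable):
        self.store = store
        self.serializable = serializable
        self.snapshot = dict(store.data)   # copied for simplicity in this toy model
        self.start_seq = store.seq
        self.reads = set()
        self.writes = {}

    def read(self, key):
        self.reads.add(key)
        return self.writes.get(key, self.snapshot[key])

    def write(self, key, value):
        self.writes[key] = value

    def commit(self):
        # Keys written by transactions that committed after this one started.
        concurrent = set()
        for seq, keys in self.store.commit_log:
            if seq > self.start_seq:
                concurrent |= keys
        # Snapshot isolation checks only write-write conflicts.
        # Serializable mode (simplified) also validates the read set.
        checked = set(self.writes) | (self.reads if self.serializable else set())
        if checked & concurrent:
            raise SerializationFailure("could not serialize access")
        self.store.seq += 1
        self.store.commit_log.append((self.store.seq, set(self.writes)))
        self.store.data.update(self.writes)


def go_off_call(txn, me, other):
    """Application rule: at least one doctor must stay on call."""
    if txn.read(other):
        txn.write(me, False)


def run(serializable):
    store = Store({"alice": True, "bob": True})
    t1, t2 = store.begin(serializable), store.begin(serializable)
    go_off_call(t1, "alice", "bob")
    go_off_call(t2, "bob", "alice")
    t1.commit()
    try:
        t2.commit()
        aborted = False
    except SerializationFailure:
        aborted = True
    return store.data, aborted


snapshot_result = run(serializable=False)
serializable_result = run(serializable=True)
```

Under snapshot isolation both transactions commit and nobody is left on call. In serializable mode the second commit fails, and an application would retry it, typically after catching SQLSTATE 40001.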
