v1.0.0-beta.1 · 740 tests passing

GalaxDB Documentation

GalaxDB is an AI-native database written in Rust. It combines SQL, HNSW vector search, local embeddings, time-travel queries, training export (Lance format), and near-dedup (MinHash LSH) in a single binary.

What is GalaxDB?

GalaxDB is designed for AI and ML workloads that need more than a traditional database. Instead of stitching together a relational database, a vector store, an embedding API, and a training pipeline, GalaxDB provides all of these capabilities in one binary with a single SQL interface.

It speaks the PostgreSQL wire protocol, so any PostgreSQL client — psql, psycopg2, SQLAlchemy, node-postgres — connects without modification.

SQL

-- Create a table with automatic embeddings
CREATE TABLE docs (
    id   INT PRIMARY KEY,
    body TEXT EMBEDDING MODEL 'sentence-transformers/all-MiniLM-L6-v2' DIM 384
);

-- Insert rows — embeddings computed automatically
INSERT INTO docs (id, body) VALUES (1, 'machine learning and neural networks');
INSERT INTO docs (id, body) VALUES (2, 'cooking recipes italian pasta');

-- Semantic search
SELECT id, body FROM docs WHERE SEMANTIC_MATCH(body, 'AI deep learning', 0.4);

-- Time-travel: tag the current state, then query that snapshot later
CREATE VERSION TAG 'v1' FOR TRAINING WITH TRAINING PRECISION 'float32';
SELECT * FROM docs AT VERSION 'v1';
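Because GalaxDB speaks the PostgreSQL wire protocol, the statements above should be usable from a standard driver. The following is a minimal sketch with psycopg2, not an official client example. Port 5433 is the wire-protocol port shown in the architecture diagram in this document; the host, database name, and user are placeholder assumptions, and `build_dsn` and `semantic_search` are hypothetical helper names.

```python
# Sketch: query GalaxDB through a stock PostgreSQL driver (psycopg2).
# Assumptions, not taken from the GalaxDB docs: host, dbname, and user values.
# Port 5433 is the wire-protocol port shown in the architecture diagram.


def build_dsn(host="localhost", port=5433, dbname="galaxdb", user="galaxdb"):
    """Return a libpq-style connection string."""
    return f"host={host} port={port} dbname={dbname} user={user}"


def semantic_search(dsn, query_text, threshold=0.4):
    """Run the SEMANTIC_MATCH query from the SQL example and return the rows."""
    import psycopg2  # imported lazily so build_dsn works without the driver

    conn = psycopg2.connect(dsn)
    try:
        with conn.cursor() as cur:
            # psycopg2 substitutes the parameters client-side before sending.
            cur.execute(
                "SELECT id, body FROM docs WHERE SEMANTIC_MATCH(body, %s, %s)",
                (query_text, threshold),
            )
            return cur.fetchall()
    finally:
        conn.close()


dsn = build_dsn()
print(dsn)
```

With a server running, `semantic_search(dsn, "AI deep learning")` would issue the same search as the SQL example; the call is left out here so the snippet runs without a database.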

Key Features

Full SQL

Complete AuroraSQL dialect — CREATE, INSERT, UPDATE, DELETE, SELECT with WHERE, joins, and aggregates.

Local Embeddings

Text → vector conversion runs locally in an optional sidecar process. No API key, no data leaving your machine.

HNSW Vector Search

recall@10 = 0.990 on SIFT-1M at ef=200. 459 µs mean latency. SEMANTIC_MATCH in any WHERE clause.

Time-Travel

SELECT ... AT VERSION 'tag' queries historical snapshots. Useful for reproducible ML training and EU AI Act compliance work.

Training Export

CREATE VERSION TAG ... FOR TRAINING exports a Lance dataset that PyTorch can load zero-copy, in one SQL command.

Encryption at Rest

AES-256-GCM on every block and WAL record. Pluggable key management: local, env, AWS KMS, Vault.
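For readers unfamiliar with the recall@10 figure quoted under HNSW Vector Search: recall@k is the fraction of the true k nearest neighbours that an approximate index returns. The sketch below computes it on toy data, with brute-force L2 search as ground truth. The vectors and the "approximate" result are made up for illustration and are not GalaxDB output.

```python
# Sketch: how a recall@k figure such as "recall@10 = 0.990" is computed.
# The vectors and the approximate result are toy data, not GalaxDB output.
import math


def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def exact_top_k(query, vectors, k):
    """Brute-force ground truth: ids of the k nearest vectors."""
    ranked = sorted(range(len(vectors)), key=lambda i: l2(query, vectors[i]))
    return ranked[:k]


def recall_at_k(approx_ids, exact_ids):
    """Fraction of the true neighbours that the approximate search returned."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)


vectors = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 3.0), (5.0, 1.0)]
query = (0.1, 0.1)
truth = exact_top_k(query, vectors, k=3)
approx = [0, 1, 3]  # pretend an approximate index missed one true neighbour
print(truth, recall_at_k(approx, truth))
```

Here two of the three true neighbours are found, so recall@3 is 2/3. A benchmark figure averages this ratio over many queries.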

Quick Links

Installation

curl, Homebrew, Docker, pip, or build from source

5-Minute Quickstart

Create a table, insert data, run semantic search

AuroraSQL Reference

Complete SQL dialect reference

Python Client

Embedded and server mode Python bindings

Benchmarks

Real numbers from AWS c6id.4xlarge

Contributing

How to contribute to GalaxDB

Architecture Overview

GalaxDB is a single Rust binary with an optional sidecar process for embedding computation. The core engine handles SQL parsing (AuroraSQL), storage (LSM-tree with PAX blocks), vector indexing (HNSW), and the PostgreSQL wire protocol.

The embedding sidecar is a separate process that loads HuggingFace sentence-transformer models and serves embedding requests over a local socket. This isolation means the main database process never loads Python or ML frameworks, so it stays lean and is insulated from crashes in the model runtime.

┌─────────────────────────────────────────────────────┐
│                  galaxdb-server                      │
│                                                      │
│  PostgreSQL Wire Protocol (port 5433)                │
│  ↓                                                   │
│  AuroraSQL Parser → Executor                         │
│  ↓                                                   │
│  Storage Engine (LSM + PAX + WAL + ART)              │
│  ↓                                                   │
│  HNSW Vector Index                                   │
│  ↓                                                   │
│  HTTP Observability (port 9090)                      │
└─────────────────────────────────────────────────────┘
         ↕ local socket
┌─────────────────────────────────────────────────────┐
│              galaxdb-sidecar (optional)              │
│  HuggingFace sentence-transformer model              │
│  text → float32[384] embeddings                      │
└─────────────────────────────────────────────────────┘
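The server/sidecar split in the diagram can be pictured as a plain request/response exchange over a local socket. The sketch below is illustrative only: the length-prefixed JSON framing, the message fields, and `fake_embed` are hypothetical, since this document does not describe GalaxDB's actual sidecar protocol. Only the 384-dimension output width comes from the diagram.

```python
# Sketch: request/response over a local socket, in the spirit of the
# server <-> sidecar split. The framing (4-byte length + JSON) and
# fake_embed are hypothetical; the real protocol is not documented here.
import json
import socket
import struct
import threading

DIM = 384  # embedding width shown in the diagram


def send_msg(sock, obj):
    """Send one JSON message prefixed with its length (big-endian uint32)."""
    payload = json.dumps(obj).encode("utf-8")
    sock.sendall(struct.pack("!I", len(payload)) + payload)


def recv_exact(sock, n):
    """Read exactly n bytes or raise if the peer closes early."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the socket")
        buf += chunk
    return buf


def recv_msg(sock):
    (length,) = struct.unpack("!I", recv_exact(sock, 4))
    return json.loads(recv_exact(sock, length).decode("utf-8"))


def fake_embed(text):
    """Stand-in for a real model: a deterministic, meaningless vector."""
    return [float((len(text) + i) % 7) for i in range(DIM)]


def sidecar(sock):
    """Serve a single embedding request, then close the connection."""
    request = recv_msg(sock)
    send_msg(sock, {"embedding": fake_embed(request["text"])})
    sock.close()


server_end, sidecar_end = socket.socketpair()
worker = threading.Thread(target=sidecar, args=(sidecar_end,))
worker.start()
send_msg(server_end, {"text": "machine learning and neural networks"})
reply = recv_msg(server_end)
worker.join()
server_end.close()
print(len(reply["embedding"]))
```

The design point is that the database side only needs a socket and a serializer. Everything model-related stays on the other end of the connection.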

The storage engine is built on 12 components: PAX blocks, WAL, memtable (crossbeam-skiplist), ART primary key index, Bloom filters (Monkey allocation), NUMA-aware buffer pool, lazy leveling compaction (Dostoevsky), KV separation, AES-256-GCM encryption, write stall mitigation, disk-full handling, and statistics collection. Every design decision has a research citation.
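To make one of the listed components concrete, here is a minimal Bloom filter sketch. In an LSM-tree, such a filter lets a read skip on-disk files that definitely do not contain a key. The bit-array size, hash count, and double-hashing scheme are illustrative assumptions; this is not GalaxDB's implementation and does not model Monkey-style per-level allocation.

```python
# Sketch: a basic Bloom filter. Sizes and hashing are illustrative choices,
# not GalaxDB's implementation.
import hashlib


class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Derive num_hashes bit positions from two base hashes (double hashing).
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )


bf = BloomFilter()
for key in ("doc:1", "doc:2"):
    bf.add(key)
print(bf.might_contain("doc:1"), bf.might_contain("doc:2"))
```

An added key always reports as possibly present. A key that was never added usually reports as absent, with a small false-positive rate that depends on the number of bits per key.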

Installation