Vector Search (HNSW)
GalaxDB uses a custom HNSW (Hierarchical Navigable Small World) implementation for approximate nearest neighbor search. It achieves recall@10 = 0.990 on SIFT-1M at ef_search=200, with 459 µs mean query latency.
HNSW Overview
HNSW builds a multi-layer graph where each node is a vector. The top layers are sparse (long-range connections) and the bottom layer is dense (short-range connections). Search starts at the top layer and greedily descends to the bottom, narrowing the candidate set at each layer.
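The descent described above can be sketched on a toy graph. This is a simplified illustration, not GalaxDB's implementation: it tracks a single best candidate per layer (real HNSW keeps a candidate set of size ef on the bottom layer), and the function names, the hand-built `top` and `bottom` layers, and the use of Euclidean distance are all assumptions made for the example.

```python
import math

def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def greedy_layer_search(vectors, neighbors, entry, query):
    """Walk one layer: keep moving to a neighbor that is closer to the
    query until no neighbor improves on the current node."""
    current = entry
    current_d = dist(vectors[current], query)
    improved = True
    while improved:
        improved = False
        for n in neighbors.get(current, ()):
            d = dist(vectors[n], query)
            if d < current_d:
                current, current_d = n, d
                improved = True
    return current

def layered_search(vectors, layers, entry, query):
    """Descend from the sparsest layer to the densest, reusing the best
    node found on each layer as the entry point for the next one."""
    for neighbors in layers:  # layers[0] is the top (sparsest) layer
        entry = greedy_layer_search(vectors, neighbors, entry, query)
    return entry

# Toy data (hypothetical): eight points on a line.
vectors = {i: (float(i), 0.0) for i in range(8)}
top = {0: [4], 4: [0]}  # sparse layer: one long-range link
bottom = {i: [j for j in (i - 1, i + 1) if 0 <= j < 8] for i in range(8)}

print(layered_search(vectors, [top, bottom], 0, (5.2, 0.0)))  # 5
```

The long-range link on the top layer jumps from node 0 to node 4 in one hop, and the bottom layer then refines the result to node 5 using only short-range links.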
Key parameters:
- M (default: 16) — number of bidirectional links per node. Higher M = better recall, more memory.
- ef_construction (default: 200) — candidate set size during index build. Higher = better recall, slower build.
- ef_search — candidate set size during query. Higher = better recall, slower query. Tunable per query.
GalaxDB's HNSW uses cosine similarity (normalized dot product) as the similarity metric. Vectors are L2-normalized at insert time, so cosine similarity reduces to a plain dot product, and ranking by cosine similarity gives the same order as ranking by Euclidean distance on the unit sphere.
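For unit-length vectors the two metrics are tied by the identity ‖a − b‖² = 2 − 2·cos(a, b), which is why they produce the same neighbor ordering. The short check below illustrates this on two arbitrary example vectors; the helper names are hypothetical and not part of GalaxDB.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

a = l2_normalize([3.0, 4.0, 0.0])
b = l2_normalize([1.0, 2.0, 2.0])

cos_sim = dot(a, b)  # on unit vectors, the dot product is the cosine similarity
lhs = squared_euclidean(a, b)
rhs = 2.0 - 2.0 * cos_sim

print(math.isclose(lhs, rhs))  # True
```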
Recall vs ef_search
The ef_search parameter controls the recall/latency tradeoff. These numbers are from the SIFT-1M benchmark on AWS c6id.4xlarge (M=16, ef_construction=200):
| ef_search | recall@10 | mean latency | p99 latency |
|---|---|---|---|
| 10 | 0.762 | 57.6 µs | 101 µs |
| 50 | 0.959 | 158.1 µs | 228 µs |
| 100 | 0.983 | 266.5 µs | 364 µs |
| 200 | 0.990 | 459.4 µs | 616 µs |
Dataset: SIFT-1M (1,000,000 × 128-dim float32 vectors, 10,000 queries). Build time: 66.2 s (15,114 vec/sec). Hardware: AWS c6id.4xlarge, Intel Xeon Platinum 8375C.
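The source does not say how recall@10 was computed. The sketch below uses the common definition: for each query, the fraction of the true k nearest neighbors that appear in the approximate top-k, averaged over all queries. The function name and the example ID lists are illustrative assumptions.

```python
def recall_at_k(approx_ids, exact_ids, k=10):
    """Mean fraction of the exact top-k neighbors found in the approximate top-k.

    approx_ids, exact_ids: one list of neighbor IDs per query, best first.
    """
    total = 0.0
    for approx, exact in zip(approx_ids, exact_ids):
        total += len(set(approx[:k]) & set(exact[:k])) / k
    return total / len(exact_ids)

# Two hypothetical queries with k=3:
# query 1 recovers 2 of 3 true neighbors, query 2 recovers 1 of 3.
approx = [[1, 2, 3], [4, 5, 6]]
exact = [[1, 2, 4], [4, 9, 8]]
print(recall_at_k(approx, exact, k=3))  # 0.5
```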
SEMANTIC_MATCH
SEMANTIC_MATCH(column, query_text, threshold) is the SQL interface to vector search. It embeds the query text using the same model as the column, then returns rows where cosine similarity ≥ threshold.
-- Basic semantic search
SELECT id, body
FROM docs
WHERE SEMANTIC_MATCH(body, 'artificial intelligence', 0.4);
-- Threshold guide:
-- 0.8+ → very close matches (near-duplicates)
-- 0.5-0.8 → clearly related content
-- 0.3-0.5 → loosely related, broader results
-- 0.0 → all rows ranked by similarity, no cutoff
Note
SEMANTIC_MATCH requires the --sidecar and --model flags. Without the sidecar, it returns a SidecarUnavailable error.
Hybrid Search
Combine SEMANTIC_MATCH with standard SQL predicates for hybrid search — filter by metadata while searching by semantic similarity:
-- Semantic search + date filter
SELECT id, title, body, created_at
FROM articles
WHERE SEMANTIC_MATCH(body, 'machine learning', 0.5)
AND created_at > '2024-01-01'
AND category = 'technology'
ORDER BY id
LIMIT 10;
-- Semantic search across multiple conditions
SELECT id, body
FROM docs
WHERE SEMANTIC_MATCH(body, 'database performance', 0.4)
OR SEMANTIC_MATCH(body, 'storage engine', 0.4);
Performance
The HNSW index is built in parallel using Rayon. For 1M vectors at M=16, ef_construction=200:
- Build time: 66.2 s (15,114 vec/sec) on 16 vCPU
- Index memory: ~2.5 GB for 1M × 128-dim float32 vectors
- Query throughput: ~2,000 QPS at ef_search=200 on a single core
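These figures can be cross-checked with simple arithmetic. The snippet below uses only numbers stated on this page; the variable names are illustrative. The source does not break down the ~2.5 GB index memory, so only the raw vector payload is computed here.

```python
# Single-core throughput implied by the mean latency at ef_search=200.
mean_latency_us = 459.4
qps = 1_000_000 / mean_latency_us        # about 2,177 queries per second

# Build time implied by the stated build rate.
num_vectors = 1_000_000
build_rate = 15_114                      # vectors per second
build_seconds = num_vectors / build_rate # about 66.2 s

# Raw float32 payload, before any graph links or other index overhead.
dims = 128
bytes_per_float32 = 4
raw_vector_bytes = num_vectors * dims * bytes_per_float32  # 512,000,000 bytes

print(round(qps), round(build_seconds, 1), raw_vector_bytes)
```

The implied single-core rate of roughly 2,177 QPS is consistent with the rounded ~2,000 QPS figure.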
The ANALYZE command updates statistics used by the adaptive query planner to decide between HNSW search and brute-force scan. For small tables (<1000 rows), brute-force is often faster than HNSW traversal.
-- Update statistics for the query planner
ANALYZE docs;
-- ANALYZE docs: 8 rows sampled
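For intuition on why a brute-force scan can win on small tables: with only a few rows, scoring every vector directly is a single short loop with no graph traversal. The sketch below is a hypothetical illustration of an exact thresholded scan, not GalaxDB's planner or executor; `brute_force_match` and the sample rows are assumptions.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)

def brute_force_match(rows, query_vec, threshold, limit=None):
    """Score every row exactly and keep those with similarity >= threshold.

    rows: iterable of (row_id, vector). Returns [(row_id, similarity)],
    most similar first.
    """
    hits = []
    for row_id, vec in rows:
        sim = cosine_similarity(vec, query_vec)
        if sim >= threshold:
            hits.append((row_id, sim))
    hits.sort(key=lambda pair: pair[1], reverse=True)
    return hits if limit is None else hits[:limit]

rows = [(1, [1.0, 0.0]), (2, [0.0, 1.0]), (3, [1.0, 1.0])]
matches = brute_force_match(rows, [1.0, 0.0], threshold=0.5)
print([row_id for row_id, _ in matches])  # [1, 3]
```

The scan costs one similarity computation per row, so its cost grows linearly with table size, which is where an HNSW index becomes the better choice.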