
Vector Search (HNSW)

GalaxDB uses a custom HNSW (Hierarchical Navigable Small World) implementation for approximate nearest neighbor search. On SIFT-1M it achieves recall@10 = 0.990 at ef_search=200, with a mean query latency of 459 µs.

HNSW Overview

HNSW builds a multi-layer graph where each node is a vector. The top layers are sparse (long-range connections) and the bottom layer is dense (short-range connections). Search starts at the top layer and greedily descends to the bottom, narrowing the candidate set at each layer.

Key parameters:

  • M (default: 16) — number of bidirectional links per node. Higher M = better recall, more memory.
  • ef_construction (default: 200) — candidate set size during index build. Higher = better recall, slower build.
  • ef_search — candidate set size during query. Higher = better recall, slower query. Tunable per query.
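The layered descent and the role of ef_search can be sketched in a few lines. This is the textbook HNSW search pattern on a toy graph, not GalaxDB's actual code: the function names, the dict-based graph layout, and the use of plain Euclidean distance are all assumptions made for illustration.

```python
import heapq
import math

def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def search_layer(vectors, neighbors, query, entry, ef):
    """Best-first search in one layer; returns up to ef (distance, node) pairs, nearest first."""
    visited = {entry}
    d0 = dist(vectors[entry], query)
    candidates = [(d0, entry)]   # min-heap: closest unexpanded node first
    results = [(-d0, entry)]     # max-heap via negated distance: worst kept result on top
    while candidates:
        d, node = heapq.heappop(candidates)
        if len(results) >= ef and d > -results[0][0]:
            break  # the closest candidate is already worse than the worst kept result
        for nb in neighbors.get(node, ()):
            if nb in visited:
                continue
            visited.add(nb)
            dn = dist(vectors[nb], query)
            if len(results) < ef or dn < -results[0][0]:
                heapq.heappush(candidates, (dn, nb))
                heapq.heappush(results, (-dn, nb))
                if len(results) > ef:
                    heapq.heappop(results)  # drop the current worst result
    return sorted((-nd, n) for nd, n in results)

def hnsw_search(vectors, layers, entry, query, k, ef_search):
    """layers[0] is the dense bottom layer; layers[-1] is the sparse top layer."""
    for layer in reversed(layers[1:]):
        # Upper layers: greedy walk that keeps only the single best node.
        entry = search_layer(vectors, layer, query, entry, ef=1)[0][1]
    # Bottom layer: keep ef_search candidates, then return the k nearest.
    found = search_layer(vectors, layers[0], query, entry, ef=max(ef_search, k))
    return [n for _, n in found[:k]]

# Toy data (made up): eight points on a line, a chain at layer 0, three hubs at layer 1.
vectors = {i: (float(i), 0.0) for i in range(8)}
layer0 = {i: [j for j in (i - 1, i + 1) if 0 <= j < 8] for i in range(8)}
layer1 = {0: [4], 4: [0, 7], 7: [4]}

print(hnsw_search(vectors, [layer0, layer1], 0, (5.2, 0.0), k=2, ef_search=3))  # [5, 6]
```

In this sketch a larger ef_search only widens the result heap on the bottom layer, which is why it trades query time for recall without requiring an index rebuild.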

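GalaxDB's HNSW uses cosine similarity (normalized dot product) as its similarity metric. Vectors are L2-normalized at insert time, so cosine similarity reduces to a plain dot product. For unit vectors, squared Euclidean distance equals 2 − 2 × cosine similarity, so ranking neighbors by cosine similarity is equivalent to ranking them by Euclidean distance on the unit sphere.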
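For unit-length vectors, squared Euclidean distance and cosine similarity are tied by ‖a − b‖² = 2 − 2·cos(a, b), so ordering by one yields the same neighbors as ordering by the other. A minimal check of that identity, using made-up vectors and helper names that are not part of GalaxDB:

```python
import math

def l2_normalize(v):
    """Scale v to unit length; the zero vector has no direction, so reject it."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize the zero vector")
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

a = l2_normalize([3.0, 4.0, 0.0])
b = l2_normalize([1.0, 2.0, 2.0])

cos_sim = dot(a, b)  # after normalization, cosine similarity is just the dot product
assert math.isclose(cos_sim, 11 / 15)
assert math.isclose(squared_euclidean(a, b), 2.0 - 2.0 * cos_sim)
```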

Recall vs ef_search

The ef_search parameter controls the recall/latency tradeoff. These numbers are from the SIFT-1M benchmark on AWS c6id.4xlarge (M=16, ef_construction=200):

ef_search   recall@10   mean latency   p99 latency
10          0.762       57.6 µs        101 µs
50          0.959       158.1 µs       228 µs
100         0.983       266.5 µs       364 µs
200         0.990       459.4 µs       616 µs

Dataset: SIFT-1M (1,000,000 × 128-dim float32 vectors, 10,000 queries). Build time: 66.2 s (15,114 vec/sec). Hardware: AWS c6id.4xlarge, Intel Xeon Platinum 8375C.
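To make the metric concrete: recall@10 is the fraction of the true 10 nearest neighbors that the approximate search returns, averaged over all queries. The sketch below shows one way to compute it and to pick the smallest measured ef_search that meets a recall target. The function names and the toy ID lists are illustrative; only the benchmark rows come from the measurements above.

```python
def recall_at_k(approx_ids, exact_ids, k=10):
    """Average overlap between approximate and exact top-k ID lists, one list per query."""
    if len(approx_ids) != len(exact_ids) or not exact_ids:
        raise ValueError("need one non-empty result list per query")
    hits = sum(len(set(a[:k]) & set(e[:k])) for a, e in zip(approx_ids, exact_ids))
    return hits / (k * len(exact_ids))

# (ef_search, recall@10) pairs from the SIFT-1M benchmark.
BENCHMARK = [(10, 0.762), (50, 0.959), (100, 0.983), (200, 0.990)]

def smallest_ef_search(target_recall):
    """Smallest measured ef_search whose recall meets the target, or None if none does."""
    for ef_search, recall in BENCHMARK:
        if recall >= target_recall:
            return ef_search
    return None

# Toy example: two queries, k=3, one miss in the second query.
approx = [[1, 2, 3], [4, 5, 9]]
exact = [[1, 2, 3], [4, 5, 6]]
print(recall_at_k(approx, exact, k=3))  # 5 of 6 true neighbors found
print(smallest_ef_search(0.95))         # 50
```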

SEMANTIC_MATCH

SEMANTIC_MATCH(column, query_text, threshold) is the SQL interface to vector search. It embeds the query text using the same model as the column, then returns rows where cosine similarity ≥ threshold.

SQL
-- Basic semantic search
SELECT id, body
FROM docs
WHERE SEMANTIC_MATCH(body, 'artificial intelligence', 0.4);

-- Threshold guide:
-- 0.8+    → very close matches (near-duplicates)
-- 0.5-0.8 → clearly related content
-- 0.3-0.5 → loosely related, broader results
-- 0.0     → all rows ranked by similarity, no cutoff
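The threshold semantics can be illustrated outside the database. The sketch below assumes embeddings have already been computed (the embedding model itself is not shown) and keeps rows whose cosine similarity to the query vector is at least the threshold. The function name semantic_match and the two-dimensional vectors are made up for the example and do not reflect GalaxDB internals.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def semantic_match(rows, query_vec, threshold):
    """rows: iterable of (row_id, embedding). Keep rows with similarity >= threshold."""
    return [row_id for row_id, emb in rows
            if cosine_similarity(emb, query_vec) >= threshold]

# Hypothetical precomputed embeddings: similarities to the query are 1.0, 0.6, and 0.0.
rows = [(1, [1.0, 0.0]), (2, [0.6, 0.8]), (3, [0.0, 1.0])]
query_vec = [1.0, 0.0]

print(semantic_match(rows, query_vec, 0.8))  # [1]
print(semantic_match(rows, query_vec, 0.5))  # [1, 2]
```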

Note

SEMANTIC_MATCH requires the server to be started with --sidecar and --model flags. Without the sidecar, it returns a SidecarUnavailable error.

Combine SEMANTIC_MATCH with standard SQL predicates for hybrid search — filter by metadata while searching by semantic similarity:

SQL
-- Semantic search + date and category filters
SELECT id, title, body, created_at
FROM articles
WHERE SEMANTIC_MATCH(body, 'machine learning', 0.5)
  AND created_at > '2024-01-01'
  AND category = 'technology'
ORDER BY id
LIMIT 10;

-- Semantic search across multiple conditions
SELECT id, body
FROM docs
WHERE SEMANTIC_MATCH(body, 'database performance', 0.4)
   OR SEMANTIC_MATCH(body, 'storage engine', 0.4);

Performance

The HNSW index is built in parallel using Rayon. For 1M vectors at M=16, ef_construction=200:

  • Build time: 66.2 s (15,114 vec/sec) on 16 vCPU
  • Index memory: ~2.5 GB for 1M × 128-dim float32 vectors
  • Query throughput: ~2,000 QPS at ef_search=200 on a single core
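These figures can be cross-checked with simple arithmetic from the benchmark numbers. The sketch below assumes single-core throughput is roughly the inverse of mean latency, and it computes only the raw vector payload; the documentation does not break down what the rest of the ~2.5 GB index consists of.

```python
n_vectors = 1_000_000
dims = 128
bytes_per_float32 = 4
build_seconds = 66.2
mean_latency_us = 459.4  # mean latency at ef_search=200

# Build rate: about 15,106 vec/sec, matching the quoted 15,114 within rounding of the build time.
build_rate = n_vectors / build_seconds

# One query at a time on one core: about 2,177 QPS, in line with the quoted ~2,000.
single_core_qps = 1_000_000 / mean_latency_us

# Raw float32 payload only: 512,000,000 bytes (512 MB), a fraction of the ~2.5 GB index.
raw_vector_bytes = n_vectors * dims * bytes_per_float32

print(round(build_rate), round(single_core_qps), raw_vector_bytes)
```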

The ANALYZE command updates the statistics that the adaptive query planner uses to choose between HNSW search and a brute-force scan. For small tables (fewer than 1,000 rows), a brute-force scan is often faster than HNSW traversal.

SQL
-- Update statistics for the query planner
ANALYZE docs;
-- ANALYZE docs: 8 rows sampled
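For reference, the brute-force path is simple enough to sketch: score every row against the query and keep the top k. This is a generic exact scan over unit vectors, not GalaxDB's planner or executor; the function name and toy vectors are assumptions for illustration.

```python
import heapq

def brute_force_top_k(vectors, query, k):
    """vectors: dict of id -> unit vector. Returns the k ids with the highest dot product."""
    scored = ((sum(x * y for x, y in zip(vec, query)), vec_id)
              for vec_id, vec in vectors.items())
    return [vec_id for _, vec_id in heapq.nlargest(k, scored)]

# Toy unit vectors (made up); dot products with the query are 0.8, 0.96, and 0.6.
table = {1: [1.0, 0.0], 2: [0.6, 0.8], 3: [0.0, 1.0]}
print(brute_force_top_k(table, [0.8, 0.6], k=2))  # [2, 1]
```

The scan touches every row, so its cost grows linearly with table size, but it has no graph traversal overhead and always returns exact results, which is why it can win on very small tables.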