Vector Search (HNSW)

GalaxDB uses a custom HNSW (Hierarchical Navigable Small World) implementation for approximate nearest neighbor search. On SIFT-1M it reaches recall@10 = 0.990 at ef_search=200, with a mean query latency of 459 µs.

HNSW Overview

HNSW builds a multi-layer graph where each node is a vector. The top layers are sparse (long-range connections) and the bottom layer is dense (short-range connections). Search starts at the top layer and greedily descends to the bottom, narrowing the candidate set at each layer.

Key parameters:

M (default: 16) - number of bidirectional links per node in each layer. Higher M = better recall, more memory.
ef_construction (default: 200) - candidate set size during index build. Higher = better recall, slower build.
ef_search - candidate set size during query. Higher = better recall, slower query. Tunable per query.
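
The layered search and the role of ef_search can be sketched in Python. This is a simplified illustration, not GalaxDB's implementation: the names hnsw_search and search_layer, the dict-based graph layout, and the use of plain Euclidean distance are all assumptions, and index construction (where M and ef_construction apply) is omitted.

```python
import heapq
import math

def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def search_layer(vectors, graph, query, entry_points, ef):
    """Best-first search in one layer; returns up to ef (distance, node) pairs, closest first."""
    visited = set(entry_points)
    # Min-heap of nodes that still have to be expanded
    candidates = [(dist(query, vectors[n]), n) for n in entry_points]
    heapq.heapify(candidates)
    # Max-heap (negated distances) holding the best ef nodes seen so far
    results = [(-d, n) for d, n in candidates]
    heapq.heapify(results)
    while len(results) > ef:
        heapq.heappop(results)
    while candidates:
        d, node = heapq.heappop(candidates)
        if len(results) >= ef and d > -results[0][0]:
            break  # closest unexpanded node is worse than the worst kept result
        for nb in graph.get(node, ()):
            if nb in visited:
                continue
            visited.add(nb)
            d_nb = dist(query, vectors[nb])
            if len(results) < ef or d_nb < -results[0][0]:
                heapq.heappush(candidates, (d_nb, nb))
                heapq.heappush(results, (-d_nb, nb))
                if len(results) > ef:
                    heapq.heappop(results)
    return sorted((-neg_d, n) for neg_d, n in results)

def hnsw_search(vectors, layers, entry_point, query, k, ef_search):
    """layers[0] is the dense bottom layer; layers[-1] is the sparsest top layer."""
    entry = [entry_point]
    # Upper layers: greedy descent with a candidate set of size 1
    for graph in reversed(layers[1:]):
        entry = [search_layer(vectors, graph, query, entry, ef=1)[0][1]]
    # Bottom layer: wider search that keeps ef_search candidates
    found = search_layer(vectors, layers[0], query, entry, ef=max(ef_search, k))
    return [n for _, n in found[:k]]

# Toy index: eight points on a line, three layers (hypothetical data)
vectors = [(float(i), 0.0) for i in range(8)]
layer0 = {i: [j for j in (i - 1, i + 1) if 0 <= j < 8] for i in range(8)}
layer1 = {0: [4], 4: [0, 7], 7: [4]}
layer2 = {0: []}
nearest = hnsw_search(vectors, [layer0, layer1, layer2], 0, (5.2, 0.0), k=2, ef_search=4)
```

The upper layers only move the entry point closer to the query; ef_search widens the search in the bottom layer, which is why it trades latency for recall.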

GalaxDB's HNSW uses cosine similarity (the dot product of L2-normalized vectors) as its similarity metric. Vectors are L2-normalized at insert time, so on the unit sphere cosine distance and Euclidean distance produce the same ranking: for unit vectors a and b, ‖a − b‖² = 2(1 − cos(a, b)).
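
The relationship between the two distances on the unit sphere can be checked numerically. For unit vectors, squared Euclidean distance is exactly twice the cosine distance, so both order neighbors identically. The helper names below are illustrative only and are not GalaxDB functions.

```python
import math

def l2_normalize(v):
    """Scale v to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

a = l2_normalize([3.0, 4.0, 0.0])
b = l2_normalize([1.0, 2.0, 2.0])

cosine_distance = 1.0 - cosine_similarity(a, b)
squared_l2 = squared_euclidean(a, b)

# For unit vectors: ||a - b||^2 = 2 * (1 - cos(a, b))
identity_holds = math.isclose(squared_l2, 2.0 * cosine_distance)
```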

Recall vs ef_search

The ef_search parameter controls the recall/latency tradeoff. These numbers are from the SIFT-1M benchmark on AWS c6id.4xlarge (M=16, ef_construction=200):

ef_search	recall@10	mean latency	p99 latency
10	0.756	57.7 µs	105 µs
50	0.959	156.7 µs	229 µs
100	0.983	266.7 µs	364 µs
200	0.990	458.9 µs	612 µs

Dataset: SIFT-1M (1,000,000 × 128-dim float32 vectors, 10,000 queries). Build time: 65.4 s (15,295 vec/sec). Hardware: AWS c6id.4xlarge, Intel Xeon Platinum 8375C.
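
One way to use the table is to pick the smallest ef_search that meets a recall target within a latency budget. The sketch below hard-codes the benchmark rows above; pick_ef_search is a hypothetical helper, and the numbers apply only to this dataset and hardware.

```python
# (ef_search, recall@10, mean latency in µs, p99 latency in µs), copied from the table above
BENCHMARK = [
    (10, 0.756, 57.7, 105.0),
    (50, 0.959, 156.7, 229.0),
    (100, 0.983, 266.7, 364.0),
    (200, 0.990, 458.9, 612.0),
]

def pick_ef_search(min_recall, max_p99_us=None):
    """Return the smallest benchmarked ef_search that satisfies both limits, or None."""
    for ef, recall, _mean_us, p99_us in BENCHMARK:
        if recall >= min_recall and (max_p99_us is None or p99_us <= max_p99_us):
            return ef
    return None

ef_for_95 = pick_ef_search(0.95)                     # recall target only
ef_for_98_fast = pick_ef_search(0.98, max_p99_us=300)  # no benchmarked row meets both limits
```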

SEMANTIC_MATCH

SEMANTIC_MATCH(column, query_text, threshold) is the SQL interface to vector search. It embeds the query text using the same model as the column, then returns rows where cosine similarity ≥ threshold.

SQL

-- Basic semantic search
SELECT id, body
FROM docs
WHERE SEMANTIC_MATCH(body, 'artificial intelligence', 0.4);

-- Threshold guide:
-- 0.8+    → very close matches (near-duplicates)
-- 0.5-0.8 → clearly related content
-- 0.3-0.5 → loosely related, broader results
-- 0.0     → all rows ranked by similarity, no cutoff

Note

SEMANTIC_MATCH requires the server to be started with --sidecar and --model flags. Without the sidecar, it returns a SidecarUnavailable error.
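
The threshold comparison can be illustrated with a brute-force Python sketch. It assumes embeddings are already computed (the two-dimensional vectors are made up for the example) and that the predicate is a plain "similarity ≥ threshold" test; semantic_match here is a stand-in function, not the engine's code path.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def semantic_match(rows, query_vec, threshold):
    """rows: iterable of (id, embedding). Returns ids whose similarity is >= threshold."""
    return [row_id for row_id, emb in rows
            if cosine_similarity(emb, query_vec) >= threshold]

rows = [
    (1, [1.0, 0.0]),  # same direction as the query: similarity 1.0
    (2, [1.0, 1.0]),  # 45 degrees from the query: similarity about 0.707
    (3, [0.0, 1.0]),  # orthogonal to the query: similarity 0.0
]
query_vec = [1.0, 0.0]

strict = semantic_match(rows, query_vec, 0.8)
related = semantic_match(rows, query_vec, 0.5)
everything = semantic_match(rows, query_vec, 0.0)
```

Lowering the threshold only ever adds rows, which matches the threshold guide above.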

Hybrid Search

Combine SEMANTIC_MATCH with standard SQL predicates for hybrid search - filter by metadata while searching by semantic similarity:

SQL

-- Semantic search + date and category filters
SELECT id, title, body, created_at
FROM articles
WHERE SEMANTIC_MATCH(body, 'machine learning', 0.5)
  AND created_at > '2024-01-01'
  AND category = 'technology'
ORDER BY id
LIMIT 10;

-- Match rows that satisfy either of two semantic queries
SELECT id, body
FROM docs
WHERE SEMANTIC_MATCH(body, 'database performance', 0.4)
   OR SEMANTIC_MATCH(body, 'storage engine', 0.4);

DiskANN (disk-resident, opt-in)

DiskANN is intended for vector sets larger than available RAM. The Vamana graph and full-precision vectors live on disk as fixed-size node records; only the nodes visited during beam search are loaded into a bounded in-memory cache. RAM use is bounded by the cache size rather than growing with the index.
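
The combination of fixed-size node records and a bounded cache can be sketched as follows. The record layout, the LRU eviction policy, and the names NodeCache and write_node are assumptions made for illustration; the actual on-disk format and cache policy are not described here.

```python
import io
import struct
from collections import OrderedDict

DIM = 4          # vector dimensions (hypothetical)
MAX_DEGREE = 3   # neighbor slots per node (hypothetical)
# Assumed layout: DIM float32 values, a uint32 neighbor count, MAX_DEGREE uint32 neighbor ids
RECORD = struct.Struct(f"<{DIM}fI{MAX_DEGREE}I")

def write_node(file, vector, neighbors):
    """Append one fixed-size record, padding unused neighbor slots with 0."""
    padded = list(neighbors) + [0] * (MAX_DEGREE - len(neighbors))
    file.write(RECORD.pack(*vector, len(neighbors), *padded))

class NodeCache:
    """LRU cache over fixed-size node records in a seekable binary file."""

    def __init__(self, file, capacity):
        self.file = file
        self.capacity = capacity
        self.cache = OrderedDict()
        self.disk_reads = 0

    def get(self, node_id):
        if node_id in self.cache:
            self.cache.move_to_end(node_id)  # mark as most recently used
            return self.cache[node_id]
        # Fixed-size records turn the lookup into a single seek
        self.file.seek(node_id * RECORD.size)
        fields = RECORD.unpack(self.file.read(RECORD.size))
        self.disk_reads += 1
        count = fields[DIM]
        node = (fields[:DIM], fields[DIM + 1:DIM + 1 + count])
        self.cache[node_id] = node
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict the least recently used node
        return node

# Five nodes on "disk", but at most two held in memory at any time
disk = io.BytesIO()
for i in range(5):
    write_node(disk, [float(i), 0.0, 0.0, 0.5], [(i + 1) % 5])
cache = NodeCache(disk, capacity=2)
for node_id in (0, 1, 0, 2, 1):
    cache.get(node_id)
```

Memory held by the cache depends on capacity, not on how many records the file contains.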

Incremental inserts work via a FreshDiskANN-style in-memory delta: a new row is immediately findable without a full rebuild. HNSW remains the default; DiskANN is opt-in.
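
A simplified view of how an in-memory delta makes new rows findable: results from the on-disk index are merged with distances computed against the rows inserted since the last build. This sketch scans the delta by brute force and uses Euclidean distance, both of which are simplifications; search_with_delta and its arguments are hypothetical.

```python
import heapq
import math

def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def search_with_delta(disk_hits, delta, query, k):
    """Merge on-disk results with rows that exist only in the in-memory delta.

    disk_hits: (distance, row_id) pairs returned by the on-disk index.
    delta: dict mapping row_id -> vector for rows inserted since the last build.
    """
    delta_hits = [(dist(query, vec), row_id) for row_id, vec in delta.items()]
    merged = heapq.nsmallest(k, list(disk_hits) + delta_hits)
    return [row_id for _, row_id in merged]

disk_hits = [(0.5, 10), (0.9, 11)]   # from the on-disk graph (made-up distances)
delta = {99: [1.0, 0.0]}             # row inserted a moment ago
top2 = search_with_delta(disk_hits, delta, [1.0, 0.1], k=2)
```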

Recall@10 is verified at ≥ 0.90 (cosine) and ≥ 0.85 (L2) against exact brute-force ground truth.

Semantic Cache

A repeat query whose embedding has at least the configured cosine similarity to a cached query's embedding is served from the cache, with no sidecar call and no HNSW traversal.

SQL

CREATE SEMANTIC CACHE FOR TABLE docs SIMILARITY 0.97 TTL 300;
DROP SEMANTIC CACHE FOR TABLE docs;

Entries invalidate automatically on INSERT/UPDATE/DELETE. The cache hit counter galaxdb_semantic_cache_hits_total is exposed on /metrics.
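
The cache behavior described above can be sketched in Python. This is an illustration under stated assumptions: the class SemanticCache, its linear scan over entries, and the interpretation of TTL as seconds are not taken from GalaxDB.

```python
import math
import time

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

class SemanticCache:
    def __init__(self, similarity, ttl, clock=time.monotonic):
        self.similarity = similarity  # minimum cosine similarity for a hit
        self.ttl = ttl                # entry lifetime (assumed to be seconds)
        self.clock = clock
        self.entries = []             # (query_embedding, result, expires_at)
        self.hits = 0

    def get(self, embedding):
        now = self.clock()
        # Drop expired entries before looking for a match
        self.entries = [e for e in self.entries if e[2] > now]
        for cached_embedding, result, _expires_at in self.entries:
            if cosine_similarity(embedding, cached_embedding) >= self.similarity:
                self.hits += 1
                return result
        return None

    def put(self, embedding, result):
        self.entries.append((embedding, result, self.clock() + self.ttl))

    def invalidate(self):
        """Clear everything, as would happen on INSERT/UPDATE/DELETE."""
        self.entries.clear()
```

A caller would try get() with the query embedding first and fall back to the full search plus put() on a miss.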

Historical Vector Search

Run a semantic search scoped to a past snapshot:

SQL

SELECT id, body
FROM docs
WHERE SEMANTIC_MATCH(body, 'AI deep learning', 0.4)
AT VERSION 'train-v1' CONSISTENCY 'SEMANTIC_SNAPSHOT'
LIMIT 5;

Rows inserted after the snapshot are excluded. The live index is not mutated.

Performance

The HNSW index is built in parallel using Rayon. For 1M vectors at M=16, ef_construction=200:

Build time: 65.4 s (15,295 vec/sec) on 16 vCPU

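Every SEMANTIC_MATCH runs a two-stage search internally: the HNSW approximate search returns a candidate set (default 200), then the engine fetches the raw float32 vectors and recomputes the exact cosine distance for each candidate. The final top-k is therefore ranked by exact distances rather than approximate ones, so no separate reranking step is needed for cosine similarity. Candidate selection itself is still approximate: a true neighbor that the HNSW stage misses is not recovered by the rerank.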
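
A minimal Python sketch of the second stage, assuming candidates arrive as ids from the approximate search and raw vectors are available by id (rerank and the sample data are hypothetical, not GalaxDB internals). It also shows that exact rescoring only reorders the candidates it is given: a close vector that the first stage did not return stays out of the result.

```python
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def rerank(candidate_ids, vectors, query, k):
    """Recompute the exact cosine distance for each candidate and keep the k closest."""
    scored = sorted((cosine_distance(vectors[i], query), i) for i in candidate_ids)
    return [i for _, i in scored[:k]]

vectors = {
    0: [1.0, 0.0],
    1: [1.0, 1.0],
    2: [0.0, 1.0],
    3: [1.0, 0.01],  # very close to the query, but absent from the candidate set below
}
query = [1.0, 0.0]

# Candidate ids in whatever order the approximate stage produced them
top2 = rerank([2, 1, 0], vectors, query, k=2)
```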

The ANALYZE command updates the statistics that the adaptive query planner uses to choose between HNSW search and a brute-force scan. For small tables (fewer than 1,000 rows), brute-force is often faster than HNSW traversal.

SQL

-- Update statistics for the query planner
ANALYZE docs;
-- ANALYZE docs: 8 rows sampled
