Near-Dedup (MinHash LSH)

GalaxDB can identify and filter near-duplicate rows using MinHash LSH (Locality-Sensitive Hashing). The WHERE NOT DUPLICATE clause returns one representative row per near-duplicate cluster, typically reducing dataset size by 15–30%.

Overview

Near-duplicate detection is critical for training data quality. Duplicate or near-duplicate examples cause models to overfit to repeated patterns, reducing generalization. Traditional exact deduplication misses paraphrases, reformatted text, and minor edits.

MinHash LSH solves this by computing a compact signature (MinHash) for each document and using LSH to hash documents with similar signatures into the same bucket. Documents that share a bucket are treated as near-duplicates.

MinHash LSH

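MinHash estimates the Jaccard similarity between documents using a set of random hash functions. For two documents A and B:

Jaccard(A, B) = |A ∩ B| / |A ∪ B| (the fraction of shingles the two documents share)
MinHash approximates this value with k hash functions: the fraction of signature positions on which two documents agree estimates their Jaccard similarity, and signatures for n documents are computed in O(n) time
LSH splits each signature into bands; documents whose signatures match in at least one band fall into the same bucket and are treated as near-duplicates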

GalaxDB uses character-level shingles (n-grams) for the MinHash computation, which makes the comparison robust to word-level edits and reformatting.
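To make the definitions above concrete, here is a minimal sketch of character shingling, exact Jaccard similarity, and a MinHash estimate of it. It is an illustration only, not GalaxDB's implementation: the shingle length of 5, the signature length k, the lowercase and whitespace normalization, and the use of salted BLAKE2b as the hash family are all assumptions, and the function names are hypothetical.

```python
import hashlib


def shingles(text: str, n: int = 5) -> set[str]:
    """Return the set of character n-grams of a lowercased, whitespace-normalized text."""
    normalized = " ".join(text.lower().split())
    if len(normalized) < n:
        return {normalized}
    return {normalized[i:i + n] for i in range(len(normalized) - n + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Exact Jaccard similarity |A ∩ B| / |A ∪ B|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def _hash(shingle: str, seed: int) -> int:
    """One member of the hash family: BLAKE2b salted with the seed (an assumed choice)."""
    digest = hashlib.blake2b(
        shingle.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(items: set[str], k: int = 128) -> list[int]:
    """For each of k hash functions, keep the minimum hash value over all shingles."""
    return [min(_hash(s, seed) for s in items) for seed in range(k)]


def estimate_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """The fraction of positions where two signatures agree estimates Jaccard similarity."""
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return matches / len(sig_a)


if __name__ == "__main__":
    a = shingles("Introduction to machine learning for beginners, part one.")
    b = shingles("Introduction to  Machine Learning for beginners, part 1.")
    print("exact:    ", round(jaccard(a, b), 3))
    print("estimated:", round(estimate_jaccard(minhash_signature(a), minhash_signature(b)), 3))
```

The estimate gets closer to the exact value as k grows, at the cost of a longer signature per document.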

Note

MinHash LSH runs in O(n) time: it scales linearly with the number of rows. For 1M rows, deduplication typically completes in seconds.
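The linear scaling comes from the banding step: each document is placed into a fixed number of hash buckets in a single pass, and only documents that share a bucket are ever compared. The sketch below shows that step on hand-written signatures. The band and row counts are illustrative assumptions (the values GalaxDB uses are not stated here), and the function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def lsh_candidate_pairs(
    signatures: dict[str, list[int]], bands: int, rows: int
) -> set[tuple[str, str]]:
    """Return pairs of document ids whose signatures agree on at least one full band."""
    buckets: dict[tuple[int, tuple[int, ...]], list[str]] = defaultdict(list)
    for doc_id, sig in signatures.items():
        if len(sig) != bands * rows:
            raise ValueError("signature length must equal bands * rows")
        for band in range(bands):
            chunk = tuple(sig[band * rows:(band + 1) * rows])
            # The band index is part of the key so that equal chunks in
            # different bands do not collide.
            buckets[(band, chunk)].append(doc_id)

    pairs: set[tuple[str, str]] = set()
    for ids in buckets.values():
        for a, b in combinations(sorted(ids), 2):
            pairs.add((a, b))
    return pairs


def candidate_probability(s: float, bands: int, rows: int) -> float:
    """Probability that two documents with Jaccard similarity s share at least one band."""
    return 1.0 - (1.0 - s ** rows) ** bands


if __name__ == "__main__":
    sigs = {
        "a": [1, 2, 3, 4, 5, 6],
        "b": [1, 2, 3, 9, 9, 9],  # agrees with "a" on the first band only
        "c": [7, 7, 7, 8, 8, 8],
    }
    print(lsh_candidate_pairs(sigs, bands=2, rows=3))
    print(candidate_probability(0.8, bands=20, rows=5))
```

More rows per band make a match stricter, and more bands give similar documents more chances to collide, so the two parameters together set the effective similarity threshold.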

WHERE NOT DUPLICATE

SQL

-- Select only unique documents (one per near-duplicate cluster)
SELECT * FROM docs WHERE NOT DUPLICATE;

-- Combine with other filters
SELECT id, body
FROM docs
WHERE NOT DUPLICATE
  AND created_at > '2024-01-01';

-- Count unique vs total
SELECT COUNT(*) FROM docs;                    -- total rows
SELECT COUNT(*) FROM docs WHERE NOT DUPLICATE; -- unique rows

The WHERE NOT DUPLICATE clause operates on the text columns in the table. For tables with embedding columns, it uses the text values (not the vectors) for MinHash computation.
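Returning one representative per near-duplicate cluster can be pictured as a union-find pass over the near-duplicate pairs. The sketch below is a conceptual model only: how GalaxDB chooses the surviving row is not specified here, so keeping the lowest id in each cluster is an assumption, and `cluster_representatives` is a hypothetical helper rather than part of any GalaxDB API.

```python
def cluster_representatives(ids: list[int], pairs: list[tuple[int, int]]) -> list[int]:
    """Merge near-duplicate pairs into clusters and return the lowest id of each cluster."""
    parent = {i: i for i in ids}

    def find(x: int) -> int:
        # Walk up to the root, halving the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            continue
        # Always attach the larger root under the smaller one, so the root
        # of every cluster is its lowest id (the assumed representative).
        if root_a < root_b:
            parent[root_b] = root_a
        else:
            parent[root_a] = root_b

    return sorted({find(i) for i in ids})


if __name__ == "__main__":
    # Rows 1 and 2 are near-duplicates; row 3 stands alone.
    print(cluster_representatives([1, 2, 3], [(1, 2)]))
```

Because membership is transitive in this model, rows that are each similar to a common neighbor end up in the same cluster even if they were never compared directly.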

Use Cases

Training Data Quality

Deduplicate before creating a training snapshot:

SQL

-- Check deduplication impact
SELECT COUNT(*) FROM training_data;                    -- 1,000,000 rows
SELECT COUNT(*) FROM training_data WHERE NOT DUPLICATE; -- ~750,000 unique rows

-- Create deduplicated training snapshot
-- (WHERE NOT DUPLICATE is applied at export time)
CREATE VERSION TAG 'train-deduped'
  FOR TRAINING
  WITH TRAINING PRECISION 'float32';

Web Crawl Deduplication

SQL

-- Insert crawled pages
BULK INSERT INTO pages (id, url, content) VALUES
  (1, 'https://example.com/page1', 'Introduction to machine learning...'),
  (2, 'https://mirror.com/page1', 'Introduction to machine learning...'),  -- near-duplicate
  (3, 'https://example.com/page2', 'Deep learning with transformers...');

-- Query unique pages only
SELECT id, url, content
FROM pages
WHERE NOT DUPLICATE;
-- Returns rows 1 and 3 (row 2 is a near-duplicate of row 1)

RAG Knowledge Base

Python

import galaxdb

db = galaxdb.Database("./knowledge-base")

# After ingesting documents from multiple sources
total = db.execute("SELECT COUNT(*) FROM docs")[0]['count']
unique = db.execute("SELECT COUNT(*) FROM docs WHERE NOT DUPLICATE")[0]['count']
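# Guard against an empty table to avoid dividing by zero
reduction = (1 - unique / total) * 100 if total else 0.0
print(f"Total: {total}, Unique: {unique}, Reduction: {reduction:.1f}%")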

# Query with deduplication for RAG retrieval
results = db.execute("""
    SELECT id, body
    FROM docs
    WHERE SEMANTIC_MATCH(body, 'machine learning', 0.4)
      AND NOT DUPLICATE
""")
for row in results:
    print(row['body'][:100])
