Version Tags

Version tags are named snapshots of the database at a specific point in time. They enable time-travel queries (AT VERSION) and training data export (FOR TRAINING).

CREATE VERSION TAG

SQL

CREATE VERSION TAG 'tag_name'
  [FOR TRAINING
   [WITH TRAINING PRECISION {'float32' | 'sq8' | 'rabitq'}]
   [TRAINING SEED n]];

Creates an immutable snapshot of the current database state. The tag name must be unique. Tags are lightweight: they reference existing data blocks rather than copying them.
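To make "reference rather than copy" concrete, here is a toy in-memory model in Python. It is only a sketch of the idea: the TagStore class and its methods are hypothetical names invented for illustration and say nothing about how the engine actually stores blocks.

```python
class TagStore:
    """Toy model of version tags. Hypothetical; not the engine's implementation."""

    def __init__(self):
        self.blocks = {}    # block id -> immutable tuple of rows
        self.live = []      # block ids that make up the current state
        self.tags = {}      # tag name -> tuple of block ids
        self._next_id = 0

    def append_rows(self, rows):
        """Write rows into a new block and add it to the live state."""
        block_id = self._next_id
        self._next_id += 1
        self.blocks[block_id] = tuple(rows)
        self.live.append(block_id)

    def create_tag(self, name):
        """Record references to the current blocks; no row data is copied."""
        if name in self.tags:
            raise ValueError(f"tag {name!r} already exists")
        self.tags[name] = tuple(self.live)

    def rows(self, tag=None):
        """Return the rows of the live state, or of a tagged snapshot."""
        block_ids = self.live if tag is None else self.tags[tag]
        return [row for b in block_ids for row in self.blocks[b]]


store = TagStore()
store.append_rows([(1, "a"), (2, "b")])
store.create_tag("before-migration")
store.append_rows([(3, "c")])          # written after the tag was created

print(len(store.rows("before-migration")))  # 2
print(len(store.rows()))                    # 3
```

In this model, creating a tag costs one small tuple of block ids regardless of how many rows the blocks hold, and later writes never change what the tag sees.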

SQL

-- Simple snapshot
CREATE VERSION TAG 'before-migration';

-- Training snapshot with default precision (float32)
CREATE VERSION TAG 'train-v1' FOR TRAINING;

-- Training snapshot with quantization
CREATE VERSION TAG 'train-v1-sq8'
  FOR TRAINING
  WITH TRAINING PRECISION 'sq8';

-- Training snapshot with seed for reproducibility
CREATE VERSION TAG 'experiment-42'
  FOR TRAINING
  WITH TRAINING PRECISION 'float32'
  TRAINING SEED 42;
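When statements like these are generated from application code, a small helper can assemble them from the grammar shown above. The sketch below is an assumption-laden illustration: build_create_tag is a hypothetical function, not part of any client library, and rejecting single quotes in tag names is a conservative choice made here because quoting rules are not described in this section.

```python
VALID_PRECISIONS = ("float32", "sq8", "rabitq")
UINT64_MAX = 2**64 - 1


def build_create_tag(name, for_training=False, precision=None, seed=None):
    """Build a CREATE VERSION TAG statement (hypothetical helper)."""
    # Assumption: escaping rules are unknown, so quotes are rejected outright.
    if not name or "'" in name:
        raise ValueError("tag name must be non-empty and contain no single quotes")
    # PRECISION and SEED are only valid inside the FOR TRAINING clause.
    if (precision is not None or seed is not None) and not for_training:
        raise ValueError("precision and seed require for_training=True")

    parts = [f"CREATE VERSION TAG '{name}'"]
    if for_training:
        parts.append("FOR TRAINING")
        if precision is not None:
            if precision not in VALID_PRECISIONS:
                raise ValueError(f"precision must be one of {VALID_PRECISIONS}")
            parts.append(f"WITH TRAINING PRECISION '{precision}'")
        if seed is not None:
            if not isinstance(seed, int) or not 0 <= seed <= UINT64_MAX:
                raise ValueError("seed must be an integer in the uint64 range")
            parts.append(f"TRAINING SEED {seed}")
    return " ".join(parts) + ";"


print(build_create_tag("before-migration"))
# CREATE VERSION TAG 'before-migration';
print(build_create_tag("experiment-42", for_training=True, precision="float32", seed=42))
# CREATE VERSION TAG 'experiment-42' FOR TRAINING WITH TRAINING PRECISION 'float32' TRAINING SEED 42;
```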

FOR TRAINING Options

The FOR TRAINING clause exports the snapshot as a Lance dataset, accessible via db.training_dataset('tag') in Python.

Option	Values	Description
TRAINING PRECISION	float32 (default), sq8, rabitq	Embedding vector precision in the exported Lance dataset
TRAINING SEED	uint64	Random seed for reproducible dataset shuffling
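The point of TRAINING SEED is that the same seed yields the same row order on every export. The engine's shuffling algorithm is not specified here, so the sketch below uses Python's random module purely as a stand-in to show the property; shuffled_ids is a hypothetical helper.

```python
import random


def shuffled_ids(ids, seed):
    """Return a shuffled copy of ids, determined entirely by seed."""
    rng = random.Random(seed)   # private generator; global state is untouched
    out = list(ids)
    rng.shuffle(out)
    return out


ids = list(range(10))
first = shuffled_ids(ids, seed=42)
second = shuffled_ids(ids, seed=42)

print(first == second)       # True: same seed, same order
print(sorted(first) == ids)  # True: a shuffle only reorders, nothing is lost
```

Recording the seed alongside the tag name is what lets a later run reproduce the exact batch order of an earlier experiment.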

AT VERSION

SQL

SELECT ... FROM table [WHERE ...] AT VERSION 'tag_name';
SELECT ... FROM table [WHERE ...] AT VERSION timestamp_uint64;

Queries the table as it existed when the tag was created. The timestamp form accepts a uint64 Unix timestamp in microseconds.
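Because the timestamp is in microseconds rather than the more common seconds or milliseconds, it is easy to be off by a factor of 1,000. One way to compute the value in Python is sketched below; to_version_timestamp is a hypothetical helper name, and integer arithmetic is used to avoid float rounding.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_version_timestamp(moment):
    """Convert a timezone-aware datetime to Unix microseconds as an int."""
    if moment.tzinfo is None:
        raise ValueError("use a timezone-aware datetime to avoid local-time ambiguity")
    # Dividing one timedelta by another with // yields an exact integer.
    return (moment - EPOCH) // timedelta(microseconds=1)


ts = to_version_timestamp(datetime(2024, 5, 11, tzinfo=timezone.utc))
print(ts)  # 1715385600000000
```

The printed value corresponds to 2024-05-11 00:00:00 UTC.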

SQL

-- Query by tag name
SELECT * FROM docs AT VERSION 'train-v1';

-- Query by timestamp
SELECT * FROM docs AT VERSION 1715385600000000;

-- Combine with WHERE - AT VERSION is always the final clause
SELECT id, body
FROM docs
WHERE SEMANTIC_MATCH(body, 'machine learning', 0.4)
AT VERSION 'train-v1';

Warning

AT VERSION must be the final clause. Placing WHERE after it raises a typed parse error.
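If queries are assembled in application code, building them through one function keeps the clause order correct by construction. This is a minimal sketch: select_at_version is a hypothetical helper, it does no escaping or validation of its inputs, and it assumes tag names are quoted while timestamps are written as bare integers, as in the examples above.

```python
def select_at_version(columns, table, version, where=None):
    """Build a SELECT with AT VERSION as the final clause (hypothetical helper)."""
    sql = f"SELECT {', '.join(columns)} FROM {table}"
    if where:
        sql += f" WHERE {where}"
    # AT VERSION is appended last, after any WHERE clause.
    if isinstance(version, int):
        sql += f" AT VERSION {version}"      # microsecond timestamp form
    else:
        sql += f" AT VERSION '{version}'"    # tag name form
    return sql + ";"


print(select_at_version(["*"], "docs", 1715385600000000))
# SELECT * FROM docs AT VERSION 1715385600000000;
print(select_at_version(
    ["id", "body"], "docs", "train-v1",
    where="SEMANTIC_MATCH(body, 'machine learning', 0.4)",
))
# SELECT id, body FROM docs WHERE SEMANTIC_MATCH(body, 'machine learning', 0.4) AT VERSION 'train-v1';
```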

Note

AT VERSION queries are read-only. INSERT, UPDATE, and DELETE against a historical snapshot are not supported.

Examples

Training pipeline

SQL

-- Insert training data
BULK INSERT INTO training_data (id, text, label) VALUES
  (1, 'positive example', 1),
  (2, 'negative example', 0),
  (3, 'another positive', 1);

-- Create training snapshot
CREATE VERSION TAG 'train-2024-01'
  FOR TRAINING
  WITH TRAINING PRECISION 'float32'
  TRAINING SEED 12345;

-- Add more data later
INSERT INTO training_data (id, text, label) VALUES (4, 'new data', 1);

-- The snapshot still has only 3 rows
SELECT COUNT(*) FROM training_data AT VERSION 'train-2024-01';  -- 3
SELECT COUNT(*) FROM training_data;  -- 4

Python training workflow

Python

import galaxdb
import lance
import torch

db = galaxdb.Database("./data")

# Create snapshot
ts = db.create_training_snapshot('train-v1', seed=42)
print(f"Snapshot at: {ts}")

# Export Lance dataset
path = db.training_dataset('train-v1')

# Load into PyTorch
dataset = lance.dataset(path).to_pytorch()
loader = torch.utils.data.DataLoader(dataset, batch_size=32)

for batch in loader:
    embeddings = batch['text']  # float32 tensors
    labels = batch['label']
    # ... training step
