Training Export
GalaxDB can export versioned datasets in Lance format, a columnar format optimized for ML training. A SQL command creates the snapshot, and a single Python call exports it as a PyTorch-ready dataset with zero-copy memory-mapped access.
Overview
The training export workflow:
- Create a version tag with FOR TRAINING
- Call db.training_dataset('tag') to export the Lance dataset
- Load the dataset into PyTorch with lance.dataset(path).to_pytorch()
-- Step 1: Create training snapshot
CREATE VERSION TAG 'train-v1'
    FOR TRAINING
    WITH TRAINING PRECISION 'float32'
    TRAINING SEED 42;

import galaxdb
import lance
import torch
db = galaxdb.Database("./data")
# Step 2: Export as Lance dataset
path = db.training_dataset("train-v1")
print(f"Dataset at: {path}")
# Step 3: Load into PyTorch (zero-copy, memory-mapped)
dataset = lance.dataset(path).to_pytorch()
# The Lance dataset is an IterableDataset, and DataLoader rejects
# shuffle=True for iterable-style datasets, so it is left unset here.
loader = torch.utils.data.DataLoader(dataset, batch_size=32)

for batch in loader:
    # batch is a dict of tensors, one per column
    ids = batch['id']
    embeddings = batch['body']  # float32[384] per row
    # ... train your model

Lance Format
Lance is a columnar format designed for ML workloads. Key properties:
- Zero-copy reads: PyTorch tensors are memory-mapped directly from disk — no deserialization overhead
- Columnar layout: read only the columns you need (e.g., just embeddings, not text)
- Random access: O(1) row access by index, unlike Parquet, where fetching a single row generally means decoding the page or row group that contains it
- Versioned: each export is a separate Lance dataset directory
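To make the zero-copy and random-access properties concrete, here is a minimal sketch using only the Python standard library. It is not Lance's on-disk format: the flat file of fixed-width float32 rows, the toy dimensionality, and the helper names write_vectors and read_row are all assumptions made for illustration. It shows why fixed-width rows allow a row lookup by offset and how a memory map exposes the bytes without reading the whole file.

```python
import mmap
import os
import struct
import tempfile

DIM = 4              # toy dimensionality; the examples on this page use 384
ROW_BYTES = DIM * 4  # float32 takes 4 bytes per dimension


def write_vectors(path, rows):
    """Write rows as tightly packed float32 values in native byte order."""
    with open(path, "wb") as f:
        for row in rows:
            f.write(struct.pack(f"{DIM}f", *row))


def read_row(buf, index):
    """Return row `index` as a float32 view over the mapping, without copying."""
    start = index * ROW_BYTES  # fixed-width rows: the offset is one multiplication
    return memoryview(buf)[start:start + ROW_BYTES].cast("f")


rows = [[i + 0.0, i + 0.5, i + 0.25, i + 0.125] for i in range(5)]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "vectors.bin")
    write_vectors(path, rows)
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as buf:
            # Copy the values out so no view outlives the mapping.
            third_row = list(read_row(buf, 2))

print(third_row)  # [2.0, 2.5, 2.25, 2.125]
```

Only the pages backing the requested row are touched, which is the same idea that lets a training loop read from a dataset larger than RAM.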
The exported dataset lives at <database>/training_exports/<tag>_<timestamp>/. Repeat calls with the same tag overwrite the previous export.
Precision Options
The WITH TRAINING PRECISION clause controls how embedding vectors are stored in the Lance dataset:
| Precision | Bytes/dim | Size vs float32 | Use case |
|---|---|---|---|
| float32 | 4 | 1× | Full precision, default |
| sq8 | 1 | 4× smaller | 8-bit scalar quantization |
| rabitq | 1/8 (1 bit) | 32× smaller | Binary quantization, maximum compression |
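As a rough sizing aid, the arithmetic behind the size column can be sketched as follows. The row count is a hypothetical example, the 384 dimensions match the embeddings used elsewhere on this page, and rabitq is assumed to store 1 bit per dimension, which is what a 32× reduction from float32 implies. The estimate covers raw embedding bytes only and ignores metadata, other columns, and any per-vector quantization parameters.

```python
# Bits per dimension for each precision (rabitq assumed to be 1 bit).
BITS_PER_DIM = {"float32": 32, "sq8": 8, "rabitq": 1}


def embedding_bytes(rows: int, dim: int, precision: str) -> int:
    """Estimate raw embedding storage in bytes for one embedding column."""
    return rows * dim * BITS_PER_DIM[precision] // 8


ROWS = 1_000_000  # hypothetical dataset size
DIM = 384         # embedding width used in this page's examples

sizes = {p: embedding_bytes(ROWS, DIM, p) for p in BITS_PER_DIM}
for precision, size in sizes.items():
    ratio = sizes["float32"] // size
    print(f"{precision:8s} {size / 1e6:8.1f} MB  ({ratio}x reduction)")
```

For one million 384-dimensional rows this works out to about 1536 MB for float32, 384 MB for sq8, and 48 MB for rabitq.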
Tip
float32 is the right choice unless dataset size is a constraint. Use sq8 when size matters and you can tolerate slight quality loss; reserve rabitq for extreme compression scenarios.
PyTorch Integration
The Lance PyTorch integration provides an IterableDataset with zero-copy memory-mapped access. This means the training data is read directly from the Lance files without loading the entire dataset into RAM.
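Because an IterableDataset is consumed as a stream, PyTorch's DataLoader does not accept shuffle=True for it. One common workaround is a bounded shuffle buffer, sketched below in plain Python. The function name shuffle_buffer, the buffer size, and the integer stream standing in for dataset rows are illustrative assumptions, not part of GalaxDB or Lance.

```python
import random
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def shuffle_buffer(stream: Iterable[T], buffer_size: int, seed: int) -> Iterator[T]:
    """Yield every item of `stream` exactly once, in a locally shuffled order."""
    rng = random.Random(seed)  # a fixed seed makes the order reproducible
    buffer: List[T] = []
    for item in stream:
        buffer.append(item)
        if len(buffer) >= buffer_size:
            # Swap a random element to the end and emit it.
            idx = rng.randrange(len(buffer))
            buffer[idx], buffer[-1] = buffer[-1], buffer[idx]
            yield buffer.pop()
    # Drain whatever is left once the stream is exhausted.
    rng.shuffle(buffer)
    yield from buffer


shuffled = list(shuffle_buffer(range(20), buffer_size=8, seed=42))
print(shuffled)
```

A larger buffer gives a more thorough shuffle at the cost of memory; once buffer_size reaches the number of rows, the result is a full shuffle.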
import galaxdb
import lance
import torch
from torch.utils.data import DataLoader
db = galaxdb.Database("./data")
# Create snapshot with deduplication
db.execute("""
CREATE VERSION TAG 'train-deduped'
FOR TRAINING
WITH TRAINING PRECISION 'float32'
TRAINING SEED 42
""")
# Export
path = db.training_dataset("train-deduped")
# Load with Lance
ds = lance.dataset(path)
print(f"Rows: {ds.count_rows()}")
print(f"Schema: {ds.schema}")
# PyTorch DataLoader
pytorch_ds = ds.to_pytorch()
loader = DataLoader(pytorch_ds, batch_size=64, num_workers=4)

for epoch in range(10):
    for batch in loader:
        embeddings = batch['body']  # shape: [64, 384]
        labels = batch['label']     # shape: [64]
        # ... training step

The create_training_snapshot Python method is a convenience wrapper that creates the version tag and returns the timestamp:
# Equivalent to CREATE VERSION TAG ... FOR TRAINING
ts = db.create_training_snapshot('train-v1', seed=42)
print(f"Snapshot at timestamp: {ts}")
path = db.training_dataset('train-v1')
# path: ./data/training_exports/train-v1_1715385600000000/
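
The directory name in the comment above follows the <tag>_<timestamp> pattern, so the tag and timestamp can be recovered from an export path. The helper below is a hypothetical sketch, not a GalaxDB API, and it assumes the timestamp counts microseconds since the Unix epoch, which is inferred from the magnitude of the example value rather than stated by the documentation.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath
from typing import Tuple


def parse_export_path(path: str) -> Tuple[str, int]:
    """Split '<...>/<tag>_<timestamp>/' into (tag, timestamp)."""
    name = PurePosixPath(path).name        # a trailing slash is ignored
    tag, _, raw_ts = name.rpartition("_")  # tags may themselves contain underscores
    if not tag or not raw_ts.isdigit():
        raise ValueError(f"not a training export directory: {path!r}")
    return tag, int(raw_ts)


tag, ts = parse_export_path("./data/training_exports/train-v1_1715385600000000/")

# Assumption: the timestamp is in microseconds since the Unix epoch.
snapshot_time = datetime.fromtimestamp(ts / 1_000_000, tz=timezone.utc)
print(tag, ts, snapshot_time.isoformat())
```

Splitting on the last underscore keeps tags such as train_v1 intact.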