
Training Export

GalaxDB can export versioned datasets in Lance format, a columnar format optimized for ML training. One SQL statement creates the snapshot, and one Python call exports it as a PyTorch-ready dataset with zero-copy, memory-mapped access.

Overview

The training export workflow:

  1. Create a version tag with FOR TRAINING
  2. Call db.training_dataset('tag') to export the Lance dataset
  3. Load the dataset into PyTorch with lance.dataset(path).to_pytorch()

SQL
-- Step 1: Create training snapshot
CREATE VERSION TAG 'train-v1'
  FOR TRAINING
  WITH TRAINING PRECISION 'float32'
  TRAINING SEED 42;

Python
import galaxdb
import lance
import torch

db = galaxdb.Database("./data")

# Step 2: Export as Lance dataset
path = db.training_dataset("train-v1")
print(f"Dataset at: {path}")

# Step 3: Load into PyTorch (zero-copy, memory-mapped)
dataset = lance.dataset(path).to_pytorch()
# The dataset is iterable-style, and DataLoader rejects shuffle=True for those
loader = torch.utils.data.DataLoader(dataset, batch_size=32)

for batch in loader:
    # batch is a dict of tensors, one per column
    ids = batch['id']
    embeddings = batch['body']  # float32[384] per row
    # ... train your model

Lance Format

Lance is a columnar format designed for ML workloads. Key properties:

  • Zero-copy reads: PyTorch tensors are memory-mapped directly from disk — no deserialization overhead
  • Columnar layout: read only the columns you need (e.g., just embeddings, not text)
  • Random access: O(1) row access by index, whereas Parquet generally has to decode a whole page or row group to reach a single row
  • Versioned: each tagged snapshot is exported to its own Lance dataset directory
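
To make the random-access property concrete, here is a minimal standard-library sketch of index-based reads over a memory-mapped file of fixed-width float32 rows. It is not Lance's actual on-disk layout; the file name, DIM, and both helper functions are hypothetical and exist only for illustration. The point is that reaching row i is one offset computation, and the OS pages in only the bytes that are touched.

```python
import mmap
import os
import struct
import tempfile

DIM = 4                 # floats per row (tiny, for illustration)
ROW_BYTES = DIM * 4     # float32 is 4 bytes per value


def write_rows(path, rows):
    """Write rows as packed little-endian float32, one fixed-width record each."""
    with open(path, "wb") as f:
        for row in rows:
            f.write(struct.pack(f"<{DIM}f", *row))


def read_row(mm, index):
    """Fetch one row by index: a single offset computation, no scan."""
    offset = index * ROW_BYTES
    # unpack_from reads straight out of the mapped buffer; only the final
    # conversion to Python floats copies data.
    return struct.unpack_from(f"<{DIM}f", mm, offset)


rows = [[float(i + j) for j in range(DIM)] for i in range(1000)]
path = os.path.join(tempfile.mkdtemp(), "vectors.bin")
write_rows(path, rows)

with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
    row_742 = read_row(mm, 742)

print(row_742)
```

A real columnar format adds metadata, variable-width columns, and compression on top of this, but the access pattern is the same idea.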

The exported dataset lives at <database>/training_exports/<tag>_<timestamp>/. Repeat calls with the same tag overwrite the previous export.
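
The directory naming above can be unpacked with a few lines of standard-library Python. This is a sketch only: parse_export_dir and as_utc are hypothetical helpers, not part of the galaxdb package, and reading the timestamp as microseconds since the Unix epoch is an assumption based on its magnitude.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath


def parse_export_dir(path):
    """Split a '<tag>_<timestamp>' directory name into (tag, timestamp).

    rpartition splits on the last underscore, so tags that themselves
    contain underscores stay intact.
    """
    name = PurePosixPath(path).name
    tag, _, raw_ts = name.rpartition("_")
    if not tag or not raw_ts.isdigit():
        raise ValueError(f"not an export directory name: {name!r}")
    return tag, int(raw_ts)


def as_utc(timestamp_us):
    """ASSUMPTION: the timestamp is microseconds since the Unix epoch."""
    return datetime.fromtimestamp(timestamp_us / 1_000_000, tz=timezone.utc)


tag, ts = parse_export_dir("./data/training_exports/train-v1_1715385600000000/")
print(tag, ts, as_utc(ts).isoformat())
```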

Precision Options

The WITH TRAINING PRECISION clause controls how embedding vectors are stored in the Lance dataset:

Precision   Bytes/dim   Size vs float32   Use case
float32     4           1× (baseline)     Full precision, default
sq8         1           4× smaller        8-bit scalar quantization
rabitq      1/8         32× smaller       Binary quantization (1 bit per dimension), maximum compression
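
For a rough sense of scale, the sketch below estimates the raw size of the embedding column at each precision. The row count is a made-up example, embedding_bytes is a hypothetical helper, and rabitq is taken as 1 bit per dimension (1/8 byte), which is what a 32× reduction from float32 implies. Any per-vector metadata that a quantized encoding stores is ignored, so real exports may be somewhat larger.

```python
from fractions import Fraction

# Bytes per dimension for each precision. rabitq is taken as 1 bit per
# dimension (1/8 byte), which is what "32x smaller than float32" implies.
BYTES_PER_DIM = {
    "float32": Fraction(4),
    "sq8": Fraction(1),
    "rabitq": Fraction(1, 8),
}


def embedding_bytes(rows, dims, precision):
    """Raw size of the embedding column, ignoring any per-vector metadata."""
    return int(rows * dims * BYTES_PER_DIM[precision])


# Hypothetical dataset: one million rows of 384-dimensional embeddings.
for precision in BYTES_PER_DIM:
    size = embedding_bytes(1_000_000, 384, precision)
    print(f"{precision:8s} {size / 1e6:8.1f} MB")
```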

Tip

For most training workloads, float32 is the right choice. Use sq8 when dataset size is a constraint and you can tolerate slight quality loss. rabitq is for extreme compression scenarios.
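
The "slight quality loss" of 8-bit scalar quantization can be seen in a round trip. The sketch below uses a generic per-vector min/max scheme; it is not necessarily how GalaxDB implements sq8, and sq8_encode and sq8_decode are hypothetical names. Each value is snapped to one of 256 levels, so the reconstruction error is bounded by half a quantization step.

```python
import random


def sq8_encode(vector):
    """Map each float to an integer code in 0..255 using the vector's own range."""
    lo, hi = min(vector), max(vector)
    scale = (hi - lo) / 255 or 1.0  # fall back to 1.0 for constant vectors
    codes = bytes(round((x - lo) / scale) for x in vector)
    return codes, lo, scale


def sq8_decode(codes, lo, scale):
    """Reconstruct approximate floats from the 8-bit codes."""
    return [lo + c * scale for c in codes]


rng = random.Random(42)
vector = [rng.uniform(-1.0, 1.0) for _ in range(384)]

codes, lo, scale = sq8_encode(vector)
restored = sq8_decode(codes, lo, scale)
max_error = max(abs(a - b) for a, b in zip(vector, restored))

print(f"{len(codes)} bytes instead of {4 * len(vector)}")
print(f"max round-trip error: {max_error:.5f} (half a step is {scale / 2:.5f})")
```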

PyTorch Integration

The Lance PyTorch integration provides an IterableDataset with zero-copy memory-mapped access. This means the training data is read directly from the Lance files without loading the entire dataset into RAM.
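
One consequence of an iterable-style dataset is that PyTorch's DataLoader does not accept shuffle=True for it, so any shuffling has to happen in the stream itself. A common general-purpose technique is a bounded shuffle buffer, sketched below in plain Python. shuffle_buffer is a hypothetical helper and not part of Lance or GalaxDB; check whether the Lance integration already provides its own shuffling or sampling before reaching for something like this.

```python
import random


def shuffle_buffer(stream, buffer_size, seed):
    """Yield items from `stream` in approximately random order.

    Holds at most `buffer_size` items in memory, so it works on streams
    that are too large to shuffle in full.
    """
    rng = random.Random(seed)
    buffer = []
    for item in stream:
        if len(buffer) < buffer_size:
            buffer.append(item)
            continue
        # Buffer is full: emit a random resident and put the new item in its slot.
        slot = rng.randrange(buffer_size)
        yield buffer[slot]
        buffer[slot] = item
    # Stream exhausted: drain what is left in random order.
    rng.shuffle(buffer)
    yield from buffer


rows = range(1000)  # stand-in for a stream of rows or batches
shuffled = list(shuffle_buffer(rows, buffer_size=128, seed=42))
print(shuffled[:10])
```

A larger buffer gives a better approximation of a full shuffle at the cost of memory; a fixed seed keeps the order reproducible across runs.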

Python
import galaxdb
import lance
import torch
from torch.utils.data import DataLoader

db = galaxdb.Database("./data")

# Create the training snapshot (this statement sets precision and seed only)
db.execute("""
    CREATE VERSION TAG 'train-deduped'
    FOR TRAINING
    WITH TRAINING PRECISION 'float32'
    TRAINING SEED 42
""")

# Export
path = db.training_dataset("train-deduped")

# Load with Lance
ds = lance.dataset(path)
print(f"Rows: {ds.count_rows()}")
print(f"Schema: {ds.schema}")

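# PyTorch DataLoader
pytorch_ds = ds.to_pytorch()
# With an iterable-style dataset each worker process gets its own replica,
# so confirm rows are sharded across workers before raising num_workers.
loader = DataLoader(pytorch_ds, batch_size=64, num_workers=4)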

for epoch in range(10):
    for batch in loader:
        embeddings = batch['body']  # shape: [64, 384]
        labels = batch['label']     # shape: [64]
        # ... training step

The create_training_snapshot Python method is a convenience wrapper that creates the version tag and returns the timestamp:

Python
# Equivalent to CREATE VERSION TAG ... FOR TRAINING
ts = db.create_training_snapshot('train-v1', seed=42)
print(f"Snapshot at timestamp: {ts}")

path = db.training_dataset('train-v1')
# path: ./data/training_exports/train-v1_1715385600000000/