LavenderDataLoader - Shuffle

Shuffling is essential for many ML training scenarios. Lavender Data lets you shuffle your data across shards, fix the random seed for reproducibility, and control how many shards are shuffled together.

Basic Shuffling

Set shuffle=True to shuffle the data.

dataloader = LavenderDataLoader(
    dataset_id=dataset.id,
    shardsets=[shardset.id],
    shuffle=True,
)

Controlling Randomness

To ensure reproducibility, you can set a seed for the shuffle.

dataloader = LavenderDataLoader(
    dataset_id=dataset.id,
    shardsets=[shardset.id],
    shuffle=True,
    shuffle_seed=42,  # Fixed seed for reproducibility
)
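
To see why a fixed seed makes runs repeatable, here is a minimal standalone sketch using only Python's standard library. It does not call Lavender Data, and the helper name shuffled_indices is hypothetical; it only illustrates the general idea that a seeded random generator produces the same order every time.

```python
import random
from typing import List


def shuffled_indices(n: int, seed: int) -> List[int]:
    """Return the indices 0..n-1 in an order determined entirely by `seed`."""
    indices = list(range(n))
    # A dedicated Random instance keeps the result independent of global state.
    random.Random(seed).shuffle(indices)
    return indices


# Two runs with the same seed produce the same order.
run_1 = shuffled_indices(10, seed=42)
run_2 = shuffled_indices(10, seed=42)
print(run_1 == run_2)
```

Changing the seed generally changes the order, which is useful when you want a different but still reproducible shuffle per experiment.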

Block Size Control

The shuffle_block_size parameter controls how many shards are shuffled at once. Larger values provide better randomness but use more memory:

dataloader = LavenderDataLoader(
    dataset_id=dataset.id,
    shardsets=[shardset.id],
    shuffle=True,
    shuffle_seed=42,
    shuffle_block_size=3,  # Shuffle 3 shards at a time
)
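
The trade-off between randomness and memory can be illustrated with a small standalone sketch of block-wise shuffling. This is an assumption about the general technique, not Lavender Data's actual implementation: the function block_shuffle and the toy shards below are hypothetical, and the real loader may order or mix samples differently.

```python
import random
from typing import Iterator, List, Sequence


def block_shuffle(
    shards: Sequence[Sequence[int]],
    block_size: int,
    seed: int,
) -> Iterator[int]:
    """Yield samples, shuffling `block_size` shards at a time (illustrative only)."""
    if block_size < 1:
        raise ValueError("block_size must be at least 1")
    rng = random.Random(seed)
    # Shuffle the shard order first so different seeds group shards differently.
    order = list(range(len(shards)))
    rng.shuffle(order)
    for start in range(0, len(order), block_size):
        # Hold one block of shards in memory; memory use grows with block_size.
        block: List[int] = []
        for shard_index in order[start:start + block_size]:
            block.extend(shards[shard_index])
        # Samples mix only with other samples from the same block.
        rng.shuffle(block)
        yield from block


# Six hypothetical shards holding four sample indices each.
shards = [list(range(i * 4, (i + 1) * 4)) for i in range(6)]
epoch_a = list(block_shuffle(shards, block_size=3, seed=42))
epoch_b = list(block_shuffle(shards, block_size=3, seed=42))
print(epoch_a == epoch_b)
```

In this sketch, a block size of 1 only shuffles samples within each shard, while a block size equal to the number of shards is a full shuffle that holds every shard in memory at once.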