LavenderDataLoader - Categorizer

A categorizer groups samples into batches by a given criterion.

For example, if your dataset contains images with different aspect ratios, you can use a categorizer so that every batch contains only images with the same aspect ratio.

On the server side, you can define a categorizer like this:

from lavender_data.server import Categorizer


class AspectRatioCategorizer(Categorizer, name="aspect_ratio"):
    def categorize(self, sample: dict) -> str:
        # Samples that return the same string end up in the same batch.
        return f"{sample['width']}x{sample['height']}"
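
Note that the key returned above is the exact resolution, so 1280x720 and 1920x1080 samples fall into different categories even though both are 16:9. If grouping by the reduced ratio is what you want, the key could be computed roughly as in the sketch below. `aspect_ratio_key` is a hypothetical standalone helper for illustration, not part of lavender_data, and it assumes that `width` and `height` are positive integers.

```python
from math import gcd


def aspect_ratio_key(sample: dict) -> str:
    """Return the reduced width:height ratio of a sample as a string key.

    Hypothetical helper; assumes positive integer "width" and "height"
    fields, as in the categorizer example.
    """
    width, height = sample["width"], sample["height"]
    divisor = gcd(width, height)
    return f"{width // divisor}:{height // divisor}"


print(aspect_ratio_key({"width": 1280, "height": 720}))   # 16:9
print(aspect_ratio_key({"width": 1920, "height": 1080}))  # 16:9
print(aspect_ratio_key({"width": 1024, "height": 768}))   # 4:3
```

With a key like this, samples in one batch can still differ in resolution, so they may need resizing before they can be stacked into a single tensor.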

On the client side, you can use the categorizer like this:

dataloader = LavenderDataLoader(
    dataset_id=dataset.id,
    shardsets=[shardset.id],
    categorizer="aspect_ratio",
    batch_size=10,
)
batch_1 = next(dataloader)
# batch_1["width"] == [1280, 1280, 1280, ...]
# batch_1["height"] == [720, 720, 720, ...]
batch_2 = next(dataloader)
# batch_2["width"] == [1920, 1920, 1920, ...]
# batch_2["height"] == [1080, 1080, 1080, ...]
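
To make the grouping behavior more concrete, here is a minimal, self-contained sketch of category-based batching in plain Python. It only illustrates the idea and is not how Lavender Data implements it; `batch_by_category` is a hypothetical function, and details such as the handling of incomplete batches are assumptions that may differ on the real server.

```python
from collections import defaultdict
from typing import Callable, Iterable, Iterator


def batch_by_category(
    samples: Iterable[dict],
    categorize: Callable[[dict], str],
    batch_size: int,
) -> Iterator[list[dict]]:
    """Yield batches whose samples all share the same category key."""
    buckets: dict[str, list[dict]] = defaultdict(list)
    for sample in samples:
        key = categorize(sample)
        buckets[key].append(sample)
        if len(buckets[key]) == batch_size:
            # The bucket is full: emit it and start a fresh one for this key.
            yield buckets.pop(key)
    # Incomplete buckets left at the end are dropped in this sketch.


samples = [
    {"width": 1280, "height": 720},
    {"width": 1920, "height": 1080},
    {"width": 1280, "height": 720},
    {"width": 1920, "height": 1080},
    {"width": 1280, "height": 720},
]
batches = list(
    batch_by_category(
        samples,
        lambda s: f"{s['width']}x{s['height']}",
        batch_size=2,
    )
)
for batch in batches:
    print([(s["width"], s["height"]) for s in batch])
```

Each emitted batch contains a single resolution, which mirrors the `batch_1` and `batch_2` comments above. The fifth sample never fills a bucket and is therefore not emitted.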