
Core Concepts

Dataset Server

Preprocessing your data on the training nodes can take a lot of time and resources.

[Figure: Client-Server before]

Lavender Data introduces a client-server architecture to offload data preprocessing from your training pipeline.

[Figure: Client-Server]

The server is responsible for:

  1. Managing dataset and shardset metadata
  2. Managing iterations and determining which samples to load
  3. Preprocessing and caching data
  4. Serving preprocessed data to trainer nodes
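
The four responsibilities above can be pictured with a small in-process stand-in. This is only a sketch to illustrate the division of labor: `ToyDatasetServer` and all of its methods are hypothetical names, not Lavender Data's API, and a real server would run as a separate process reached over the network.

```python
import random
from typing import Any, Callable, Dict, List, Optional, Tuple

Sample = Dict[str, Any]


class ToyDatasetServer:
    """Hypothetical stand-in for a dataset server (not the Lavender Data API)."""

    def __init__(self, preprocess: Callable[[Sample], Sample]) -> None:
        self.preprocess = preprocess
        self.datasets: Dict[str, List[Sample]] = {}              # 1. dataset registry
        self.iterations: Dict[str, Tuple[str, List[int]]] = {}   # 2. sample order per iteration
        self.cache: Dict[Tuple[str, int], Sample] = {}           # 3. preprocessed samples

    def register_dataset(self, name: str, samples: List[Sample]) -> None:
        self.datasets[name] = list(samples)

    def create_iteration(self, name: str, shuffle_seed: Optional[int] = None) -> str:
        # The server, not the trainer, decides which sample comes at which step.
        order = list(range(len(self.datasets[name])))
        if shuffle_seed is not None:
            random.Random(shuffle_seed).shuffle(order)
        iteration_id = f"iteration-{len(self.iterations)}"
        self.iterations[iteration_id] = (name, order)
        return iteration_id

    def next_sample(self, iteration_id: str, step: int) -> Sample:
        # 4. Serve the sample for this step, preprocessing it at most once.
        name, order = self.iterations[iteration_id]
        key = (name, order[step])
        if key not in self.cache:
            self.cache[key] = self.preprocess(self.datasets[name][key[1]])
        return self.cache[key]


# Usage: the "trainer" only asks for samples and never runs preprocessing itself.
server = ToyDatasetServer(preprocess=lambda s: {**s, "length": len(s["text"])})
server.register_dataset("captions", [{"text": "a cat"}, {"text": "a small dog"}])
iteration_id = server.create_iteration("captions")  # no shuffle: original order
first = server.next_sample(iteration_id, 0)
```

Keeping the sample order and the cache on the server side means every trainer node sees a consistent view of the iteration without repeating the preprocessing work.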

Shardset

Normally, a dataset is a single file containing all the data, or multiple files (a.k.a. shards) with the same schema.

[Figure: Shardset before]

This becomes a problem when you want to add a new feature, or when you want to skip loading a large column.

  1. I just want to add this tiny feature

    What if you want to add a 10 MB column but the Parquet files are 100 GB?

  2. I don’t need that column

    What if you don’t need to load caption_embeddings and it accounts for 99% of the file size?

Lavender Data introduces a shardset layer to solve this problem. A shardset is a collection of related columns (features) stored as shards (files).

[Figure: Shardset after]

  1. I just want to add this tiny feature

    Add a new shardset!

  2. I don’t need that column

    Use only the shardsets you need during training!
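
The two scenarios can be sketched in plain Python. This is a toy model under stated assumptions, not Lavender Data's API: each shard is modeled as a list of rows rather than a file, the `uid` join key is assumed for illustration, and `add_shardset`, `load_samples`, and the `aesthetic_score` column are hypothetical names.

```python
from typing import Any, Dict, Iterable, List

Row = Dict[str, Any]

# Each shardset holds one group of columns, split across shards (files).
# Assumption: every shardset carries a shared "uid" column to join rows per sample.
shardsets: Dict[str, List[List[Row]]] = {
    "main": [
        [{"uid": 0, "image_url": "a.jpg", "caption": "a cat"}],
        [{"uid": 1, "image_url": "b.jpg", "caption": "a dog"}],
    ],
    "caption_embeddings": [
        [
            {"uid": 0, "caption_embeddings": [0.1, 0.2]},
            {"uid": 1, "caption_embeddings": [0.3, 0.4]},
        ],
    ],
}


def add_shardset(name: str, shards: List[List[Row]]) -> None:
    """Add new columns as a separate shardset; existing shards stay untouched."""
    shardsets[name] = shards


def load_samples(selected: Iterable[str], key: str = "uid") -> List[Row]:
    """Read only the selected shardsets and merge their rows by the join key."""
    merged: Dict[Any, Row] = {}
    for name in selected:
        for shard in shardsets[name]:
            for row in shard:
                merged.setdefault(row[key], {}).update(row)
    return [merged[k] for k in sorted(merged)]


# 1. "I just want to add this tiny feature": write one small new shardset.
add_shardset("aesthetic_score", [[
    {"uid": 0, "aesthetic_score": 0.9},
    {"uid": 1, "aesthetic_score": 0.4},
]])

# 2. "I don't need that column": leave caption_embeddings out of the selection.
samples = load_samples(["main", "aesthetic_score"])
```

In this sketch the large `caption_embeddings` shards are never read, and adding `aesthetic_score` did not require rewriting the `main` shards.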