
Lavender Data

Load & evolve datasets efficiently

Lavender Data is a data pipeline framework built by fal.ai.

Joinable Dataset

  • Add new features to your dataset without rewriting your data
  • Selectively load only the features you need for your task (see the sketch below)
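
A conceptual sketch of the joinable-dataset idea in plain Python with pyarrow, not Lavender Data's actual storage format or API: new features live in their own files, keyed by a shared id, and are joined to the original data at load time, so existing shards are never rewritten. The file and column names below are made up.

```python
# Conceptual illustration only; the file layout and join-by-key scheme are
# assumptions, not Lavender Data's actual format or API.
import pyarrow.parquet as pq

# Original shard, written once and never rewritten (columns: id, image_url).
images = pq.read_table("images_shard_0.parquet")

# New feature set added later, stored in its own file and keyed by the same id
# (columns: id, caption).
captions = pq.read_table("captions_shard_0.parquet")

# Join at load time and select only the features the current task needs.
joined = images.join(captions, keys="id")
rows = joined.select(["image_url", "caption"]).to_pylist()
```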

Remote Preprocessing

  • Preprocess data on a remote server to keep your training GPUs free for training
  • Load data directly into memory over the network, with no disk usage (sketched after this list)
  • Load from cloud storage
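
A rough sketch of the remote-preprocessing pattern, assuming a hypothetical HTTP server that serves already-preprocessed batches; Lavender Data's own client/server interface will differ. The point is that CPU-heavy work runs on another machine and batches arrive over the network straight into memory, never touching local disk.

```python
# Hypothetical endpoint, URL, and payload format for illustration;
# this is not Lavender Data's API.
import io

import requests
import torch

PREPROCESS_SERVER = "http://preprocess-node:8000"  # assumed remote server

def fetch_batch(step: int) -> dict:
    """Pull one preprocessed batch over the network, straight into memory."""
    resp = requests.get(f"{PREPROCESS_SERVER}/batches/{step}", timeout=30)
    resp.raise_for_status()
    # Assume the server returns a dict of tensors serialized with torch.save.
    return torch.load(io.BytesIO(resp.content))

for step in range(1_000):
    batch = fetch_batch(step)  # preprocessing already happened on the remote machine
    # ... run the training step on the GPU with `batch` here ...
```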

Dynamic Data Loading

  • Filter rows or columns on the fly
  • Resume an iteration from where you left off
  • Retry or skip failed samples for fault-tolerant loading (sketched after this list)
  • Shuffle data across shards
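
A simplified sketch of what on-the-fly filtering, skip-on-failure, and resumable iteration mean in practice, written as a plain Python generator with made-up names rather than Lavender Data's actual loader interface.

```python
# Simplified sketch; names and structure are assumptions, not the real API.
from typing import Callable, Iterable, Iterator

def dynamic_iter(
    samples: Iterable[dict],
    keep: Callable[[dict], bool],  # row filter applied on the fly
    load: Callable[[dict], dict],  # decode/preprocess; may raise on bad samples
    start_index: int = 0,          # resume point saved from a previous run
) -> Iterator[tuple[int, dict]]:
    for index, sample in enumerate(samples):
        if index < start_index:    # fast-forward to where the last run stopped
            continue
        if not keep(sample):       # drop filtered rows on the fly
            continue
        try:
            yield index, load(sample)  # caller persists `index` to resume later
        except Exception:
            continue                   # skip failed samples instead of crashing

# Usage: keep only long captions and resume from step 1200 of an earlier run.
# for index, item in dynamic_iter(rows, lambda r: len(r["caption"]) > 16, decode, 1200):
#     train_step(item)
```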

Web UI

  • Define and preview your datasets
  • Track the realtime progress of your iterations

Why Lavender Data?

ML data pipelines often face several challenges:

  1. Inflexible Dataset: Difficulty in adding new features or columns
  2. GPU Overhead: Preprocessing on the same GPU used for training
  3. Poor Fault Tolerance: Failing when a single sample errors out
  4. Disk Space Limitations: Having to store entire datasets on disk
  5. Inflexible Filtering: Difficulty in filtering data on the fly

Lavender Data solves these problems by providing a flexible, efficient, and robust solution for ML data management and preprocessing.

When do I need Lavender Data?

Lavender Data is designed to solve specific challenges in ML data pipelines. Here are the key scenarios where you should consider using Lavender Data.

  1. Dataset is constantly evolving: Add features without reprocessing and organize related features.
  2. GPU utilization is bottlenecked by preprocessing: Offload preprocessing to separate machines and run it in parallel with training.
  3. Working with large-scale datasets: Stream data without disk usage, and work with cloud storage.
  4. Need dynamic data filtering: Filter data during training without affecting batch sizes.
  5. Fault tolerance is critical: Resume interrupted iterations, handle failed samples gracefully (skip/retry).
  6. Distributed environments: Run data-parallel/context-parallel training.
  7. Need better visibility into your data pipeline: Monitor iteration progress, preview/inspect data.

When Lavender Data might NOT be needed

  1. Small datasets that fit comfortably in memory and don’t require complex preprocessing
  2. Static datasets that rarely change or add new features
  3. Simple preprocessing that is fast enough to perform on-the-fly without impacting GPU utilization
  4. Single-file datasets that don’t benefit from Lavender Data’s sharded architecture

Ready to get started? Check out our Quick Start Guide to begin using Lavender Data in your ML workflow.