
Lavender Data

Load & evolve datasets efficiently

Lavender Data is a data pipeline framework built by fal.ai.

Joinable Dataset

  • Add new features to your dataset without rewriting your data
  • Selectively load only the features you need for your task (see the sketch below)
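
A conceptual sketch of the joinable-dataset idea in plain Python with pyarrow, not Lavender Data's actual storage format or API: new features live in their own files, keyed by a shared id, and are joined to the original data at load time, so existing shards are never rewritten. The file and column names below are made up.

```python
# Conceptual illustration only; the file layout and join-by-key scheme are
# assumptions, not Lavender Data's actual format or API.
import pyarrow.parquet as pq

# Original shard, written once and never rewritten (columns: id, image_url).
images = pq.read_table("images_shard_0.parquet")

# New feature set added later, stored in its own file and keyed by the same id
# (columns: id, caption).
captions = pq.read_table("captions_shard_0.parquet")

# Join at load time and select only the features the current task needs.
joined = images.join(captions, keys="id")
rows = joined.select(["image_url", "caption"]).to_pylist()
```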

Remote Preprocessing

  • Preprocess data on a remote server to keep your training GPUs free for training
  • Load data directly into memory over the network, with no disk usage (sketched after this list)
  • Load from cloud storage
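
A rough sketch of the remote-preprocessing pattern, assuming a hypothetical HTTP server that serves already-preprocessed batches; Lavender Data's own client/server interface will differ. The point is that CPU-heavy work runs on another machine and batches arrive over the network straight into memory, never touching local disk.

```python
# Hypothetical endpoint, URL, and payload format for illustration;
# this is not Lavender Data's API.
import io

import requests
import torch

PREPROCESS_SERVER = "http://preprocess-node:8000"  # assumed remote server

def fetch_batch(step: int) -> dict:
    """Pull one preprocessed batch over the network, straight into memory."""
    resp = requests.get(f"{PREPROCESS_SERVER}/batches/{step}", timeout=30)
    resp.raise_for_status()
    # Assume the server returns a dict of tensors serialized with torch.save.
    return torch.load(io.BytesIO(resp.content))

for step in range(1_000):
    batch = fetch_batch(step)  # preprocessing already happened on the remote machine
    # ... run the training step on the GPU with `batch` here ...
```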

Dynamic Data Loading

  • Filter rows or columns on the fly
  • Resume an iteration from where you left off
  • Retry or skip failed samples for fault-tolerant loading (sketched after this list)
  • Shuffle data across shards
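
A simplified sketch of what on-the-fly filtering, skip-on-failure, and resumable iteration mean in practice, written as a plain Python generator with made-up names rather than Lavender Data's actual loader interface.

```python
# Simplified sketch; names and structure are assumptions, not the real API.
from typing import Callable, Iterable, Iterator

def dynamic_iter(
    samples: Iterable[dict],
    keep: Callable[[dict], bool],  # row filter applied on the fly
    load: Callable[[dict], dict],  # decode/preprocess; may raise on bad samples
    start_index: int = 0,          # resume point saved from a previous run
) -> Iterator[tuple[int, dict]]:
    for index, sample in enumerate(samples):
        if index < start_index:    # fast-forward to where the last run stopped
            continue
        if not keep(sample):       # drop filtered rows on the fly
            continue
        try:
            yield index, load(sample)  # caller persists `index` to resume later
        except Exception:
            continue                   # skip failed samples instead of crashing

# Usage: keep only long captions and resume from step 1200 of an earlier run.
# for index, item in dynamic_iter(rows, lambda r: len(r["caption"]) > 16, decode, 1200):
#     train_step(item)
```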

Web UI

  • Define and preview your datasets
  • Track the realtime progress of your iterations

Why Lavender Data?

ML data pipelines often face several challenges:

  1. Inflexible Dataset: Difficulty in adding new features or columns
  2. GPU Overhead: Preprocessing on the same GPU used for training
  3. Poor Fault Tolerance: Failing when a single sample errors out
  4. Disk Space Limitations: Having to store entire datasets on disk
  5. Inflexible Filtering: Difficulty in filtering data on the fly

Lavender Data solves these problems by providing a flexible, efficient, and robust solution for ML data management and preprocessing.

When do I need Lavender Data?

Lavender Data is designed to solve specific challenges in ML data pipelines. Here are the key scenarios where you should consider using Lavender Data.

  1. Dataset is constantly evolving: Add features without reprocessing and organize related features.
  2. GPU utilization is bottlenecked by preprocessing: Offload preprocessing to separate machines and run it in parallel with training.
  3. Working with large-scale datasets: Stream data without disk usage, and work with cloud storage.
  4. Need dynamic data filtering: Filter data during training without affecting batch sizes.
  5. Fault tolerance is critical: Resume interrupted iterations, handle failed samples gracefully (skip/retry).
  6. Distributed environments: Run data-parallel/context-parallel training.
  7. Need better visibility into your data pipeline: Monitor iteration progress, preview/inspect data.

When Lavender Data might NOT be needed

  1. Small datasets that fit comfortably in memory and don’t require complex preprocessing
  2. Static datasets that rarely change or add new features
  3. Simple preprocessing that is fast enough to perform on-the-fly without impacting GPU utilization
  4. Single-file datasets that don’t benefit from Lavender Data’s sharded architecture

Ready to get started? Check out our Quick Start Guide to begin using Lavender Data in your ML workflow.