Lavender Data is a data pipeline framework built by fal.ai.
Joinable Dataset
- Add new features to your dataset without rewriting your data
- Selectively load only the features you need for your task
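As a rough illustration of the joinable idea, here is a minimal Python sketch. It is not Lavender Data's API: the stores, the `uid` key, and `load_rows` are hypothetical names invented for this example. Each group of features lives in its own store keyed by a shared ID, so adding a feature means adding a store rather than rewriting existing ones, and the loader reads only the columns it is asked for.

```python
from typing import Any

# Hypothetical feature stores, each kept separately and keyed by a shared "uid".
images = {"a": {"image_url": "img/a.png"}, "b": {"image_url": "img/b.png"}}
captions = {"a": {"caption": "a cat"}, "b": {"caption": "a dog"}}
# A feature added later; the two stores above are left untouched.
aesthetic = {"a": {"aesthetic_score": 0.9}, "b": {"aesthetic_score": 0.4}}

# Which store owns which column (an assumption of this sketch).
COLUMN_STORES: dict[str, dict[str, dict[str, Any]]] = {
    "image_url": images,
    "caption": captions,
    "aesthetic_score": aesthetic,
}


def load_rows(columns: list[str]) -> list[dict[str, Any]]:
    """Join the requested columns on uid, reading only the stores that own them."""
    rows = []
    for uid in sorted(images):
        row: dict[str, Any] = {"uid": uid}
        for column in columns:
            row[column] = COLUMN_STORES[column][uid][column]
        rows.append(row)
    return rows


# Only the caption and score stores are read; the image store is never touched.
rows = load_rows(["caption", "aesthetic_score"])
```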
Remote Preprocessing
- Preprocess data on a remote server and offload your training GPUs
- Load data directly into memory through a network without any disk usage
- Support cloud storage
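The benefit of moving preprocessing off the training process can be sketched locally. In the example below a worker thread stands in for the remote server, which is an assumption made only to keep the example self-contained; `prefetch` and its parameters are hypothetical and not part of Lavender Data. Preprocessed samples are handed over through a bounded in-memory buffer, so the consumer never reads from disk.

```python
import queue
import threading
from typing import Any, Callable, Iterable, Iterator

_DONE = object()  # sentinel marking the end of the stream


def prefetch(
    indices: Iterable[int],
    preprocess: Callable[[int], Any],
    buffer_size: int = 4,
) -> Iterator[Any]:
    """Preprocess in a worker thread and yield results, in order, from memory."""
    buffer: queue.Queue = queue.Queue(maxsize=buffer_size)

    def worker() -> None:
        try:
            for index in indices:
                buffer.put(preprocess(index))  # blocks while the buffer is full
        finally:
            buffer.put(_DONE)  # always signal the end so the consumer can stop

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = buffer.get()
        if item is _DONE:
            return
        yield item


# The consumer (the training loop, in practice) only sees ready-made samples.
results = list(prefetch(range(5), lambda i: i * i))
```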
Dynamic Data Loading
- Filter rows or columns on the fly
- Resume an iteration from where you left off
- Retry or skip failed samples so that one bad sample does not stop the iteration
- Shuffle data across shards
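Resuming and retry/skip can be pictured with a small Python sketch. This is an illustration under stated assumptions, not how Lavender Data implements them: `iterate`, `process`, and the checkpoint convention are hypothetical. The order is shuffled with a fixed seed so that a position in the order is a valid checkpoint, and a sample that keeps failing is skipped after its retries run out.

```python
import random
from typing import Any, Callable, Iterator


def iterate(
    num_samples: int,
    process: Callable[[int], Any],
    *,
    seed: int = 0,
    start: int = 0,
    max_retries: int = 1,
) -> Iterator[tuple[int, Any]]:
    """Yield (position, sample) pairs in a seeded shuffled order, from `start`."""
    order = list(range(num_samples))
    random.Random(seed).shuffle(order)  # same seed gives the same order on resume
    for position in range(start, num_samples):
        index = order[position]
        for _attempt in range(max_retries + 1):
            try:
                sample = process(index)
            except ValueError:
                continue  # retry; once attempts run out the sample is skipped
            yield position, sample
            break


def process(index: int) -> int:
    """Hypothetical per-sample work; sample 3 always fails."""
    if index == 3:
        raise ValueError("corrupt sample")
    return index * 10


full = list(iterate(6, process, seed=42))

# Simulate an interruption after two yielded samples, then resume from there.
checkpoint = full[1][0] + 1
resumed = list(iterate(6, process, seed=42, start=checkpoint))
```

Here `resumed` picks up exactly where the first pass stopped, and the failing sample is absent from both runs.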
Web UI
- Define and preview your datasets
- Track the real-time progress of your iterations
Why Lavender Data?
ML data pipelines often face several challenges:
- Inflexible Dataset: Difficulty in adding new features or columns
- GPU Overhead: Preprocessing on the same GPU used for training
- Poor Fault Tolerance: Failing when a single sample errors out
- Disk Space Limitations: Having to store entire datasets on disk
- Online Filtering: Difficulty in filtering data on the fly
Lavender Data addresses these problems with joinable datasets, remote preprocessing, and fault-tolerant dynamic data loading.
When do I need Lavender Data?
Consider using Lavender Data in the following scenarios:
- Dataset is constantly evolving: Add features without reprocessing and organize related features.
- Preprocessing bottlenecks GPU utilization: Offload preprocessing to separate machines and run it in parallel with training.
- Working with large-scale datasets: Stream data without disk usage, and work with cloud storage.
- Need dynamic data filtering: Filter data during training without affecting batch sizes.
- Fault tolerance is critical: Resume interrupted iterations, handle failed samples gracefully (skip/retry).
- Distributed environments: Run data-parallel/context-parallel training.
- Need better visibility into your data pipeline: Monitor iteration progress, preview/inspect data.
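One way to read "without affecting batch sizes" is that filtering happens before batching, so rejected samples are replaced by the next accepted ones instead of leaving gaps. The Python sketch below shows that reading; it is an assumption about the behavior, and `filtered_batches` is a hypothetical helper, not a Lavender Data function.

```python
from typing import Any, Callable, Iterable, Iterator


def filtered_batches(
    samples: Iterable[Any],
    keep: Callable[[Any], bool],
    batch_size: int,
) -> Iterator[list[Any]]:
    """Drop rejected samples first, then group the survivors into full batches."""
    batch: list[Any] = []
    for sample in samples:
        if not keep(sample):
            continue  # a filtered sample never occupies a slot in a batch
        batch.append(sample)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # only the final batch can be smaller than batch_size


# Drop multiples of 3 on the fly; every batch still holds two samples.
batches = list(filtered_batches(range(10), lambda x: x % 3 != 0, batch_size=2))
```

Filtering after batching would instead produce batches of uneven size, which is the situation this ordering avoids.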
When Lavender Data might NOT be needed
- Small datasets that fit comfortably in memory and don’t require complex preprocessing
- Static datasets that rarely change or gain new features
- Simple preprocessing that is fast enough to perform on-the-fly without impacting GPU utilization
- Single-file datasets that don’t benefit from Lavender Data’s sharded architecture
Ready to get started? Check out our Quick Start Guide to begin using Lavender Data in your ML workflow.