Release Notes

0.1.2 (2025-05-20)

Added

Preview media

Preview media files such as images and videos in the dataset preview.

Selectable columns

Select columns and sort them as you like.

HTTP Storage

Added HTTP/HTTPS storage support. Fetch files from HTTP/HTTPS URLs.

For example,

https://docs.lavenderdata.com/example-dataset/images/shard.00000.csv
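The example URL above can be read as a directory location plus a shard file name. As a minimal sketch of that relationship, the snippet below resolves a file name against a base HTTP(S) location using only Python's standard library. The helper name `shard_url` is hypothetical and is not part of lavender-data; it only illustrates how such URLs compose.

```python
from urllib.parse import urljoin


def shard_url(location: str, filename: str) -> str:
    """Resolve a shard file name against a base HTTP(S) location.

    Hypothetical helper for illustration; not part of lavender-data.
    """
    # urljoin replaces the last path segment unless the base ends with "/",
    # so normalize the base to always end with a slash first.
    base = location if location.endswith("/") else location + "/"
    return urljoin(base, filename)


print(shard_url("https://docs.lavenderdata.com/example-dataset/images", "shard.00000.csv"))
```

Normalizing the trailing slash means the same result is produced whether or not the location is written with a final `/`.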

shardset_location option for dataset creation

Added shardset_location option so you can create a dataset and the shardset at the same time.

lavender-data client \
datasets create \
--name my_dataset \
--uid-column-name id \
--shardset-location https://docs.lavenderdata.com/example-dataset/images/

0.1.1 (2025-05-19)

Added

overwrite, drop_last options for dataset preprocessing

Fixed

Background Preprocessing

Background preprocessing might not preserve the order of samples.

Only one worker is used for background preprocessing.
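One common way to keep sample order while using several workers is to collect results in submission order rather than completion order. The sketch below shows this with `concurrent.futures` from the standard library. It is illustrative only and does not claim to be how lavender-data implements background preprocessing; `preprocess` and `preprocess_in_order` are placeholder names.

```python
from concurrent.futures import ThreadPoolExecutor


def preprocess(sample: int) -> int:
    # Placeholder for a real preprocessing step.
    return sample * 2


def preprocess_in_order(samples, num_workers: int = 4):
    # Executor.map yields results in input order, even when
    # later samples finish before earlier ones.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        return list(pool.map(preprocess, samples))
```

Calling `preprocess_in_order([3, 1, 2])` returns the processed samples in the same order as the input list, regardless of which worker finishes first.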

Server Daemon

Server does not terminate gracefully when lavender-data server stop is called.

Server is not aware of whether all background workers are ready or have errored.

LavenderDataLoader

LavenderDataLoader unnecessarily calls the /version endpoint multiple times.
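Repeated lookups of a value that does not change during a session are often avoided by caching the first result. The sketch below shows that general pattern with `functools.lru_cache`. It is an assumption-laden illustration, not the actual fix in LavenderDataLoader: `fetch_version_from_server` is a stand-in for a real request to `/version`, and the returned string is a dummy value.

```python
from functools import lru_cache

# Counts how often the (simulated) endpoint is hit.
calls = {"count": 0}


def fetch_version_from_server() -> str:
    # Stand-in for an HTTP GET to /version; returns a dummy value.
    calls["count"] += 1
    return "dummy-version"


@lru_cache(maxsize=1)
def get_server_version() -> str:
    # The first call performs the fetch; later calls reuse the cached result.
    return fetch_version_from_server()
```

With this pattern, calling `get_server_version()` any number of times triggers the underlying fetch only once.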

0.1.0 (2025-05-15)

Added

Categorizer

Please refer here for more details.

Background preprocessing

Please refer here for more details.

0.0.13 (2025-05-12)

Added

Background worker

Heavy tasks such as shardset synchronization and sample processing now run in background workers so that they do not block the main process.

Added the LAVENDER_DATA_NUM_WORKERS environment variable to control the number of background workers.
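As a rough sketch of how a worker count might be read from an environment variable, the snippet below parses `LAVENDER_DATA_NUM_WORKERS` with a fallback. The default of 1 and the validation rules are assumptions made for illustration; the release notes do not state how the server actually handles missing or invalid values, and `num_workers_from_env` is a hypothetical name.

```python
import os


def num_workers_from_env(default: int = 1) -> int:
    """Read LAVENDER_DATA_NUM_WORKERS, falling back to `default`.

    The fallback and validation rules here are assumptions for illustration.
    """
    raw = os.environ.get("LAVENDER_DATA_NUM_WORKERS", "")
    try:
        value = int(raw)
    except ValueError:
        # Unset, empty, or non-numeric values fall back to the default.
        return default
    # Reject zero and negative counts.
    return value if value > 0 else default
```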

0.0.10 (2025-05-08)

Added

Delete dataset & shardset

Added delete_dataset and delete_shardset APIs, along with the corresponding delete actions in the UI.

0.0.9 (2025-05-02)

Added

Fault handling

Added fault handling features.

  • skip_on_failure option
  • max_retry_count option
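The two options above suggest retry-then-skip semantics. The sketch below shows one plausible reading: retry a failing sample up to `max_retry_count` times, then either skip it or re-raise depending on `skip_on_failure`. This is an assumption about the behavior, not lavender-data's implementation, and `process_all` is a hypothetical helper.

```python
def process_all(samples, process, max_retry_count=0, skip_on_failure=False):
    """Apply `process` to each sample with retry and optional skip.

    Illustrative only; the exact semantics in lavender-data may differ.
    """
    results = []
    for sample in samples:
        attempts = 0
        while True:
            try:
                results.append(process(sample))
                break
            except Exception:
                attempts += 1
                if attempts <= max_retry_count:
                    continue  # retry the same sample
                if skip_on_failure:
                    break  # give up on this sample and move on
                raise  # propagate the last error
    return results
```

With `max_retry_count=0` and `skip_on_failure=False`, the first error propagates immediately, which matches processing without any fault handling.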

0.0.8 (2025-04-30)

Added

Daemon commands

Added CLI commands: start, stop, restart, logs. These are daemon-related commands.

start launches the server daemon in the background. Use the stop and restart commands to stop or restart it.

lavender-data server start --init
> lavender-data is running on 0.0.0.0:8000
> UI is running on http://localhost:8000
> API key created: la-...

You can check the daemon's logs with the logs command.

lavender-data server logs -n 100
# print last 100 lines of the logs
lavender-data server logs -f
# wait for new logs and print them

Converters

A Converter is a tool that converts data from a source to a Lavender Data shardset.

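import webdataset as wds
# Assumption: `lavender` is the client module, imported as in the
# client SDK example in these notes.
import lavender_data.client as lavender

url = "https://storage.googleapis.com/webdataset/testdata/publaynet-train-{000000..000009}.tar"
# Decode image entries to PIL images while reading the tar shards.
pil_dataset = wds.WebDataset(url).decode("pil")

converter = lavender.Converter.get("webdataset")
converter.to_shardset(
    pil_dataset,
    "dataset_name",
    location="file:///path/to/shardset",
    uid_column_name="id",
)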

UI breadcrumb

Added breadcrumb navigation to the UI.

Fixed

Client sdk interface

Before

from lavender_data.client import api as lavender, Iteration

lavender.init(api_url="")
lavender.get_datasets()

for row in Iteration.from_dataset(...).to_torch_data_loader():
    ...

After

import lavender_data.client as lavender

lavender.init(api_url="")
lavender.api.get_datasets()
# or
lavender.get_client().get_datasets()
# or (in case you need to bypass `lavender.init()`)
lavender.LavenderDataClient(api_url="").get_datasets()

for row in lavender.LavenderDataLoader(...).torch():
    ...

0.0.7 (2025-04-21)

Added

Cluster

Added cluster features (LAVENDER_DATA_CLUSTER_* envs, cluster_sync option)

Fixed

Made the UI's API_URL environment variable a runtime setting so it can be injected when running lavender-data server run.

Fixed a bug where shards were not shuffled.