Dataset - Converters
A Converter is a tool that converts data from a source to a Lavender Data shardset.
import lavender_data.client as lavender

# client initialization is required
lavender.init()

converter = lavender.Converter.get("plain")
converter.to_shardset(
    iterable,
    "dataset_name",
    location="file:///path/to/shardset",
    uid_column_name="id",
    samples_per_shard=1000,  # optional, default is 1000
    max_shard_count=None,  # optional, default is None
)
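To make the samples_per_shard and max_shard_count options more concrete, the sketch below groups an iterable into fixed-size chunks in plain Python. It only illustrates the idea and is not Lavender Data's implementation: the helper name chunk_into_shards is hypothetical, and treating max_shard_count as a cap after which remaining samples are left unread is an assumption about what the option means.

```python
from itertools import islice
from typing import Iterable, Iterator, Optional


def chunk_into_shards(
    samples: Iterable[dict],
    samples_per_shard: int = 1000,
    max_shard_count: Optional[int] = None,
) -> Iterator[list]:
    """Yield lists of at most `samples_per_shard` samples (hypothetical helper)."""
    iterator = iter(samples)
    shard_index = 0
    # Assumption: once max_shard_count shards exist, the rest is not consumed.
    while max_shard_count is None or shard_index < max_shard_count:
        shard = list(islice(iterator, samples_per_shard))
        if not shard:
            break  # source exhausted
        yield shard
        shard_index += 1


source = ({"id": i} for i in range(2500))
shards = list(chunk_into_shards(source, samples_per_shard=1000))
shard_sizes = [len(shard) for shard in shards]
```

Under these assumptions, 2,500 samples with the default shard size end up in three groups, the last one partially filled.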
Plain
The plain converter is used to convert a generator/iterator of dictionaries to a Lavender Data shardset.
import csv

# keep the file open while the converter consumes the reader
with open("path/to/csv", newline="") as csv_file:
    csv_reader = csv.DictReader(csv_file)

    converter = lavender.Converter.get("plain")
    converter.to_shardset(
        csv_reader,
        "dataset_name",
        location="file:///path/to/shardset",
        uid_column_name="id",
    )
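Any generator of dictionaries can stand in for the CSV reader. The sketch below builds one and checks that the column intended for uid_column_name is present and unique before conversion. Both iter_samples and check_unique_ids are hypothetical helpers written for this example; whether Lavender Data itself enforces uniqueness is not stated here.

```python
from typing import Iterable, Iterator


def iter_samples(count: int) -> Iterator[dict]:
    """Yield one dictionary per sample, each with an "id" column."""
    for i in range(count):
        yield {"id": i, "text": f"sample {i}"}


def check_unique_ids(samples: Iterable[dict], uid_column_name: str = "id") -> list:
    """Materialize the samples; raise if the uid column is missing or repeated."""
    seen = set()
    checked = []
    for sample in samples:
        uid = sample[uid_column_name]  # raises KeyError if the column is missing
        if uid in seen:
            raise ValueError(f"duplicate {uid_column_name}: {uid!r}")
        seen.add(uid)
        checked.append(sample)
    return checked


samples = check_unique_ids(iter_samples(3))
```

The resulting list (or the generator itself, if no check is needed) would then be passed as the first argument to to_shardset.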
WebDataset
The webdataset converter is used to convert a WebDataset to a Lavender Data shardset.
import webdataset as wds

url = "https://storage.googleapis.com/webdataset/testdata/publaynet-train-{000000..000009}.tar"
pil_dataset = wds.WebDataset(url).decode("pil")

converter = lavender.Converter.get("webdataset")
converter.to_shardset(
    pil_dataset,
    "dataset_name",
    location="file:///path/to/shardset",
    uid_column_name="id",
)
Custom
You can also implement your own converter by subclassing the Converter class.
import lavender_data.client as lavender


class CustomConverter(lavender.Converter):
    def transform(self, sample: dict) -> dict:
        # Do something with the sample
        return sample


converter = CustomConverter()
converter.to_shardset(
    samples,
    "dataset_name",
    location="file:///path/to/shardset",
    uid_column_name="id",
)
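As an illustration of what a transform body might do, the function below cleans one sample in plain Python. The field names caption and caption_length are hypothetical, and the assumption that transform is called once per sample and returns the dictionary to be stored is inferred from the method signature above, not confirmed by it.

```python
def transform(sample: dict) -> dict:
    """Return a cleaned copy of the sample (illustrative logic only)."""
    cleaned = dict(sample)  # shallow copy so the input is not mutated
    caption = str(cleaned.get("caption", "")).strip()
    cleaned["caption"] = caption  # hypothetical text field
    cleaned["caption_length"] = len(caption)  # hypothetical derived field
    return cleaned


original = {"id": 7, "caption": "  a red flower  "}
result = transform(original)
```

Inside a Converter subclass, the same logic would go in the transform method, with self as the first parameter.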