Server - Background Preprocess

Start background preprocessing with the preprocess_dataset API. The preprocessed results are saved as a new shardset.

These are the parameters you need to specify:

Dataset ID: ID of the source dataset
Source shardset IDs: List of shardsets to preprocess
Preprocessors: List of preprocessors to apply
Batch size: Number of samples to process in each batch
Export columns: List of columns to export
Destination shardset location: Location to save the preprocessed shardset

Web UI

Click the “Preprocess” button on the dataset settings page.

The preprocess job is enqueued and executed in the background.

Python

import lavender_data.client as lavender

lavender.api.preprocess_dataset(
    # source dataset id
    dataset_id="dataset-id",
    # source shardset ids
    source_shardset_ids=["shardset-id-1", "shardset-id-2"],
    # preprocessors
    preprocessors=[
        lavender.api.IterationPreprocessor(
            name="umt5-encode",
            params={"model_name": "google/umt5-small"},
        )
    ],
    batch_size=8,
    export_columns=["text_embedding"],
    # destination shardset location
    shardset_location="file:///path/to/output/shardset",
)
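
The arguments in the call above can be collected and sanity-checked before the job is submitted, so that an obviously invalid request fails locally rather than after it has been enqueued. The following is a minimal sketch, not part of lavender_data: `PreprocessRequest`, `validate`, and `to_kwargs` are hypothetical names introduced for illustration, and the checks (non-empty values, positive batch size) are assumptions about what a reasonable request looks like, not rules documented by the server. The field names mirror the keyword arguments used in the example above.

```python
from __future__ import annotations

from dataclasses import dataclass, fields
from typing import Any


@dataclass
class PreprocessRequest:
    """Hypothetical container for preprocess_dataset arguments (not part of lavender_data)."""

    dataset_id: str
    source_shardset_ids: list[str]
    preprocessors: list[Any]
    batch_size: int
    export_columns: list[str]
    shardset_location: str

    def validate(self) -> None:
        # Assumed checks only; the server may enforce different or additional rules.
        if not self.dataset_id:
            raise ValueError("dataset_id must not be empty")
        if not self.source_shardset_ids:
            raise ValueError("at least one source shardset id is required")
        if not self.preprocessors:
            raise ValueError("at least one preprocessor is required")
        if self.batch_size <= 0:
            raise ValueError("batch_size must be a positive integer")
        if not self.export_columns:
            raise ValueError("at least one export column is required")
        if not self.shardset_location:
            raise ValueError("shardset_location must not be empty")

    def to_kwargs(self) -> dict[str, Any]:
        # Validate first, then return a plain dict keyed by field name.
        self.validate()
        return {f.name: getattr(self, f.name) for f in fields(self)}


request = PreprocessRequest(
    dataset_id="dataset-id",
    source_shardset_ids=["shardset-id-1", "shardset-id-2"],
    # Placeholder value; in practice this would hold preprocessor objects
    # such as the lavender.api.IterationPreprocessor shown above.
    preprocessors=[{"name": "umt5-encode"}],
    batch_size=8,
    export_columns=["text_embedding"],
    shardset_location="file:///path/to/output/shardset",
)
kwargs = request.to_kwargs()
```

With lavender_data installed and a server available, the resulting dictionary could presumably be passed on as `lavender.api.preprocess_dataset(**kwargs)`, since its keys match the keyword arguments in the example above.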