Skip to content

LavenderDataLoader - Cluster Sync

Lavender Data can be distributed to multiple servers to improve performance and scalability. This allows you to handle larger datasets and higher request loads by spreading the work across several machines.

Clients can interact with the cluster by connecting to any node (head or worker) using its URL. The cluster handles the internal routing and coordination.

if args.node == "head":
api_url = "http://localhost:8000" # Head node URL
elif args.node == "worker":
api_url = "http://localhost:8001" # Worker node URL
lavender.init(api_url=api_url)
# or
iteration = lavender.LavenderDataLoader(
dataset_name="my-dataset",
api_url=api_url, # you can also specify the api_url directly here
)

cluster_sync Option

The cluster_sync parameter in Iteration.from_dataset controls how iteration state is managed across the cluster nodes. Iteration state includes the index of the sample that is being processed, which rank the sample should be processed, and the shuffled order of the samples, etc.

Default value is True if cluster is enabled, and False otherwise.

iteration = lavender.LavenderDataLoader(
dataset_name="my-dataset",
cluster_sync=True, # or False
)

The behavior depends on whether the shard files are identical across all nodes or differ per node. Here’s a breakdown of the different scenarios:

cluster_syncUse multiple server nodes?Processing Behavior
False
(default if cluster is not enabled)
YesπŸ˜• Inefficient
Each shard may be processed multiple times.
False
(default if cluster is not enabled)
Noβœ… Recommended
Each shard processed exactly once per node.
True
(default if cluster is enabled)
Yesβœ… Recommended
Each shard processed exactly once cluster-wide.
True
(default if cluster is enabled)
NoπŸ˜• Inefficient
Unnecessary synchronization.

In summary:

  • Use cluster_sync=False when you use a single server node for the iteration
  • Use cluster_sync=True when you use multiple server nodes for the iteration

If cluster_sync=True, the head node becomes responsible for iteration state management. Thus, to start iteration with cluster_sync=True, you must include the head node in the iteration.