LavenderDataLoader - Cluster Sync
Lavender Data can be distributed to multiple servers to improve performance and scalability. This allows you to handle larger datasets and higher request loads by spreading the work across several machines.
Clients can interact with the cluster by connecting to any node (head or worker) using its URL. The cluster handles the internal routing and coordination.
if args.node == "head": api_url = "http://localhost:8000" # Head node URLelif args.node == "worker": api_url = "http://localhost:8001" # Worker node URL
lavender.init(api_url=api_url)# oriteration = lavender.LavenderDataLoader( dataset_name="my-dataset", api_url=api_url, # you can also specify the api_url directly here)
cluster_sync
Option
The cluster_sync
parameter in Iteration.from_dataset
controls
how iteration state is managed across the cluster nodes.
Iteration state includes the index of the sample that is being processed,
which rank the sample should be processed,
and the shuffled order of the samples, etc.
Default value is True
if cluster is enabled, and False
otherwise.
iteration = lavender.LavenderDataLoader( dataset_name="my-dataset", cluster_sync=True, # or False)
The behavior depends on whether the shard files are identical across all nodes or differ per node. Hereβs a breakdown of the different scenarios:
cluster_sync | Use multiple server nodes? | Processing Behavior |
---|---|---|
False (default if cluster is not enabled) | Yes | π Inefficient Each shard may be processed multiple times. |
False (default if cluster is not enabled) | No | β
Recommended Each shard processed exactly once per node. |
True (default if cluster is enabled) | Yes | β
Recommended Each shard processed exactly once cluster-wide. |
True (default if cluster is enabled) | No | π Inefficient Unnecessary synchronization. |
In summary:
- Use
cluster_sync=False
when you use a single server node for the iteration - Use
cluster_sync=True
when you use multiple server nodes for the iteration
If cluster_sync=True
, the head node becomes responsible for iteration state management.
Thus, to start iteration with cluster_sync=True
, you must include the head node in the iteration.