Dataset - Shardsets Constraints
In Lavender Data, different features (columns) can be organized into separate shardsets. This provides several advantages:
- Selective Loading: Load only the features you need for a specific task
- Easy Feature Addition: Add new features without modifying existing data
- Optimized Storage: Group related features together for efficient access
To archive the above goals, we have the following constraints.
To understand these constraints, let’s look at an example dataset with two shardsets, shardset_1
and shardset_2
.
shardset_1
has 100 shards and each shard has 100 samples.
shardset_2
has 100 shards and each shard has 90 samples.
Directory/shardset_1/
- shard.00000.csv
- shard.00001.csv
- …
Directory/shardset_2/
- shard.00000.csv
- shard.00001.csv
- …
Common UID column
Each sample in a dataset is identified by a unique ID column, specified when creating the dataset. This UID column must be present in all shardsets of a dataset. It is used to join data across different shardsets and maintains consistency when adding new features.
In above example, uid
is the common UID column and must be present in all shardsets.
uid,image_url0,https://example.com/image-00000.jpg1,https://example.com/image-00001.jpg2,https://example.com/image-00002.jpg
uid,caption0,Caption for image 000001,Caption for image 000012,Caption for image 00002
Sample-Shard Consistency
Because one sample can be present in multiple shards in different shardsets, lavender data needs to merge these shards to yield a sample. However, we don’t want to load all the shards into memory to read a single sample.
To solve this, lavender data uses a sample-shard mapping to map each sample to the shards it is present in.
uid | shard_index |
---|---|
0 | 0 |
1 | 0 |
… | … |
100 | 1 |
101 | 1 |
… | … |
In above example, sample with uid 0 must be present in shard 0 of both shardsets.
uid 0 -> shard 0 of shardset_1 -> /shardset_1/shard.00000.csv -> shard 0 of shardset_2 -> /shardset_2/shard.00000.csv
This way, we can read a sample by loading shards at most the number of shardsets into memory. This is possible only if the sample-shard mapping is consistent across all shardsets. Therefore, the sample-shard mapping must be consistent across all shardsets.
Though, some samples may not be present in all shardsets.
In above example, we assume that shardset_2
has only 90 samples while shardset_1
has 100.
Let’s say sample 90 - 99 are missing in shardset_2
.
Even in this case, the sample-shard mapping must be consistent like below.
uid | shard index in shardset_1 | shard index in shardset_2 |
---|---|---|
89 | 0 | 0 |
90 | 0 | (missing) |
… | … | … |
99 | 0 | (missing) |
100 | 1 | 1 |
101 | 1 | 1 |
… | … | … |