Dataset - Join Method
Lavender Data uses a flexible and efficient approach to join data across multiple shardsets. The system is designed to handle large datasets by breaking them into smaller, manageable pieces called shards, while maintaining the ability to combine data from different sources.
Main Shard and Feature Shards
The joining mechanism in Lavender Data is based on a two-level architecture:
- Main Shard: This is the primary data source that contains the core information and serves as the reference point for joining. The shardset with the most samples is chosen as the main shard. If there are multiple shardsets with the same number of samples, the one with the oldest creation date is chosen as the main shard.
- Feature Shards: These are additional data sources that contain supplementary information that can be joined with the main shard.
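The selection rule for the main shard can be sketched in a few lines of Python. This is a minimal illustration only: the `Shardset` record, its field names (`total_samples`, `created_at`), and the `pick_main_shardset` helper are hypothetical and are not taken from Lavender Data's API.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Shardset:
    # Hypothetical record; field names are assumptions, not Lavender Data's schema.
    name: str
    total_samples: int
    created_at: datetime


def pick_main_shardset(shardsets):
    """Return the shardset with the most samples; ties go to the oldest one."""
    if not shardsets:
        raise ValueError("at least one shardset is required")
    # Sort key: larger sample count first (negated), then earlier creation date.
    return min(shardsets, key=lambda s: (-s.total_samples, s.created_at))


shardsets = [
    Shardset("captions", 2, datetime(2024, 1, 1)),
    Shardset("images", 3, datetime(2024, 3, 1)),
    Shardset("scores", 3, datetime(2024, 2, 1)),
]
main = pick_main_shardset(shardsets)
print(main.name)  # scores
```

Here `images` and `scores` tie at three samples, so the older of the two (`scores`) becomes the main shard and the remaining shardsets act as feature shards.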
How Joining Works
The joining process follows these steps:
- Sample Selection: A sample is first retrieved from the main shard using its index.
- UID-based Joining: Each sample has a unique identifier (UID) that is used to join data from feature shards.
- Column Merging: Data from feature shards is merged with the main shard’s data based on matching UIDs.
- Outer Join: If a sample does not have a matching UID in a feature shard, the sample is padded with `None` values for that feature.
You can think of it as a SQL left outer join, with the main shard on the left side:

    SELECT *
    FROM main_shard ms
    LEFT OUTER JOIN feature_shard_1 fs_1 ON ms.uid = fs_1.uid
    LEFT OUTER JOIN feature_shard_2 fs_2 ON ms.uid = fs_2.uid
    LIMIT 1
    OFFSET 0;

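The same steps can be sketched in plain Python. This is an illustration under stated assumptions, not Lavender Data's implementation: the `join_sample` helper is hypothetical, the main shard is modeled as a list of dicts, and each feature shard is modeled as a pair of its column names and a UID-keyed lookup table.

```python
def join_sample(index, main_rows, feature_shards):
    """Join one sample from the main shard with every feature shard.

    Hypothetical helper: `main_rows` is a list of dicts, and each entry of
    `feature_shards` is a (column_names, rows_by_uid) pair.
    """
    # 1. Sample selection: fetch the sample from the main shard by index.
    sample = dict(main_rows[index])
    # 2. UID-based joining: the UID is the join key for every feature shard.
    uid = sample["uid"]
    for columns, rows_by_uid in feature_shards:
        match = rows_by_uid.get(uid)
        for column in columns:
            # 3. Column merging, or 4. padding with None when no UID matches.
            sample[column] = match[column] if match is not None else None
    return sample


main_rows = [
    {"uid": 1, "image_url": "/image-1.jpg"},
    {"uid": 2, "image_url": "/image-2.jpg"},
]
feature_shards = [
    (["caption"], {1: {"uid": 1, "caption": "caption-1"}}),
]

print(join_sample(0, main_rows, feature_shards))
# {'uid': 1, 'image_url': '/image-1.jpg', 'caption': 'caption-1'}
print(join_sample(1, main_rows, feature_shards))
# {'uid': 2, 'image_url': '/image-2.jpg', 'caption': None}
```

Keying each feature shard by UID keeps the lookup for a single sample constant-time in this sketch, which mirrors the `LIMIT 1 OFFSET 0` shape of the query: one main-shard row is fetched by position, then enriched.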
Example
Let’s say you have three shardsets:
- shardset_1

  uid | image_url |
  ---|---|
  1 | /image-1.jpg |
  2 | /image-2.jpg |
  3 | /image-3.jpg |

- shardset_2

  uid | caption |
  ---|---|
  1 | caption-1 |
  3 | caption-3 |

- shardset_3

  uid | aesthetic_score |
  ---|---|
  2 | 0.8 |
  3 | 0.7 |

The samples will be joined as follows:
uid | image_url | caption | aesthetic_score |
---|---|---|---|
1 | /image-1.jpg | caption-1 | None |
2 | /image-2.jpg | None | 0.8 |
3 | /image-3.jpg | caption-3 | 0.7 |
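
One way to check this result is to load the three example shardsets into an in-memory SQLite database and run the equivalent left outer join. This uses Python's standard `sqlite3` module purely as an analogy; it is an assumption made for illustration and says nothing about how Lavender Data stores or queries shards. `shardset_1` plays the main shard because it has the most samples.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE shardset_1 (uid INTEGER, image_url TEXT);
    CREATE TABLE shardset_2 (uid INTEGER, caption TEXT);
    CREATE TABLE shardset_3 (uid INTEGER, aesthetic_score REAL);
    INSERT INTO shardset_1 VALUES
        (1, '/image-1.jpg'), (2, '/image-2.jpg'), (3, '/image-3.jpg');
    INSERT INTO shardset_2 VALUES (1, 'caption-1'), (3, 'caption-3');
    INSERT INTO shardset_3 VALUES (2, 0.8), (3, 0.7);
    """
)

# shardset_1 is the main shard; the other two are joined on uid.
rows = conn.execute(
    """
    SELECT s1.uid, s1.image_url, s2.caption, s3.aesthetic_score
    FROM shardset_1 s1
    LEFT OUTER JOIN shardset_2 s2 ON s1.uid = s2.uid
    LEFT OUTER JOIN shardset_3 s3 ON s1.uid = s3.uid
    ORDER BY s1.uid
    """
).fetchall()

for row in rows:
    print(row)
# (1, '/image-1.jpg', 'caption-1', None)
# (2, '/image-2.jpg', None, 0.8)
# (3, '/image-3.jpg', 'caption-3', 0.7)
```

SQL `NULL` comes back as Python `None`, which matches the padding shown in the table.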