Dataset - Join Method

Lavender Data uses a flexible and efficient approach to join data across multiple shardsets. The system is designed to handle large datasets by breaking them into smaller, manageable pieces called shards, while maintaining the ability to combine data from different sources.

Main Shard and Feature Shards

The joining mechanism in Lavender Data is based on a two-level architecture:

Main Shard: This is the primary data source that contains the core information and serves as the reference point for joining. The main shardset can be configured on the shardset settings page in the UI. If no main shardset is configured, the shardset with the oldest creation date is used as the main shardset.
Feature Shards: These are additional data sources that contain supplementary information that can be joined with the main shard.
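
The fallback rule for choosing the main shardset can be sketched in a few lines of Python. This is only an illustration of the rule described above: the `Shardset` record, its `is_main` and `created_at` fields, and the `pick_main_shardset` helper are hypothetical names, not part of the Lavender Data API.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Shardset:
    # Hypothetical record; field names are illustrative, not Lavender Data's schema.
    name: str
    created_at: datetime
    is_main: bool = False


def pick_main_shardset(shardsets: list[Shardset]) -> Shardset:
    """Return the explicitly configured main shardset, else the oldest one."""
    if not shardsets:
        raise ValueError("a dataset needs at least one shardset")
    for shardset in shardsets:
        if shardset.is_main:
            return shardset
    # No explicit choice: fall back to the earliest creation date.
    return min(shardsets, key=lambda s: s.created_at)


shardsets = [
    Shardset("captions", datetime(2024, 3, 1)),
    Shardset("images", datetime(2024, 1, 15)),
]
main = pick_main_shardset(shardsets)
# Every other shardset is treated as a feature shardset.
feature_shardsets = [s for s in shardsets if s is not main]
```

In this sketch neither shardset is marked as main, so `images` is picked because it was created first, and `captions` becomes a feature shardset.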

How Joining Works

The joining process follows these steps:

Sample Selection: A sample is first retrieved from the main shard using its index.
UID-based Joining: Each sample has a unique identifier (UID) that is used to join data from feature shards.
Column Merging: Data from feature shards is merged with the main shard’s data based on matching UIDs.
Outer Join: If a sample does not have a matching UID in a feature shard, the sample is padded with None values for that feature.

You can think of it as a SQL left outer join with the main shard on the left side:

SELECT * FROM main_shard ms
LEFT OUTER JOIN feature_shard_1 fs_1 ON ms.uid = fs_1.uid
LEFT OUTER JOIN feature_shard_2 fs_2 ON ms.uid = fs_2.uid
LIMIT 1
OFFSET 0;
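
The four steps above can also be sketched in plain Python. This is a minimal illustration, not Lavender Data's implementation: shards are modelled as in-memory lists of dicts, and `index_by_uid` and `join_sample` are hypothetical helper names.

```python
from typing import Any

Row = dict[str, Any]


def index_by_uid(rows: list[Row]) -> dict[Any, Row]:
    """Build a uid -> row lookup for one feature shard."""
    return {row["uid"]: row for row in rows}


def join_sample(
    index: int,
    main_rows: list[Row],
    features: list[tuple[list[str], dict[Any, Row]]],
) -> Row:
    # 1. Sample selection: read the sample from the main shard by index.
    sample = dict(main_rows[index])
    # 2. UID-based joining: the sample's uid is the join key.
    uid = sample["uid"]
    for columns, rows_by_uid in features:
        match = rows_by_uid.get(uid)
        for column in columns:
            # 3. Column merging when a row matches,
            # 4. otherwise padding the column with None.
            sample[column] = match[column] if match is not None else None
    return sample


main_rows = [
    {"uid": 1, "image_url": "/image-1.jpg"},
    {"uid": 2, "image_url": "/image-2.jpg"},
]
captions = index_by_uid([{"uid": 1, "caption": "caption-1"}])
first = join_sample(0, main_rows, [(["caption"], captions)])
second = join_sample(1, main_rows, [(["caption"], captions)])
```

Here `first` carries the matching caption, while `second` has `caption` set to `None` because uid 2 is missing from the feature shard.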

Example

Let’s say you have three shardsets:

shardset_1

uid	image_url
1	/image-1.jpg
2	/image-2.jpg
3	/image-3.jpg

shardset_2

uid	caption
1	caption-1
3	caption-3

shardset_3

uid	aesthetic_score
2	0.8
3	0.7

The samples will be joined as follows:

uid	image_url	caption	aesthetic_score
1	/image-1.jpg	caption-1	`None`
2	/image-2.jpg	`None`	0.8
3	/image-3.jpg	caption-3	0.7
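
To check the joined table yourself, the example can be reproduced with Python's built-in `sqlite3` module. This sketch assumes `shardset_1` is the main shardset and loads the three shardsets into in-memory SQL tables purely for illustration; Lavender Data itself does not need to store shards this way.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE shardset_1 (uid INTEGER PRIMARY KEY, image_url TEXT);
    CREATE TABLE shardset_2 (uid INTEGER PRIMARY KEY, caption TEXT);
    CREATE TABLE shardset_3 (uid INTEGER PRIMARY KEY, aesthetic_score REAL);
    INSERT INTO shardset_1 VALUES
        (1, '/image-1.jpg'), (2, '/image-2.jpg'), (3, '/image-3.jpg');
    INSERT INTO shardset_2 VALUES (1, 'caption-1'), (3, 'caption-3');
    INSERT INTO shardset_3 VALUES (2, 0.8), (3, 0.7);
    """
)

# shardset_1 plays the role of the main shard; the other two are feature shards.
rows = conn.execute(
    """
    SELECT s1.uid, s1.image_url, s2.caption, s3.aesthetic_score
    FROM shardset_1 s1
    LEFT OUTER JOIN shardset_2 s2 ON s1.uid = s2.uid
    LEFT OUTER JOIN shardset_3 s3 ON s1.uid = s3.uid
    ORDER BY s1.uid
    """
).fetchall()
conn.close()

for row in rows:
    print(row)
```

SQL `NULL` values come back as Python `None`, so the missing caption for uid 2 and the missing aesthetic score for uid 1 appear as `None`, as in the table above.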