Dataset - Join Method

Lavender Data can join data across multiple shardsets. Large datasets are broken into smaller, manageable pieces called shards, and data from different sources can still be combined into a single sample.

Main Shard and Feature Shards

The joining mechanism in Lavender Data is based on a two-level architecture:

  1. Main Shard: This is the primary data source that contains the core information and serves as the reference point for joining. The shardset with the most samples is chosen as the main shard. If there are multiple shardsets with the same number of samples, the one with the oldest creation date is chosen as the main shard.
  2. Feature Shards: These are additional data sources that contain supplementary information that can be joined with the main shard.
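
The selection rule for the main shard can be sketched in a few lines of Python. This is only an illustration of the rule described above, not Lavender Data's implementation: the `Shardset` class, its fields, and `pick_main_shardset` are hypothetical names introduced here.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Shardset:
    # Hypothetical stand-in for a shardset record; not the Lavender Data API.
    name: str
    total_samples: int
    created_at: datetime


def pick_main_shardset(shardsets):
    """Most samples wins; ties go to the oldest creation date."""
    if not shardsets:
        raise ValueError("at least one shardset is required")
    # Negate the sample count so that min() prefers the largest shardset,
    # then falls back to the earliest created_at on a tie.
    return min(shardsets, key=lambda s: (-s.total_samples, s.created_at))


shardsets = [
    Shardset("captions", 2, datetime(2024, 1, 1)),
    Shardset("images_copy", 3, datetime(2024, 3, 1)),
    Shardset("images", 3, datetime(2024, 2, 1)),
]
main = pick_main_shardset(shardsets)
print(main.name)  # images
```

Here `images` and `images_copy` both have three samples, so the older one, `images`, becomes the main shard and the other two act as feature shards.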

How Joining Works

The joining process follows these steps:

  1. Sample Selection: A sample is first retrieved from the main shard using its index.
  2. UID-based Joining: Each sample has a unique identifier (UID) that is used to join data from feature shards.
  3. Column Merging: Data from feature shards is merged with the main shard’s data based on matching UIDs.
  4. Outer Join: If a sample does not have a matching UID in a feature shard, the sample is padded with None values for that feature.

You can think of it as a SQL left outer join. The query below fetches the sample at index 0:

SELECT * FROM main_shard ms
LEFT OUTER JOIN feature_shard_1 fs_1 ON ms.uid = fs_1.uid
LEFT OUTER JOIN feature_shard_2 fs_2 ON ms.uid = fs_2.uid
LIMIT 1
OFFSET 0;
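
The same steps can also be sketched in plain Python. This is a minimal illustration under assumptions, not the library's code: `join_sample` is a hypothetical helper, the main shard is modeled as a list of dicts, and each feature shard is modeled as a pair of column names and a UID-keyed lookup table.

```python
def join_sample(index, main_rows, feature_shards):
    """Return the joined sample at `index` in the main shard.

    main_rows: list of dicts, each with a "uid" key (the main shard).
    feature_shards: list of (columns, rows_by_uid) pairs, where columns is
        the list of feature column names and rows_by_uid maps uid -> dict.
    """
    sample = dict(main_rows[index])  # 1. sample selection by index
    uid = sample["uid"]              # 2. the UID drives the join
    for columns, rows_by_uid in feature_shards:
        match = rows_by_uid.get(uid, {})
        for column in columns:
            # 3. column merging; 4. None when the UID has no match
            sample[column] = match.get(column)
    return sample


main_rows = [
    {"uid": 1, "image_url": "/image-1.jpg"},
    {"uid": 2, "image_url": "/image-2.jpg"},
]
captions = (["caption"], {1: {"uid": 1, "caption": "caption-1"}})

print(join_sample(0, main_rows, [captions]))
# {'uid': 1, 'image_url': '/image-1.jpg', 'caption': 'caption-1'}
print(join_sample(1, main_rows, [captions]))
# {'uid': 2, 'image_url': '/image-2.jpg', 'caption': None}
```

Because the loop always starts from a main-shard row, every main sample appears exactly once in the output, and rows that exist only in a feature shard are never returned. That matches the left outer join in the SQL analogy.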

Example

Let’s say you have three shardsets:

  1. shardset_1

    uid  image_url
    1    /image-1.jpg
    2    /image-2.jpg
    3    /image-3.jpg
  2. shardset_2

    uid  caption
    1    caption-1
    3    caption-3
  3. shardset_3

    uid  aesthetic_score
    2    0.8
    3    0.7

shardset_1 has the most samples, so it is the main shard. The samples are joined as follows:

uid  image_url     caption    aesthetic_score
1    /image-1.jpg  caption-1  None
2    /image-2.jpg  None       0.8
3    /image-3.jpg  caption-3  0.7
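
To check the SQL analogy against this example, the three shardsets can be loaded into an in-memory SQLite database with Python's standard `sqlite3` module. This only demonstrates the analogy and says nothing about how Lavender Data stores shards internally; modeling each shardset as a SQL table is an assumption made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Model each shardset from the example as a table (illustrative only).
conn.executescript("""
CREATE TABLE shardset_1 (uid INTEGER PRIMARY KEY, image_url TEXT);
CREATE TABLE shardset_2 (uid INTEGER PRIMARY KEY, caption TEXT);
CREATE TABLE shardset_3 (uid INTEGER PRIMARY KEY, aesthetic_score REAL);
INSERT INTO shardset_1 VALUES
    (1, '/image-1.jpg'), (2, '/image-2.jpg'), (3, '/image-3.jpg');
INSERT INTO shardset_2 VALUES (1, 'caption-1'), (3, 'caption-3');
INSERT INTO shardset_3 VALUES (2, 0.8), (3, 0.7);
""")

# shardset_1 is the main shard; the other two are feature shards.
rows = conn.execute("""
SELECT ms.uid, ms.image_url, fs_1.caption, fs_2.aesthetic_score
FROM shardset_1 ms
LEFT OUTER JOIN shardset_2 fs_1 ON ms.uid = fs_1.uid
LEFT OUTER JOIN shardset_3 fs_2 ON ms.uid = fs_2.uid
ORDER BY ms.uid
""").fetchall()

for row in rows:
    print(row)
# (1, '/image-1.jpg', 'caption-1', None)
# (2, '/image-2.jpg', None, 0.8)
# (3, '/image-3.jpg', 'caption-3', 0.7)
```

SQL `NULL` values come back as Python `None`, which lines up with the `None` padding shown in the joined table.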