Dataset - Shardsets
A Shardset represents a collection of related columns (features) stored as Shards (files).
Directory Structure
Each shardset can be thought of as a directory containing multiple shard files. The directory should be flat and contain only the shard files. If the order matters, sort the shards by filename.
Directory/shardset_1/
- shard.00000.csv
- shard.00001.csv
- …
Directory/shardset_2/
- shard.00000.csv
- shard.00001.csv
- …
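As a minimal sketch of the layout above, the shards of a flat shardset directory can be listed in filename order with the Python standard library. The helper name list_shards is hypothetical and is not part of Lavender Data.

```python
from pathlib import Path


def list_shards(shardset_dir):
    """Return the shard files of a flat shardset directory, sorted by filename.

    Hypothetical helper for illustration; not a Lavender Data API.
    """
    root = Path(shardset_dir)
    # A flat shardset has no subdirectories, so keep regular files only.
    files = (p for p in root.iterdir() if p.is_file())
    # Plain string sort on the filename.
    return sorted(files, key=lambda p: p.name)
```

Sorting by filename is a string sort, so it matches numeric order only when the index is zero-padded to a fixed width, as in shard.00000.csv and shard.00001.csv.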
File Format
Shard files should be in a tabular format. Lavender Data currently supports CSV and Parquet.
All shard files within a shardset should share the same schema.
uid,image_url,caption
0,https://example.com/image-00000.jpg,Caption for image 00000
1,https://example.com/image-00001.jpg,Caption for image 00001
2,https://example.com/image-00002.jpg,Caption for image 00002
3,https://example.com/image-00003.jpg,Caption for image 00003
4,https://example.com/image-00004.jpg,Caption for image 00004
5,https://example.com/image-00005.jpg,Caption for image 00005
6,https://example.com/image-00006.jpg,Caption for image 00006
7,https://example.com/image-00007.jpg,Caption for image 00007
8,https://example.com/image-00008.jpg,Caption for image 00008
9,https://example.com/image-00009.jpg,Caption for image 00009
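The same-schema requirement can be checked before a shardset is used. The sketch below is an assumption-laden illustration, not Lavender Data's own validation: it handles CSV shards only, compares only the header row (column names and their order), and uses the hypothetical helper names read_header and check_same_schema. Parquet shards would need a Parquet reader instead.

```python
import csv
from pathlib import Path


def read_header(shard_path):
    """Return the column names from the first row of a CSV shard."""
    with open(shard_path, newline="") as f:
        return next(csv.reader(f), [])


def check_same_schema(shardset_dir):
    """Return the shared header of all CSV shards, or raise ValueError.

    Hypothetical helper: compares column names only, not column types.
    """
    shards = sorted(Path(shardset_dir).glob("*.csv"))
    if not shards:
        return []
    expected = read_header(shards[0])
    for shard in shards[1:]:
        header = read_header(shard)
        if header != expected:
            raise ValueError(
                f"{shard.name} has columns {header}, expected {expected}"
            )
    return expected
```

For the example shard above, the returned header would be the three columns uid, image_url, and caption.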
Location
The directory can be located on a file system or in cloud storage.
The location field of the shardset identifies where the shardset is stored.
# file system
shardset.location = "file:///path/to/the/shardset"
# s3
shardset.location = "s3://my-bucket/path/to/the/shardset"
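To illustrate how such a location string breaks down, here is a small sketch using urllib.parse from the Python standard library. The function describe_location is hypothetical, and it recognises only the two schemes shown above; rejecting every other scheme is an assumption of this sketch, not a statement about which storage backends Lavender Data supports.

```python
from urllib.parse import urlparse


def describe_location(location):
    """Split a shardset location URI into its storage type and path parts.

    Hypothetical helper for illustration; not a Lavender Data API.
    """
    parsed = urlparse(location)
    if parsed.scheme == "file":
        # file:///path/to/x has an empty host and an absolute path.
        return {"storage": "file system", "path": parsed.path}
    if parsed.scheme == "s3":
        # s3://bucket/prefix puts the bucket in the host position.
        return {
            "storage": "s3",
            "bucket": parsed.netloc,
            "key_prefix": parsed.path.lstrip("/"),
        }
    # Assumption of this sketch: anything else is treated as unsupported.
    raise ValueError(f"unrecognised location scheme: {parsed.scheme!r}")
```

Note the three slashes in the file system form: two belong to the scheme separator and the third begins the absolute path.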