Server - Cache
Lavender Data caches yielded samples/batches to avoid redundant processing.
Cache Key
The cache key is a combination of the hashed iteration parameters and the indices of the samples/batches.
import hashlib
import json

def hash(o: object) -> str:
    return hashlib.sha256(json.dumps(o).encode("utf-8")).hexdigest()

iteration_hash = hash({
    "dataset_id": iteration.dataset.id,
    "batch_size": iteration.batch_size,
    "shardsets": [s.id for s in iteration.shardsets],
    "collater": iteration.collater,
    "filters": iteration.filters,
    "preprocessors": iteration.preprocessors,
})

cache_key = hash({
    "iteration_hash": iteration_hash,
    "indices": indices,
})
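For illustration, here is a self-contained sketch (the parameter values are hypothetical, not from a real dataset) showing that the key is deterministic: the same iteration parameters and indices always map to the same cache entry, while different indices map to a different one.

import hashlib
import json

def hash(o: object) -> str:
    return hashlib.sha256(json.dumps(o).encode("utf-8")).hexdigest()

# Hypothetical iteration parameters, for illustration only
iteration_hash = hash({
    "dataset_id": "ds-123",
    "batch_size": 32,
    "shardsets": ["ss-1"],
    "collater": None,
    "filters": [],
    "preprocessors": [],
})

key_a = hash({"iteration_hash": iteration_hash, "indices": [0, 1, 2, 3]})
key_b = hash({"iteration_hash": iteration_hash, "indices": [0, 1, 2, 3]})
key_c = hash({"iteration_hash": iteration_hash, "indices": [4, 5, 6, 7]})

assert key_a == key_b  # same iteration, same indices -> cache hit
assert key_a != key_c  # different indices -> separate cache entry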
Please refer to the source code for more details.
TTL
Cache entries are deleted after a certain time to live (TTL).
You can set the TTL, in seconds, with the LAVENDER_DATA_BATCH_CACHE_TTL environment variable.
export LAVENDER_DATA_BATCH_CACHE_TTL=300 # 300 seconds
The default value is 5 minutes (300 seconds).
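Conceptually, a TTL cache evicts entries whose age exceeds the configured limit. The following is a minimal in-memory sketch of that behavior, reading the same environment variable; it is not Lavender Data's actual implementation.

import os
import time

TTL = int(os.environ.get("LAVENDER_DATA_BATCH_CACHE_TTL", "300"))

_cache: dict[str, tuple[float, bytes]] = {}

def cache_set(key: str, value: bytes) -> None:
    # Remember when the entry was stored alongside its value
    _cache[key] = (time.monotonic(), value)

def cache_get(key: str) -> bytes | None:
    entry = _cache.get(key)
    if entry is None:
        return None
    stored_at, value = entry
    if time.monotonic() - stored_at > TTL:
        del _cache[key]  # entry expired; evict lazily on access
        return None
    return value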
Ignoring Cache
You can ignore the cache and reprocess the data by setting the no_cache parameter to True when creating the iteration instance.
iteration = Iteration.from_dataset(
    dataset_id=dataset.id,
    shardsets=[shardset.id],
    no_cache=True,
)
Cache location
By default, the cache is stored in the server's memory. Because client-side prefetching relies on it, caching cannot be disabled entirely. If you want to scale the server out to multiple machines or processes, you can use Redis as the cache backend.
Install the redis extra and set the LAVENDER_DATA_REDIS_URL environment variable to the desired Redis URL.
pip install lavender-data[redis]
export LAVENDER_DATA_REDIS_URL=redis://localhost:6379/0
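To sanity-check the connection, you can ping the same URL with the redis Python package (assumed here to be what the redis extra installs):

import redis

# Connect using the same URL the server reads from LAVENDER_DATA_REDIS_URL
client = redis.Redis.from_url("redis://localhost:6379/0")
assert client.ping()  # raises if the Redis server is unreachable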
If the server does not have enough memory, consider using Redis or setting a smaller TTL.