Server - Cache
Lavender Data caches yielded samples/batches to avoid redundant processing.
Cache Key
The cache key is a combination of the hashed iteration parameters and the indices of the samples/batches.
```python
import hashlib
import json

def hash(o: object) -> str:
    return hashlib.sha256(json.dumps(o).encode("utf-8")).hexdigest()

# Hash of the parameters that define the iteration.
iteration_hash = hash({
    "dataset_id": iteration.dataset.id,
    "batch_size": iteration.batch_size,
    "shardsets": [s.id for s in iteration.shardsets],
    "collater": iteration.collater,
    "filters": iteration.filters,
    "preprocessors": iteration.preprocessors,
})

# Combined with the requested indices to form the cache key.
cache_key = hash({
    "iteration_hash": iteration_hash,
    "indices": indices,
})
```

Please refer to the source code for more details.
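To see how the key behaves, here is a runnable sketch with made-up parameter values (the dataset and shardset IDs are hypothetical): identical parameters and indices always map to the same cache entry, while different indices map to a different one.

```python
import hashlib
import json

def hash(o: object) -> str:
    return hashlib.sha256(json.dumps(o).encode("utf-8")).hexdigest()

# Hypothetical parameter values, for illustration only.
iteration_hash = hash({
    "dataset_id": "ds-123",
    "batch_size": 32,
    "shardsets": ["shardset-1"],
    "collater": None,
    "filters": [],
    "preprocessors": [],
})

key_a = hash({"iteration_hash": iteration_hash, "indices": [0, 1, 2]})
key_b = hash({"iteration_hash": iteration_hash, "indices": [0, 1, 2]})
key_c = hash({"iteration_hash": iteration_hash, "indices": [3, 4, 5]})

assert key_a == key_b  # same parameters and indices -> same cache entry
assert key_a != key_c  # different indices -> different cache entry
```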
TTL
Cache entries are deleted after a configurable time to live (TTL). You can set the TTL, in seconds, with the LAVENDER_DATA_BATCH_CACHE_TTL environment variable.
```bash
export LAVENDER_DATA_BATCH_CACHE_TTL=300  # 300 seconds
```

The default value is 5 minutes (300 seconds).
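As an illustration of this expiry behavior (not Lavender Data's actual implementation), here is a minimal in-memory cache that drops entries once their TTL has elapsed; the TTLCache class is hypothetical:

```python
import os
import time

# Read the TTL as described above, falling back to the 300-second default.
TTL_SECONDS = int(os.environ.get("LAVENDER_DATA_BATCH_CACHE_TTL", "300"))


class TTLCache:
    """Minimal in-memory cache with per-entry expiry (illustration only)."""

    def __init__(self, ttl_seconds: float) -> None:
        self.ttl = ttl_seconds
        self._entries: dict[str, tuple[float, object]] = {}

    def set(self, key: str, value: object) -> None:
        # Record when this entry should stop being served.
        self._entries[key] = (time.monotonic() + self.ttl, value)

    def get(self, key: str) -> object | None:
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if time.monotonic() > expires_at:
            del self._entries[key]  # the entry outlived its TTL
            return None
        return value


cache = TTLCache(TTL_SECONDS)
cache.set("cache-key", {"batch": "..."})
assert cache.get("cache-key") is not None  # still fresh
```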
Ignoring Cache
You can ignore the cache and reprocess the data by setting the no_cache parameter to True when creating the iteration instance.
```python
iteration = Iteration.from_dataset(
    dataset_id=dataset.id,
    shardsets=[shardset.id],
    no_cache=True,
)
```

Cache Location
The cache is stored in the server's memory by default.
Caching cannot be disabled entirely, because client-side prefetching relies on it. If you want to scale the server out to multiple machines or processes, you can use Redis as the cache backend.
Install the redis extra and set the LAVENDER_DATA_REDIS_URL environment variable to the desired Redis URL.
```bash
pip install "lavender-data[redis]"  # quotes keep the shell from globbing the brackets
export LAVENDER_DATA_REDIS_URL=redis://localhost:6379/0
```

If the server does not have enough memory, consider using Redis or setting a smaller TTL.
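For a sense of how such a backend fits together, here is a hedged sketch using the redis-py client; the cache_set/cache_get helpers and the use of pickle for serialization are assumptions for illustration, not the server's actual code:

```python
import os
import pickle

import redis  # installed via the redis extra above

# Illustrative sketch of a Redis-backed cache with TTL, assuming the
# environment variables described above (not the server's actual code).
r = redis.Redis.from_url(os.environ["LAVENDER_DATA_REDIS_URL"])
ttl = int(os.environ.get("LAVENDER_DATA_BATCH_CACHE_TTL", "300"))


def cache_set(key: str, batch: object) -> None:
    # SETEX stores the value and lets Redis expire it after `ttl` seconds.
    r.setex(key, ttl, pickle.dumps(batch))


def cache_get(key: str):
    raw = r.get(key)
    return pickle.loads(raw) if raw is not None else None
```

Because Redis handles expiry itself, every server process pointed at the same Redis URL sees the same cache entries, which is what makes scaling out to multiple machines possible.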