ybracke / transnormer

A lexical normalizer for historical spelling variants using a transformer architecture.
GNU General Public License v3.0

Cache trouble #41

Closed ybracke closed 1 year ago

ybracke commented 1 year ago

The datasets library stores previously loaded and processed datasets in a cache to reload them quickly. The cache is located here: ~/.cache/huggingface/datasets

I ran into an issue that traced back to caching. I resampled a dataset and then applied a map() transformation. Between runs I changed the size to which the dataset was resampled but kept the transformation the same. It seems the resampling is ignored and datasets simply loads the first cached version of the transformed dataset. Example: first I resampled to 100 examples and applied map(); then I ran it again, resampling to 10_000 examples, and applied map(). The second time it just loaded the 100-example processed dataset from the cache.
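A minimal sketch of the situation; the data file, the shuffle()/select() resampling, and the transformation are placeholders, not the project's actual code:

```python
from datasets import load_dataset

dataset = load_dataset("json", data_files="data.jsonl", split="train")  # placeholder file

# Run 1: resample to 100 examples, then transform.
sampled = dataset.shuffle(seed=42).select(range(100))
processed = sampled.map(lambda ex: {"text": ex["text"].lower()})  # placeholder transformation

# Run 2: same script, but with range(10_000) instead of range(100).
# Expected: map() recomputes over the 10_000 examples.
# Observed: datasets served the cached 100-example result from run 1.
```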

Links:

Possible solutions

  1. Pass load_from_cache_file=False as an argument to map()
  2. Disable caching globally to prevent the issue: add the line datasets.disable_caching() at the beginning of the script
  3. Remove the cached datasets: /home/USERNAME/.cache/huggingface/datasets

Each of these leads to the data being reprocessed: options 1 and 2 on every run, option 3 once, until the cache is rebuilt. A sketch of options 1 and 2 follows below.
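A minimal sketch of options 1 and 2; the data file and preprocess_fn are placeholders:

```python
import datasets
from datasets import load_dataset

# Option 2: disable on-disk caching globally, at the top of the script.
datasets.disable_caching()

dataset = load_dataset("json", data_files="data.jsonl", split="train")  # placeholder file

def preprocess_fn(example):
    # Placeholder transformation.
    return {"text": example["text"].lower()}

# Option 1: force this particular map() call to recompute
# instead of reusing a previously cached result.
processed = dataset.map(preprocess_fn, load_from_cache_file=False)
```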

Possible TODO

Set the boolean value of load_from_cache_file based on an entry in the config file.
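Continuing the sketch above, this could look as follows; the YAML format and the config key name are assumptions, not the project's actual config schema:

```python
import yaml

# Hypothetical config file and key name.
with open("config.yaml") as f:
    config = yaml.safe_load(f)

processed = dataset.map(
    preprocess_fn,
    load_from_cache_file=config.get("load_from_cache_file", False),
)
```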

ybracke commented 1 year ago

Update: commit 46e79276 prevents loading from cache in train_model.py