Closed by ybracke 1 year ago
Update: I added `datasets.disable_caching()` in `tests/test_train_model`, but running the tests still created files in the cache. I could try to change the arguments of the `map` function, but this is not possible from the test file and would mean changing the original function. Is that worth it?

Update: commit 46e79276 prevents loading from cache in `train_model.py`.
The `datasets` library stores previously loaded and processed datasets in a cache to reload them quickly. The cache is located at `~/.cache/huggingface/datasets`.
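To check whether a run actually wrote files to the cache, something like the following could be used. It is a minimal sketch: the helper names `cache_dir` and `cache_summary` are made up for illustration, and it assumes the default location above unless the `HF_DATASETS_CACHE` environment variable points somewhere else.

```python
import os
from pathlib import Path

# Default location mentioned above; HF_DATASETS_CACHE overrides it if set.
DEFAULT_CACHE = "~/.cache/huggingface/datasets"


def cache_dir() -> Path:
    """Return the directory that is assumed to hold the datasets cache."""
    return Path(os.environ.get("HF_DATASETS_CACHE", DEFAULT_CACHE)).expanduser()


def cache_summary(path: Path) -> tuple[int, int]:
    """Return (number of files, total size in bytes) found below `path`."""
    files = [p for p in path.rglob("*") if p.is_file()]
    return len(files), sum(p.stat().st_size for p in files)


if __name__ == "__main__":
    directory = cache_dir()
    if directory.exists():
        count, size = cache_summary(directory)
        print(f"{directory}: {count} files, {size / 1e6:.1f} MB")
    else:
        print(f"{directory} does not exist yet")
```

Running this before and after the test suite shows whether the file count changed.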
I ran into an issue that traced back to caching. I resampled a dataset and then applied a `map()` transformation. Between runs, I changed the sample size but kept the transformation the same. It seems that the change in dataset size was ignored and `datasets` simply loaded the first cached version of the transformed dataset. Example: first I resampled to 100 examples and applied `map`; then I ran it again, resampling to 10_000 examples, and applied `map`. The second time it just loaded the 100-example processed dataset from the cache.
Possible solutions
- Pass `load_from_cache_file=False` as an argument to `map()`.
- Call `datasets.disable_caching()` at the beginning of the script.
- `/home/USERNAME/.cache/huggingface/datasets`

This will lead to a reprocessing of the data in every run.
Possible TODO
Set the boolean value of `load_from_cache_file` based on an entry in the config file.
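One way this TODO might look, as a rough sketch: the config format (JSON here), the key name `load_from_cache_file`, and the helper `load_from_cache_flag` are assumptions, since the issue does not say how the project's config file is structured. The flag defaults to `False`, so data is reprocessed unless the config explicitly opts in to the cache.

```python
import json


def load_from_cache_flag(config: dict) -> bool:
    """Read the caching flag from a parsed config, defaulting to False."""
    value = config.get("load_from_cache_file", False)
    if not isinstance(value, bool):
        raise TypeError(
            f"load_from_cache_file must be true or false, got {value!r}"
        )
    return value


# Hypothetical config content; in practice this would be read from a file.
config = json.loads('{"load_from_cache_file": false, "subset_size": 10000}')

# Keyword arguments to forward to map(), e.g. dataset.map(fn, **map_kwargs).
map_kwargs = {"load_from_cache_file": load_from_cache_flag(config)}
print(map_kwargs)
```

Rejecting non-boolean values avoids surprises such as the string `"false"` being treated as truthy.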