ybracke / transnormer

A lexical normalizer for historical spelling variants using a transformer architecture.
GNU General Public License v3.0

Cache trouble #41

Closed ybracke closed 1 year ago

ybracke commented 1 year ago

The datasets library stores previously loaded and processed datasets in a cache to reload them quickly. The cache is located here: ~/.cache/huggingface/datasets

I ran into an issue that traced back to caching. I resampled a dataset and then applied a map() transformation. Between runs I changed the size to which the dataset was resampled but kept the transformation the same. It seems the resampling is ignored and datasets simply loads the first cached version of the transformed dataset. Example: first I resampled to 100 examples and applied map(); then I ran it again, resampling to 10_000 examples, and applied map(). The second time it just loaded the 100-example processed dataset from the cache.
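A minimal sketch of the situation; the data file, the shuffle()/select() resampling, and the transformation are placeholders, not the project's actual code:

```python
from datasets import load_dataset

dataset = load_dataset("json", data_files="data.jsonl", split="train")  # placeholder file

# Run 1: resample to 100 examples, then transform.
sampled = dataset.shuffle(seed=42).select(range(100))
processed = sampled.map(lambda ex: {"text": ex["text"].lower()})  # placeholder transformation

# Run 2: same script, but with range(10_000) instead of range(100).
# Expected: map() recomputes over the 10_000 examples.
# Observed: datasets served the cached 100-example result from run 1.
```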

Links:

Possible solutions

  1. Pass load_from_cache_file=False as an argument to map()
  2. Disable caching globally to prevent the issue: add the line datasets.disable_caching() at the beginning of the script
  3. Remove the cached datasets: /home/USERNAME/.cache/huggingface/datasets

Each of these leads to the data being reprocessed: options 1 and 2 on every run, option 3 once, until the cache is rebuilt. A sketch of options 1 and 2 follows below.
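A minimal sketch of options 1 and 2; the data file and preprocess_fn are placeholders:

```python
import datasets
from datasets import load_dataset

# Option 2: disable on-disk caching globally, at the top of the script.
datasets.disable_caching()

dataset = load_dataset("json", data_files="data.jsonl", split="train")  # placeholder file

def preprocess_fn(example):
    # Placeholder transformation.
    return {"text": example["text"].lower()}

# Option 1: force this particular map() call to recompute
# instead of reusing a previously cached result.
processed = dataset.map(preprocess_fn, load_from_cache_file=False)
```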

Possible TODO

Set the boolean value of load_from_cache_file based on an entry in the config file.
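Continuing the sketch above, this could look as follows; the YAML format and the config key name are assumptions, not the project's actual config schema:

```python
import yaml

# Hypothetical config file and key name.
with open("config.yaml") as f:
    config = yaml.safe_load(f)

processed = dataset.map(
    preprocess_fn,
    load_from_cache_file=config.get("load_from_cache_file", False),
)
```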

ybracke commented 1 year ago

Update: commit 46e79276 prevents loading from cache in train_model.py