yuvalkirstain / PickScore

MIT License

How to download a specific split of the dataset? #24

Closed vedantroy closed 2 months ago

vedantroy commented 3 months ago

Is there a way to use the Hugging Face APIs to download only a portion of the dataset? I see that the parquet files are named with validation, validation_unique, test, train, etc. prefixes, but when I try to download a single split, the call seems to download the entire dataset:

from datasets import load_dataset

# DATASET_NAME and CACHE_DIR are defined elsewhere
load_dataset(
    DATASET_NAME,
    cache_dir=CACHE_DIR,
    split='validation',
)

I'm not sure how huggingface datasets works, i.e., is there some metadata file that huggingface can use to map "split" to "files in that split"?

yuvalkirstain commented 3 months ago

Yes, pass streaming=True to load_dataset, so you don't download the entire dataset.
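
For illustration, here is a minimal sketch of what the streaming suggestion might look like in practice. The helper name peek_split and its parameters are hypothetical and not part of this repository or of the datasets library. It only assumes that load_dataset accepts streaming=True and, when a split is given, returns an iterable that yields examples lazily.

```python
from itertools import islice


def peek_split(dataset_name, split="validation", n=3, cache_dir=None):
    """Return the first n examples of one split, read lazily.

    peek_split is a hypothetical helper written for this thread.
    """
    # Imported inside the function so the helper can be defined
    # even in an environment where `datasets` is not installed yet.
    from datasets import load_dataset

    # With streaming=True the result is iterated lazily instead of
    # being fully downloaded and cached before the call returns.
    stream = load_dataset(
        dataset_name,
        split=split,
        streaming=True,
        cache_dir=cache_dir,
    )

    # Pull only the first n examples from the stream.
    return list(islice(stream, n))
```

Calling peek_split(DATASET_NAME, split='validation', cache_dir=CACHE_DIR), with the same DATASET_NAME and CACHE_DIR as in the question, should return a short list of example dicts. A streamed split has no random access, so indexing such as ds[0] is not available; iterate over it instead.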

yuvalkirstain commented 2 months ago

Closing. @vedantroy feel free to reopen if it did not work out.