Closed: liming-ai closed this issue 1 year ago
You're not bothering me, thank you for opening these issues! Others will probably encounter similar issues, and it is good that the solutions are documented.
Seems like something failed when you saved the dataset to disk. Try loading it and saving it to disk:
from datasets import load_dataset
dataset = load_dataset("yuvalkirstain/pickapic_v1")
dataset.save_to_disk("dataset_path")
and then when training, change the dataset config to load from disk using the dataset path.
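For reference, loading the saved copy back uses the datasets load_from_disk API; a minimal sketch, assuming the same path as above:
from datasets import load_from_disk

# Reads the DatasetDict written by save_to_disk above. This only works on a
# directory produced by save_to_disk, not on a folder of raw parquet files.
dataset = load_from_disk("dataset_path")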
Please let me know if it works.
Thanks for your reply. Unfortunately, at the moment I cannot download the data using the API. Could you please tell me how to use the data that has already been downloaded? I have downloaded all the .parquet files from the Hugging Face hub, but there is no .json file, so I cannot train the model normally.
I see, can you load the dataset with the from_parquet function?
Something like this:
from datasets import Dataset, concatenate_datasets, DatasetDict
from collections import defaultdict

split2shards, split2dataset = defaultdict(list), {}
for split in ["train", "validation", "test", "validation_unique", "test_unique"]:
    for shard_path in <parquet_train_paths>:  # the downloaded parquet shards for this split
        split2shards[split].append(Dataset.from_parquet(shard_path))
    split2dataset[split] = concatenate_datasets(split2shards[split])
dataset = DatasetDict(split2dataset)
dataset.save_to_disk("pickapic_regular")
Hi, @yuvalkirstain
Sorry to bother you again, but I have a strange question about the dataset. I followed your instructions to download the dataset:
Then I tried to train the model:
It worked well at first and I could train normally, but after I closed the remote SSH window and reconnected, I had to re-download the whole dataset from scratch. The originally downloaded dataset is still there and has not been deleted.
I also tried to change the dataset config to make it load locally. More specifically, I changed dataset_name to my local path (which was fully downloaded previously) and set from_disk=True:
https://github.com/yuvalkirstain/PickScore/blob/013b54d70bf3bd9112251e7ab5ea8b2e915de3dc/trainer/datasetss/clip_hf_dataset.py#L30
https://github.com/yuvalkirstain/PickScore/blob/013b54d70bf3bd9112251e7ab5ea8b2e915de3dc/trainer/datasetss/clip_hf_dataset.py#L33
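For concreteness, a hypothetical reconstruction of that config change (field names are taken from this thread; the permalinked file defines the actual dataclass):
from dataclasses import dataclass

@dataclass
class DatasetConfig:
    # Local path of the downloaded data instead of the "yuvalkirstain/pickapic_v1" hub id.
    dataset_name: str = "/path/to/local/pickapic_v1"
    # Presumably switches the loader from load_dataset to datasets.load_from_disk.
    from_disk: bool = True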
But another error happened: I am sure that the downloaded dataset contained no file named state.json when I first trained successfully. I have no idea what is wrong and hope you can give me some advice.
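A likely explanation, assuming the standard datasets on-disk layout: save_to_disk writes a state.json per split (plus dataset_dict.json and the arrow shards) that load_from_disk requires, while a folder of raw downloaded parquet files contains no state.json, so from_disk=True cannot point at it directly. A quick check (the raw-download path is hypothetical):
import os

# A save_to_disk directory has per-split state.json files...
print(os.path.exists("pickapic_regular/train/state.json"))  # True after save_to_disk
# ...while a folder of raw parquet downloads does not, hence the error.
print(os.path.exists("pickapic_v1_parquet/state.json"))     # False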