yuvalkirstain / PickScore

MIT License

Issues about data #3

Closed liming-ai closed 8 months ago

liming-ai commented 1 year ago

Hi, @yuvalkirstain

Sorry to bother you again, but I have a strange question about the dataset. I followed your instructions to download it:

from datasets import load_dataset
dataset = load_dataset("yuvalkirstain/pickapic_v1", num_proc=64)

Then I tried to train the model:

accelerate launch --dynamo_backend no --gpu_ids all --num_processes 8  --num_machines 1 --use_deepspeed trainer/scripts/train.py +experiment=clip_h output_dir=output

It worked well at first and training ran normally, but after I closed the remote SSH session and reconnected, the whole dataset had to be downloaded again from scratch. The originally downloaded files are still there and have not been deleted.

I also tried changing the dataset config so that the data is loaded locally. More specifically, I changed dataset_name to my local path (where the dataset had already been fully downloaded) and set from_disk=True:
https://github.com/yuvalkirstain/PickScore/blob/013b54d70bf3bd9112251e7ab5ea8b2e915de3dc/trainer/datasetss/clip_hf_dataset.py#L30
https://github.com/yuvalkirstain/PickScore/blob/013b54d70bf3bd9112251e7ab5ea8b2e915de3dc/trainer/datasetss/clip_hf_dataset.py#L33

But then a different error occurred:

╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /home/tiger/code/PickScore/trainer/scripts/train.py:175 in <module>                              │
│                                                                                                  │
│   172                                                                                            │
│   173                                                                                            │
│   174 if __name__ == '__main__':                                                                 │
│ ❱ 175 │   main()                                                                                 │
│   176                                                                                            │
│                                                                                                  │
│ /home/tiger/.local/lib/python3.9/site-packages/hydra/main.py:94 in decorated_main                │
│                                                                                                  │
│    91 │   │   │   │   else:                                                                      │
│    92 │   │   │   │   │   # no return value from run_hydra() as it may sometime actually run t   │
│    93 │   │   │   │   │   # multiple times (--multirun)                                          │
│ ❱  94 │   │   │   │   │   _run_hydra(                                                            │
│    95 │   │   │   │   │   │   args=args,                                                         │
│    96 │   │   │   │   │   │   args_parser=args_parser,                                           │
│    97 │   │   │   │   │   │   task_function=task_function,                                       │
│                                                                                                  │
│ /home/tiger/.local/lib/python3.9/site-packages/hydra/_internal/utils.py:394 in _run_hydra        │
│                                                                                                  │
│   391 │   │                                                                                      │
│   392 │   │   if args.run or args.multirun:                                                      │
│   393 │   │   │   run_mode = hydra.get_mode(config_name=config_name, overrides=overrides)        │
│ ❱ 394 │   │   │   _run_app(                                                                      │
│   395 │   │   │   │   run=args.run,                                                              │
│   396 │   │   │   │   multirun=args.multirun,                                                    │
│   397 │   │   │   │   mode=run_mode,                                                             │
│                                                                                                  │
│ /home/tiger/.local/lib/python3.9/site-packages/hydra/_internal/utils.py:457 in _run_app          │
│                                                                                                  │
│   454 │   │   │   overrides.extend(["hydra.mode=MULTIRUN"])                                      │
│   455 │                                                                                          │
│   456 │   if mode == RunMode.RUN:                                                                │
│ ❱ 457 │   │   run_and_report(                                                                    │
│   458 │   │   │   lambda: hydra.run(                                                             │
│   459 │   │   │   │   config_name=config_name,                                                   │
│   460 │   │   │   │   task_function=task_function,                                               │
│                                                                                                  │
│ /home/tiger/.local/lib/python3.9/site-packages/hydra/_internal/utils.py:223 in run_and_report    │
│                                                                                                  │
│   220 │   │   return func()                                                                      │
│   221 │   except Exception as ex:                                                                │
│   222 │   │   if _is_env_set("HYDRA_FULL_ERROR") or is_under_debugger():                         │
│ ❱ 223 │   │   │   raise ex                                                                       │
│   224 │   │   else:                                                                              │
│   225 │   │   │   try:                                                                           │
│   226 │   │   │   │   if isinstance(ex, CompactHydraException):                                  │
│                                                                                                  │
│ /home/tiger/.local/lib/python3.9/site-packages/hydra/_internal/utils.py:220 in run_and_report    │
│                                                                                                  │
│   217                                                                                            │
│   218 def run_and_report(func: Any) -> Any:                                                      │
│   219 │   try:                                                                                   │
│ ❱ 220 │   │   return func()                                                                      │
│   221 │   except Exception as ex:                                                                │
│   222 │   │   if _is_env_set("HYDRA_FULL_ERROR") or is_under_debugger():                         │
│   223 │   │   │   raise ex                                                                       │
│                                                                                                  │
│ /home/tiger/.local/lib/python3.9/site-packages/hydra/_internal/utils.py:458 in <lambda>          │
│                                                                                                  │
│   455 │                                                                                          │
│   456 │   if mode == RunMode.RUN:                                                                │
│   457 │   │   run_and_report(                                                                    │
│ ❱ 458 │   │   │   lambda: hydra.run(                                                             │
│   459 │   │   │   │   config_name=config_name,                                                   │
│   460 │   │   │   │   task_function=task_function,                                               │
│   461 │   │   │   │   overrides=overrides,                                                       │
│                                                                                                  │
│ /home/tiger/.local/lib/python3.9/site-packages/hydra/_internal/hydra.py:132 in run               │
│                                                                                                  │
│   129 │   │   callbacks.on_run_end(config=cfg, config_name=config_name, job_return=ret)          │
│   130 │   │                                                                                      │
│   131 │   │   # access the result to trigger an exception in case the job failed.                │
│ ❱ 132 │   │   _ = ret.return_value                                                               │
│   133 │   │                                                                                      │
│   134 │   │   return ret                                                                         │
│   135                                                                                            │
│                                                                                                  │
│ /home/tiger/.local/lib/python3.9/site-packages/hydra/core/utils.py:260 in return_value           │
│                                                                                                  │
│   257 │   │   │   sys.stderr.write(                                                              │
│   258 │   │   │   │   f"Error executing job with overrides: {self.overrides}" + os.linesep       │
│   259 │   │   │   )                                                                              │
│ ❱ 260 │   │   │   raise self._return_value                                                       │
│   261 │                                                                                          │
│   262 │   @return_value.setter                                                                   │
│   263 │   def return_value(self, value: Any) -> None:                                            │
│                                                                                                  │
│ /home/tiger/.local/lib/python3.9/site-packages/hydra/core/utils.py:186 in run_job                │
│                                                                                                  │
│   183 │   │   with env_override(hydra_cfg.hydra.job.env_set):                                    │
│   184 │   │   │   callbacks.on_job_start(config=config, task_function=task_function)             │
│   185 │   │   │   try:                                                                           │
│ ❱ 186 │   │   │   │   ret.return_value = task_function(task_cfg)                                 │
│   187 │   │   │   │   ret.status = JobStatus.COMPLETED                                           │
│   188 │   │   │   except Exception as e:                                                         │
│   189 │   │   │   │   ret.return_value = e                                                       │
│                                                                                                  │
│ /home/tiger/code/PickScore/trainer/scripts/train.py:83 in main                                   │
│                                                                                                  │
│    80 │   logger.info(f"Loading lr scheduler")                                                   │
│    81 │   lr_scheduler = load_scheduler(cfg.lr_scheduler, optimizer)                             │
│    82 │   logger.info(f"Loading dataloaders")                                                    │
│ ❱  83 │   split2dataloader = load_dataloaders(cfg.dataset)                                       │
│    84 │                                                                                          │
│    85 │   dataloaders = list(split2dataloader.values())                                          │
│    86 │   model, optimizer, lr_scheduler, *dataloaders = accelerator.prepare(model, optimizer,   │
│                                                                                                  │
│ /home/tiger/code/PickScore/trainer/scripts/train.py:22 in load_dataloaders                       │
│                                                                                                  │
│    19 def load_dataloaders(cfg: DictConfig) -> Any:                                              │
│    20 │   dataloaders = {}                                                                       │
│    21 │   for split in [cfg.train_split_name, cfg.valid_split_name, cfg.test_split_name]:        │
│ ❱  22 │   │   dataset = instantiate_with_cfg(cfg, split=split)                                   │
│    23 │   │   should_shuffle = split == cfg.train_split_name                                     │
│    24 │   │   dataloaders[split] = torch.utils.data.DataLoader(                                  │
│    25 │   │   │   dataset,                                                                       │
│                                                                                                  │
│ /home/tiger/code/PickScore/trainer/configs/configs.py:74 in instantiate_with_cfg                 │
│                                                                                                  │
│    71                                                                                            │
│    72 def instantiate_with_cfg(cfg: DictConfig, **kwargs):                                       │
│    73 │   target = _locate(cfg._target_)                                                         │
│ ❱  74 │   return target(cfg, **kwargs)                                                           │
│    75                                                                                            │
│    76                                                                                            │
│    77 defaults = [                                                                               │
│                                                                                                  │
│ /home/tiger/code/PickScore/trainer/datasetss/clip_hf_dataset.py:71 in __init__                   │
│                                                                                                  │
│    68 │   │   self.split = split                                                                 │
│    69 │   │   logger.info(f"Loading {self.split} dataset")                                       │
│    70 │   │                                                                                      │
│ ❱  71 │   │   self.dataset = self.load_hf_dataset(self.split)                                    │
│    72 │   │   logger.info(f"Loaded {len(self.dataset)} examples from {self.split} dataset")      │
│    73 │   │                                                                                      │
│    74 │   │   if self.cfg.keep_only_different:                                                   │
│                                                                                                  │
│ /home/tiger/code/PickScore/trainer/datasetss/clip_hf_dataset.py:138 in load_hf_dataset           │
│                                                                                                  │
│   135 │                                                                                          │
│   136 │   def load_hf_dataset(self, split: str) -> Dataset:                                      │
│   137 │   │   if self.cfg.from_disk:                                                             │
│ ❱ 138 │   │   │   dataset = load_from_disk(self.cfg.dataset_name)[split]                         │
│   139 │   │   else:                                                                              │
│   140 │   │   │   dataset = load_dataset(                                                        │
│   141 │   │   │   │   self.cfg.dataset_name,                                                     │
│                                                                                                  │
│ /home/tiger/.local/lib/python3.9/site-packages/datasets/load.py:1872 in load_from_disk           │
│                                                                                                  │
│   1869 │   if not fs.exists(dest_dataset_path):                                                  │
│   1870 │   │   raise FileNotFoundError(f"Directory {dataset_path} not found")                    │
│   1871 │   if fs.isfile(path_join(dest_dataset_path, config.DATASET_INFO_FILENAME)):             │
│ ❱ 1872 │   │   return Dataset.load_from_disk(dataset_path, keep_in_memory=keep_in_memory, stora  │
│   1873 │   elif fs.isfile(path_join(dest_dataset_path, config.DATASETDICT_JSON_FILENAME)):       │
│   1874 │   │   return DatasetDict.load_from_disk(dataset_path, keep_in_memory=keep_in_memory, s  │
│   1875 │   else:                                                                                 │
│                                                                                                  │
│ /home/tiger/.local/lib/python3.9/site-packages/datasets/arrow_dataset.py:1558 in load_from_disk  │
│                                                                                                  │
│   1555 │   │   │   dataset_path = Dataset._build_local_temp_path(src_dataset_path)               │
│   1556 │   │   │   fs.download(src_dataset_path, dataset_path.as_posix(), recursive=True)        │
│   1557 │   │                                                                                     │
│ ❱ 1558 │   │   with open(Path(dataset_path, config.DATASET_STATE_JSON_FILENAME).as_posix(), enc  │
│   1559 │   │   │   state = json.load(state_file)                                                 │
│   1560 │   │   with open(Path(dataset_path, config.DATASET_INFO_FILENAME).as_posix(), encoding=  │
│   1561 │   │   │   dataset_info = DatasetInfo.from_dict(json.load(dataset_info_file))            │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
FileNotFoundError: [Errno 2] No such file or directory:
'/home/tiger/.cache/huggingface/datasets/yuvalkirstain___parquet/yuvalkirstain--pickapic_v1-2eaf06d63c9783b9/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec/state.json'

I am sure the downloaded dataset did not contain a file named state.json, even when training first ran normally. I have no idea what is wrong and hope you can give me some advice.
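For readers hitting the same error: the traceback shows load_from_disk finding dataset_info.json in the cache folder and then failing to open state.json. A directory written by save_to_disk contains either dataset_dict.json (for a DatasetDict) or state.json together with dataset_info.json (for a single Dataset), so a folder that lacks them was probably not produced by save_to_disk. The helper below is only a rough sketch for checking a path before setting from_disk=True; the function name saved_dataset_kind is made up for illustration and is not part of PickScore or the datasets library.

```python
from pathlib import Path


def saved_dataset_kind(path):
    """Guess what kind of save_to_disk output a directory holds.

    Returns "dataset_dict", "dataset", or None when the marker files
    that load_from_disk looks for are absent.
    """
    root = Path(path)
    # A saved DatasetDict keeps a dataset_dict.json at the top level.
    if (root / "dataset_dict.json").is_file():
        return "dataset_dict"
    # A saved single Dataset needs both state.json and dataset_info.json.
    if (root / "state.json").is_file() and (root / "dataset_info.json").is_file():
        return "dataset"
    return None
```

If this returns None for the path given as dataset_name, load_from_disk will most likely fail on it, and the data would first have to be re-saved with save_to_disk.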

yuvalkirstain commented 1 year ago

You're not bothering me at all, thank you for opening these issues! Others will probably run into similar problems, so it is good to have the solutions documented.

Seems like something failed when you saved the dataset to disk. Try loading it and saving it to disk:

from datasets import load_dataset
dataset = load_dataset("yuvalkirstain/pickapic_v1")
dataset.save_to_disk("dataset_path")

and then when training, change the dataset config to load from disk using the dataset path.

Please let me know whether it works.

liming-ai commented 1 year ago

Thanks for your reply. Unfortunately, I cannot download the data through the API at the moment. Could you please tell me how to use the data I have already downloaded? I downloaded all the .parquet files from Hugging Face, but there is no .json file among them, so I cannot train the model normally.

yuvalkirstain commented 1 year ago

I see. Can you load the dataset with the from_parquet function?

Something like this:

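from collections import defaultdict
from pathlib import Path

from datasets import Dataset, DatasetDict, concatenate_datasets

# Assumption: the folder holding your downloaded shards, with files
# named "<split>-*.parquet". Adjust the path and pattern to your layout.
parquet_dir = Path("path/to/parquet_files")

split2shards, split2dataset = defaultdict(list), {}
for split in ["train", "validation", "test", "validation_unique", "test_unique"]:
    # Load only the shards that belong to this split.
    for shard_path in sorted(parquet_dir.glob(f"{split}-*.parquet")):
        split2shards[split].append(Dataset.from_parquet(str(shard_path)))
    split2dataset[split] = concatenate_datasets(split2shards[split])
dataset = DatasetDict(split2dataset)
# Writes the layout (dataset_dict.json, state.json, ...) that load_from_disk expects.
dataset.save_to_disk("pickapic_regular")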
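Before running the conversion, it may help to check that every split actually has shards, since concatenating an empty list of datasets fails. The sketch below groups file names by split using only the standard library. It assumes the shards are named "<split>-<index>-of-<total>-<hash>.parquet"; that pattern is an assumption about how the files were downloaded, and the helper names are hypothetical.

```python
import re
from collections import defaultdict

SPLITS = ["train", "validation", "test", "validation_unique", "test_unique"]

# Assumed shard naming: "<split>-<index>-of-<total>[-<hash>].parquet".
SHARD_RE = re.compile(r"^(?P<split>.+?)-\d+-of-\d+(?:-[0-9a-f]+)?\.parquet$")


def group_shards_by_split(filenames):
    """Map each known split name to its sorted shard file names."""
    split2files = defaultdict(list)
    for name in filenames:
        match = SHARD_RE.match(name)
        # Files that do not look like shards of a known split are ignored.
        if match and match.group("split") in SPLITS:
            split2files[match.group("split")].append(name)
    return {split: sorted(files) for split, files in split2files.items()}


def missing_splits(split2files):
    """List the expected splits for which no shard was found."""
    return [split for split in SPLITS if not split2files.get(split)]
```

Matching the full "<split>-<index>-of-<total>" prefix keeps "validation" shards separate from "validation_unique" ones, which a looser prefix match on the split name would mix up.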