xinntao / EDVR

Winning Solution in NTIRE19 Challenges on Video Restoration and Enhancement (CVPR19 Workshops) - Video Restoration with Enhanced Deformable Convolutional Networks. EDVR has been merged into BasicSR and this repo is a mirror of BasicSR.
https://github.com/xinntao/BasicSR

custom dataset when training detects only 3 images #72

Open alexisdrakopoulos opened 4 years ago

alexisdrakopoulos commented 4 years ago

I've created my own dataset with the exact same structure as REDS, apart from a different resolution. I successfully created the LMDB dataset, and it passes every test I throw at it.

However, when I attempt to train, it reports only 3 images and then crashes.

I also notice that it seems to create the model / load things multiple times compared to the example log provided in the repo. I can't figure out what causes this behavior.

I have over 30,000 images.

Full traceback:

export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
19-07-27 14:03:39.205 - INFO: Temporal augmentation interval list: [1], with random reverse is False.
19-07-27 14:03:39.205 - INFO: Temporal augmentation interval list: [1], with random reverse is False.
19-07-27 14:03:39.205 - INFO: Using cache keys: /mnt/sdb/EDVR/datasets/train_sharp_wval.lmdb/meta_info.pkl
19-07-27 14:03:39.205 - INFO: Using cache keys: /mnt/sdb/EDVR/datasets/train_sharp_wval.lmdb/meta_info.pkl
19-07-27 14:03:39.205 - INFO: Using cache keys - /mnt/sdb/EDVR/datasets/train_sharp_wval.lmdb/meta_info.pkl.
19-07-27 14:03:39.205 - INFO: Using cache keys - /mnt/sdb/EDVR/datasets/train_sharp_wval.lmdb/meta_info.pkl.
19-07-27 14:03:39.209 - INFO: Dataset [REDSDataset - REDS] is created.
19-07-27 14:03:39.209 - INFO: Dataset [REDSDataset - REDS] is created.
19-07-27 14:03:40.204 - INFO: Temporal augmentation interval list: [1], with random reverse is False.
19-07-27 14:03:40.204 - INFO: Temporal augmentation interval list: [1], with random reverse is False.
19-07-27 14:03:40.204 - INFO: Using cache keys: /mnt/sdb/EDVR/datasets/train_sharp_wval.lmdb/meta_info.pkl
19-07-27 14:03:40.204 - INFO: Using cache keys: /mnt/sdb/EDVR/datasets/train_sharp_wval.lmdb/meta_info.pkl
19-07-27 14:03:40.204 - INFO:   name: 001_EDVRwoTSA_scratch_lr4e-4_600k_REDS_LrCAR4S
  use_tb_logger: True
  model: VideoSR_base
  distortion: sr
  scale: 4
  gpu_ids: [0, 1, 2, 3, 4, 5, 6, 7]
  datasets:[
    train:[
      name: REDS
      mode: REDS
      interval_list: [1]
      random_reverse: False
      border_mode: False
      dataroot_GT: /mnt/sdb/EDVR/datasets/train_sharp_wval.lmdb
      dataroot_LQ: /mnt/sdb/EDVR/datasets/train_sharp_bicubic_wval.lmdb
      cache_keys: /mnt/sdb/EDVR/datasets/train_sharp_wval.lmdb/meta_info.pkl
      N_frames: 5
      use_shuffle: True
      n_workers: 3
      batch_size: 32
      GT_size: 256
      LQ_size: 64
      use_flip: True
      use_rot: True
      color: RGB
      phase: train
      scale: 4
      data_type: lmdb
    ]
  ]
  network_G:[
    which_model_G: EDVR
    nf: 64
    nframes: 5
    groups: 8
    front_RBs: 5
    back_RBs: 10
    predeblur: False
    HR_in: False
    w_TSA: False
    scale: 4
  ]
  path:[
    pretrain_model_G: None
    strict_load: True
    resume_state: None
    root: /mnt/sdb/EDVR
    experiments_root: /mnt/sdb/EDVR/experiments/001_EDVRwoTSA_scratch_lr4e-4_600k_REDS_LrCAR4S
    models: /mnt/sdb/EDVR/experiments/001_EDVRwoTSA_scratch_lr4e-4_600k_REDS_LrCAR4S/models
    training_state: /mnt/sdb/EDVR/experiments/001_EDVRwoTSA_scratch_lr4e-4_600k_REDS_LrCAR4S/training_state
    log: /mnt/sdb/EDVR/experiments/001_EDVRwoTSA_scratch_lr4e-4_600k_REDS_LrCAR4S
    val_images: /mnt/sdb/EDVR/experiments/001_EDVRwoTSA_scratch_lr4e-4_600k_REDS_LrCAR4S/val_images
  ]
  train:[
    lr_G: 0.0004
    lr_scheme: CosineAnnealingLR_Restart
    beta1: 0.9
    beta2: 0.99
    niter: 600000
    warmup_iter: -1
    T_period: [150000, 150000, 150000, 150000]
    restarts: [150000, 300000, 450000]
    restart_weights: [1, 1, 1]
    eta_min: 1e-07
    pixel_criterion: cb
    pixel_weight: 1.0
    val_freq: 2000.0
    manual_seed: 0
  ]
  logger:[
    print_freq: 100
    save_checkpoint_freq: 2000.0
  ]
  is_train: True
  dist: True
19-07-27 14:03:40.204 - INFO: Using cache keys - /mnt/sdb/EDVR/datasets/train_sharp_wval.lmdb/meta_info.pkl.
19-07-27 14:03:40.204 - INFO: Using cache keys - /mnt/sdb/EDVR/datasets/train_sharp_wval.lmdb/meta_info.pkl.

19-07-27 14:03:40.204 - INFO: Temporal augmentation interval list: [1], with random reverse is False.
19-07-27 14:03:40.205 - INFO: Using cache keys: /mnt/sdb/EDVR/datasets/train_sharp_wval.lmdb/meta_info.pkl
19-07-27 14:03:40.205 - INFO: Using cache keys - /mnt/sdb/EDVR/datasets/train_sharp_wval.lmdb/meta_info.pkl.
19-07-27 14:03:40.205 - INFO: Temporal augmentation interval list: [1], with random reverse is False.
19-07-27 14:03:40.205 - INFO: Using cache keys: /mnt/sdb/EDVR/datasets/train_sharp_wval.lmdb/meta_info.pkl
19-07-27 14:03:40.205 - INFO: Using cache keys - /mnt/sdb/EDVR/datasets/train_sharp_wval.lmdb/meta_info.pkl.
19-07-27 14:03:40.206 - INFO: Temporal augmentation interval list: [1], with random reverse is False.
19-07-27 14:03:40.206 - INFO: Using cache keys: /mnt/sdb/EDVR/datasets/train_sharp_wval.lmdb/meta_info.pkl
19-07-27 14:03:40.206 - INFO: Using cache keys - /mnt/sdb/EDVR/datasets/train_sharp_wval.lmdb/meta_info.pkl.
19-07-27 14:03:40.209 - INFO: Dataset [REDSDataset - REDS] is created.
19-07-27 14:03:40.209 - INFO: Dataset [REDSDataset - REDS] is created.
19-07-27 14:03:40.209 - INFO: Dataset [REDSDataset - REDS] is created.
19-07-27 14:03:40.212 - INFO: Dataset [REDSDataset - REDS] is created.
19-07-27 14:03:40.213 - INFO: Dataset [REDSDataset - REDS] is created.
19-07-27 14:03:40.946 - INFO: Random seed: 0
19-07-27 14:03:40.949 - INFO: Temporal augmentation interval list: [1], with random reverse is False.
19-07-27 14:03:40.949 - INFO: Using cache keys: /mnt/sdb/EDVR/datasets/train_sharp_wval.lmdb/meta_info.pkl
19-07-27 14:03:40.949 - INFO: Using cache keys - /mnt/sdb/EDVR/datasets/train_sharp_wval.lmdb/meta_info.pkl.
19-07-27 14:03:40.956 - INFO: Dataset [REDSDataset - REDS] is created.
19-07-27 14:03:40.956 - INFO: Number of train images: 3, iters: 1
19-07-27 14:03:40.956 - INFO: Total epochs needed: 3000 for iters 600,000
19-07-27 14:03:43.571 - INFO: Model [VideoSRBaseModel] is created.
19-07-27 14:03:43.571 - INFO: Start training from epoch: 0, iter: 0
19-07-27 14:03:43.572 - INFO: Model [VideoSRBaseModel] is created.
19-07-27 14:03:43.572 - INFO: Start training from epoch: 0, iter: 0
19-07-27 14:03:43.573 - INFO: Network G structure: DistributedDataParallel, with parameters: 2,996,259
19-07-27 14:03:43.573 - INFO: EDVR(
  (conv_first): Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (feature_extraction): Sequential(
    (0): ResidualBlock_noBN(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    )
    (1): ResidualBlock_noBN(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    )
    (2): ResidualBlock_noBN(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    )
    (3): ResidualBlock_noBN(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    )
    (4): ResidualBlock_noBN(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    )
  )
  (fea_L2_conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
  (fea_L2_conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (fea_L3_conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
  (fea_L3_conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (pcd_align): PCD_Align(
    (L3_offset_conv1): Conv2d(128, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (L3_offset_conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (L3_dcnpack): DCN_sep(
      (conv_offset_mask): Conv2d(64, 216, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    )
    (L2_offset_conv1): Conv2d(128, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (L2_offset_conv2): Conv2d(128, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (L2_offset_conv3): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (L2_dcnpack): DCN_sep(
      (conv_offset_mask): Conv2d(64, 216, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    )
    (L2_fea_conv): Conv2d(128, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (L1_offset_conv1): Conv2d(128, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (L1_offset_conv2): Conv2d(128, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (L1_offset_conv3): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (L1_dcnpack): DCN_sep(
      (conv_offset_mask): Conv2d(64, 216, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    )
    (L1_fea_conv): Conv2d(128, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (cas_offset_conv1): Conv2d(128, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (cas_offset_conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (cas_dcnpack): DCN_sep(
      (conv_offset_mask): Conv2d(64, 216, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    )
    (lrelu): LeakyReLU(negative_slope=0.1, inplace)
  )
  (tsa_fusion): Conv2d(320, 64, kernel_size=(1, 1), stride=(1, 1))
  (recon_trunk): Sequential(
    (0): ResidualBlock_noBN(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    )
    (1): ResidualBlock_noBN(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    )
    (2): ResidualBlock_noBN(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    )
    (3): ResidualBlock_noBN(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    )
    (4): ResidualBlock_noBN(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    )
    (5): ResidualBlock_noBN(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    )
    (6): ResidualBlock_noBN(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    )
    (7): ResidualBlock_noBN(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    )
    (8): ResidualBlock_noBN(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    )
    (9): ResidualBlock_noBN(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    )
  )
  (upconv1): Conv2d(64, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (upconv2): Conv2d(64, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (pixel_shuffle): PixelShuffle(upscale_factor=2)
  (HRconv): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (conv_last): Conv2d(64, 3, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (lrelu): LeakyReLU(negative_slope=0.1, inplace)
)
19-07-27 14:03:43.574 - INFO: Model [VideoSRBaseModel] is created.
19-07-27 14:03:43.574 - INFO: Start training from epoch: 0, iter: 0
19-07-27 14:03:43.575 - INFO: Model [VideoSRBaseModel] is created.
19-07-27 14:03:43.575 - INFO: Model [VideoSRBaseModel] is created.
19-07-27 14:03:43.575 - INFO: Start training from epoch: 0, iter: 0
19-07-27 14:03:43.575 - INFO: Start training from epoch: 0, iter: 0
19-07-27 14:03:43.575 - INFO: Model [VideoSRBaseModel] is created.
19-07-27 14:03:43.575 - INFO: Model [VideoSRBaseModel] is created.
19-07-27 14:03:43.575 - INFO: Start training from epoch: 0, iter: 0
19-07-27 14:03:43.575 - INFO: Start training from epoch: 0, iter: 0
19-07-27 14:03:43.577 - INFO: Model [VideoSRBaseModel] is created.
19-07-27 14:03:43.578 - INFO: Start training from epoch: 0, iter: 0
Traceback (most recent call last):
  File "train.py", line 198, in <module>
    main()
  File "train.py", line 153, in main
    for _, train_data in enumerate(train_loader):
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 582, in __next__
    return self._process_next_batch(batch)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 608, in _process_next_batch
Traceback (most recent call last):
    raise batch.exc_type(batch.exc_msg)
ValueError: Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 99, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 99, in <listcomp>
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/mnt/sdb/EDVR/codes/data/REDS_dataset.py", line 110, in __getitem__
    name_a, name_b = key.split('_')
ValueError: not enough values to unpack (expected 2, got 1)

  File "train.py", line 198, in <module>
    main()
  File "train.py", line 153, in main
    for _, train_data in enumerate(train_loader):
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 582, in __next__
    return self._process_next_batch(batch)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 608, in _process_next_batch
    raise batch.exc_type(batch.exc_msg)
ValueError: Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 99, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 99, in <listcomp>
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/mnt/sdb/EDVR/codes/data/REDS_dataset.py", line 110, in __getitem__
    name_a, name_b = key.split('_')
ValueError: not enough values to unpack (expected 2, got 1)

Traceback (most recent call last):
  File "train.py", line 198, in <module>
    main()
  File "train.py", line 153, in main
    for _, train_data in enumerate(train_loader):
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 582, in __next__
    return self._process_next_batch(batch)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 608, in _process_next_batch
    raise batch.exc_type(batch.exc_msg)
ValueError: Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 99, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 99, in <listcomp>
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/mnt/sdb/EDVR/codes/data/REDS_dataset.py", line 110, in __getitem__
    name_a, name_b = key.split('_')
ValueError: not enough values to unpack (expected 2, got 1)

Traceback (most recent call last):
  File "train.py", line 198, in <module>
    main()
  File "train.py", line 153, in main
    for _, train_data in enumerate(train_loader):
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 582, in __next__
    return self._process_next_batch(batch)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 608, in _process_next_batch
    raise batch.exc_type(batch.exc_msg)
ValueError: Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 99, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 99, in <listcomp>
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/mnt/sdb/EDVR/codes/data/REDS_dataset.py", line 110, in __getitem__
    name_a, name_b = key.split('_')
ValueError: not enough values to unpack (expected 2, got 1)

Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/multiprocessing/popen_fork.py", line 28, in poll
    pid, sts = os.waitpid(self.pid, flag)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/utils/data/_utils/signal_handling.py", line 63, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 3588) is killed by signal: Terminated. 
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/multiprocessing/popen_fork.py", line 28, in poll
    pid, sts = os.waitpid(self.pid, flag)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/utils/data/_utils/signal_handling.py", line 63, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 3586) is killed by signal: Terminated. 
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/multiprocessing/popen_fork.py", line 28, in poll
    pid, sts = os.waitpid(self.pid, flag)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/utils/data/_utils/signal_handling.py", line 63, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 3581) is killed by signal: Terminated. 
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/multiprocessing/popen_fork.py", line 28, in poll
    pid, sts = os.waitpid(self.pid, flag)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/utils/data/_utils/signal_handling.py", line 63, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 3582) is killed by signal: Terminated. 
Traceback (most recent call last):
  File "train.py", line 198, in <module>
    main()
  File "train.py", line 153, in main
    for _, train_data in enumerate(train_loader):
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 582, in __next__
    return self._process_next_batch(batch)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 608, in _process_next_batch
    raise batch.exc_type(batch.exc_msg)
ValueError: Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 99, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 99, in <listcomp>
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/mnt/sdb/EDVR/codes/data/REDS_dataset.py", line 110, in __getitem__
    name_a, name_b = key.split('_')
ValueError: not enough values to unpack (expected 2, got 1)

Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/multiprocessing/popen_fork.py", line 28, in poll
    pid, sts = os.waitpid(self.pid, flag)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/utils/data/_utils/signal_handling.py", line 63, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 3596) is killed by signal: Terminated. 
Traceback (most recent call last):
  File "train.py", line 198, in <module>
    main()
  File "train.py", line 153, in main
    for _, train_data in enumerate(train_loader):
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 582, in __next__
    return self._process_next_batch(batch)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 608, in _process_next_batch
    raise batch.exc_type(batch.exc_msg)
ValueError: Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 99, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 99, in <listcomp>
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/mnt/sdb/EDVR/codes/data/REDS_dataset.py", line 110, in __getitem__
    name_a, name_b = key.split('_')
ValueError: not enough values to unpack (expected 2, got 1)

Traceback (most recent call last):
  File "train.py", line 198, in <module>
    main()
  File "train.py", line 153, in main
    for _, train_data in enumerate(train_loader):
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 582, in __next__
    return self._process_next_batch(batch)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 608, in _process_next_batch
    raise batch.exc_type(batch.exc_msg)
ValueError: Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 99, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 99, in <listcomp>
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/mnt/sdb/EDVR/codes/data/REDS_dataset.py", line 110, in __getitem__
    name_a, name_b = key.split('_')
ValueError: not enough values to unpack (expected 2, got 1)

Traceback (most recent call last):
  File "train.py", line 198, in <module>
    main()
  File "train.py", line 153, in main
    for _, train_data in enumerate(train_loader):
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 582, in __next__
    return self._process_next_batch(batch)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 608, in _process_next_batch
    raise batch.exc_type(batch.exc_msg)
ValueError: Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 99, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 99, in <listcomp>
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/mnt/sdb/EDVR/codes/data/REDS_dataset.py", line 110, in __getitem__
    name_a, name_b = key.split('_')
ValueError: not enough values to unpack (expected 2, got 1)

Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/multiprocessing/popen_fork.py", line 28, in poll
    pid, sts = os.waitpid(self.pid, flag)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/utils/data/_utils/signal_handling.py", line 63, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 3590) is killed by signal: Terminated. 
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/distributed/launch.py", line 235, in <module>
    main()
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/distributed/launch.py", line 231, in main
    cmd=process.args)
subprocess.CalledProcessError: Command '['/home/ubuntu/anaconda3/envs/pytorch_p36/bin/python', '-u', 'train.py', '--local_rank=0', '-opt', 'options/train/train_EDVR_woTSA_M.yml', '--launcher', 'pytorch']' returned non-zero exit status 1.
xinntao commented 4 years ago

It seems that the problem arises from the keys; you need to handle them carefully.
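For context, the line that crashes (`REDS_dataset.py`, line 110 in the traceback) unpacks each LMDB key into a clip name and a frame name, so keys must look like `000_00000000`. A minimal sketch reproducing the failure when a key is not in that form:

```python
# REDS_dataset.py expects LMDB keys of the form "<clip>_<frame>",
# e.g. "000_00000000". Anything without at least one underscore
# triggers the ValueError seen in the traceback above.
def parse_key(key):
    name_a, name_b = key.split('_')  # same unpacking as in REDS_dataset.py
    return name_a, name_b

print(parse_key('000_00000000'))  # ('000', '00000000')
try:
    parse_key('name')  # a dict field name, not a frame key
except ValueError as e:
    print(e)  # not enough values to unpack (expected 2, got 1)
```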

yushizhiyao commented 4 years ago

Hi, I have met the same problem. I think it is caused by a format mismatch between script versions. In ../codes/data/REDS_dataset.py (line 52), opt['cache_keys'] is expected to point at a pickle that stores a plain list of keys. However, the meta_info.pkl written by ../codes/data_scripts/create_lmdb_mp.py stores a dict like {name: 'REDS', resolution: '3_180_320', keys: [000_00000000, ...]}. This would also explain why only 3 train "images" are reported: the three dict fields are being treated as the key list. You can detect the difference between the original 'REDS_trainval_keys.pkl' and your own 'meta_info.pkl' with a simple test function. So, change either of them. @xinntao @alexisdrakopoulos
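The "simple test" suggested above could be a quick inspection of each pickle's top-level type, along these lines (the file paths are placeholders for your own files):

```python
import pickle

def describe_pickle(path):
    """Report whether a pickle holds a bare key list (old format)
    or the dict written by create_lmdb_mp.py (new format)."""
    with open(path, 'rb') as f:
        obj = pickle.load(f)
    if isinstance(obj, dict):
        return 'dict with fields: {}'.format(sorted(obj.keys()))
    return '{} with {} entries'.format(type(obj).__name__, len(obj))

# e.g. describe_pickle('meta_info.pkl')
# -> "dict with fields: ['keys', 'name', 'resolution']"
# while describe_pickle('REDS_trainval_keys.pkl')
# -> "list with N entries"
```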

xinntao commented 4 years ago

Thanks @yushizhiyao. The cache_keys indeed mismatch those generated with the LMDB. We have updated the code to read the cache_keys from the LMDB files.
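One way to make the loader tolerant of both pickle formats, sketched under the assumption that meta_info.pkl has the dict layout quoted above (this is an illustrative helper, not the repo's actual patch):

```python
import pickle

def load_keys(meta_info_path):
    """Return the frame key list, whether the pickle is the old
    bare list or the newer {'name', 'resolution', 'keys'} dict
    written by create_lmdb_mp.py."""
    with open(meta_info_path, 'rb') as f:
        meta = pickle.load(f)
    if isinstance(meta, dict):
        return meta['keys']
    return meta
```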