xinntao / EDVR

Winning Solution in NTIRE19 Challenges on Video Restoration and Enhancement (CVPR19 Workshops) - Video Restoration with Enhanced Deformable Convolutional Networks. EDVR has been merged into BasicSR and this repo is a mirror of BasicSR.

run on different gpu? training images get corrupted during training? #187

Open Shanshan-Huang opened 3 years ago

Shanshan-Huang commented 3 years ago

Dear @xinntao,

I have two questions regarding running the EDVR model.

  1. Every time I submit a job to a different GPU (even one of the same type, e.g. Titan X), I have to run rm build/ and python develop again; otherwise I get an error in modulated_deformable_im2col_cuda: no kernel image is available for execution on the device. I suspect it has something to do with dynamic installation? Right now I keep three copies of the same repo so that I can run three jobs simultaneously. Is that the way to go, or is there a better option? Following a suggestion in another post, I am using pytorch 1.4 and torchvision 0.5 with cudatoolkit 10.1.

  2. I always get the following error, and sometimes even an explicit PNG CRC error, when cv2.imdecode() returns None. The training PNGs somehow end up corrupted even though I verified all images before training. Have you encountered this problem before? Is it related to multi-process data loading? It happens every time, especially when I turn off TSA and set the number of frames to 1.

    Traceback (most recent call last):
    File "basicsr/", line 252, in <module>
    File "basicsr/", line 234, in main
    train_data =
    File "/scratch_net/biwidl216/huangsha/BasicSR_1/basicsr/data/", line 76, in next
    return next(self.loader)
    File "/itet-stor/huangsha/net_scratch/conda_envs/test/lib/python3.7/site-packages/torch/utils/data/", line 345, in __next__
    data = self._next_data()
    File "/itet-stor/huangsha/net_scratch/conda_envs/test/lib/python3.7/site-packages/torch/utils/data/", line 838, in _next_data
    return self._process_data(data)
    File "/itet-stor/huangsha/net_scratch/conda_envs/test/lib/python3.7/site-packages/torch/utils/data/", line 881, in _process_data
    File "/itet-stor/huangsha/net_scratch/conda_envs/test/lib/python3.7/site-packages/torch/", line 394, in reraise
    raise self.exc_type(msg)
    AttributeError: Caught AttributeError in DataLoader worker process 1.
    Original Traceback (most recent call last):
    File "/itet-stor/huangsha/net_scratch/conda_envs/test/lib/python3.7/site-packages/torch/utils/data/_utils/", line 178, in _worker_loop
    data = fetcher.fetch(index)
    File "/itet-stor/huangsha/net_scratch/conda_envs/test/lib/python3.7/site-packages/torch/utils/data/_utils/", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
    File "/itet-stor/huangsha/net_scratch/conda_envs/test/lib/python3.7/site-packages/torch/utils/data/_utils/", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
    File "/scratch_net/biwidl216/huangsha/BasicSR_1/basicsr/data/", line 147, in __getitem__
    img_gt = imfrombytes(img_bytes, float32=True)
    File "/scratch_net/biwidl216/huangsha/BasicSR_1/basicsr/utils/", line 125, in imfrombytes
    img = img.astype(np.float32) / 255.
    AttributeError: 'NoneType' object has no attribute 'astype'
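
To narrow down whether the bytes reaching the worker are already damaged, one option is to check the PNG chunk checksums directly, without decoding any pixels. The following is only a sketch using the Python standard library; `png_crc_ok` is a hypothetical helper name and is not part of BasicSR. It relies on the PNG rule that every chunk stores a CRC-32 computed over its type and data fields.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_crc_ok(data: bytes) -> bool:
    """Return True if `data` is a PNG stream whose chunks all pass their CRC-32 check.

    Hypothetical helper for debugging; it validates chunk integrity only and
    does not decode the image.
    """
    if not data.startswith(PNG_SIGNATURE):
        return False
    pos = len(PNG_SIGNATURE)
    # Each chunk is: 4-byte length, 4-byte type, `length` data bytes, 4-byte CRC.
    while pos + 12 <= len(data):
        (length,) = struct.unpack(">I", data[pos:pos + 4])
        chunk_type = data[pos + 4:pos + 8]
        end = pos + 8 + length
        if end + 4 > len(data):
            return False  # chunk runs past the end of the buffer (truncated read)
        (stored_crc,) = struct.unpack(">I", data[end:end + 4])
        # The CRC covers the chunk type and chunk data, not the length field.
        if zlib.crc32(chunk_type + data[pos + 8:end]) & 0xFFFFFFFF != stored_crc:
            return False
        if chunk_type == b"IEND":
            return True
        pos = end + 4
    return False  # ran out of bytes before reaching IEND
```

Running such a check on the same `img_bytes` that are passed to `imfrombytes` whenever decoding fails, and comparing the result with the check done before training, would show whether the file on disk has changed or whether only that particular read returned bad data.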

/scratch/slurm/spool/job219938/slurm_script: line 31: 16770 Bus error python -u basicsr/ -opt options/train/EDVR/train_EDVR_DARK_20_frame_window_1_patch_64.yml
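
Regarding the first question, "no kernel image is available" usually means the compiled extension contains no binary for the compute capability of the GPU it is running on. Cards sold under the same name can differ here; as far as I know, the Maxwell Titan X reports 5.2 while the Pascal Titan X reports 6.1. Below is a small sketch for comparing nodes. It assumes the extension is built through PyTorch's cpp_extension tooling, which reads the `TORCH_CUDA_ARCH_LIST` environment variable at build time; the helper names are hypothetical.

```python
def format_arch_list(capabilities):
    """Turn (major, minor) pairs into a TORCH_CUDA_ARCH_LIST value such as '5.2;6.1'."""
    unique = sorted(set(capabilities))
    return ";".join(f"{major}.{minor}" for major, minor in unique)


def local_capabilities():
    """Compute capabilities of the GPUs visible to this process (empty if none)."""
    try:
        import torch
    except ImportError:
        return []
    if not torch.cuda.is_available():
        return []
    return [torch.cuda.get_device_capability(i)
            for i in range(torch.cuda.device_count())]


if __name__ == "__main__":
    caps = local_capabilities()
    print("Compute capabilities on this node:", caps)
    if caps:
        # Collect this value from every node type and build once with the union.
        print("TORCH_CUDA_ARCH_LIST=" + format_arch_list(caps))
```

If the nodes do report different capabilities, setting `TORCH_CUDA_ARCH_LIST` to the combined list before rebuilding might allow a single copy of the repo to serve all jobs, though I have not verified this on this setup.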

Thank you very much for your help :) and best wishes!