Run on different GPU? Training images get corrupted during training?

Dear @xinntao,

I have two questions regarding running the EDVR model.

I realized that every time I submit a job to a different GPU (even one of the same type, e.g. Titan X), I have to run rm -r build/ and python setup.py develop again; otherwise I get "error in modulated_deformable_im2col_cuda; no kernel image is available for execution on the device". I suspect that it has something to do with dynamic installation? Right now, I have to keep three copies of the same repo in order to run 3 jobs simultaneously. Is this the way to go, or is there a better option? Following one of the suggestions in another post, I use PyTorch 1.4 and torchvision 0.5 with cudatoolkit 10.1.
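For reference, here is a minimal sketch of one approach I am considering instead of keeping three copies. It is not something confirmed for this repo. It assumes the "no kernel image" error means the extension was compiled only for the compute capability of the GPU present at build time, and that the installed PyTorch honors the TORCH_CUDA_ARCH_LIST environment variable for setup.py builds (to be verified for 1.4). The capability values below are examples only; the real ones can be read on each node with torch.cuda.get_device_capability(). The helper name arch_list is hypothetical.

```python
def arch_list(capabilities):
    """Format (major, minor) compute capabilities as a TORCH_CUDA_ARCH_LIST value.

    Duplicates are dropped and entries are sorted, so the same set of GPUs
    always yields the same string.
    """
    unique = sorted(set(capabilities))
    return ";".join(f"{major}.{minor}" for major, minor in unique)


# Example values only (assumption): two GPU generations in one cluster.
value = arch_list([(6, 1), (5, 2), (6, 1)])
print(f"export TORCH_CUDA_ARCH_LIST='{value}'")
```

The idea would be to export the printed variable once, remove build/, and run python setup.py develop a single time, so that one copy of the repo contains kernels for every GPU the jobs may land on.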

I always get the following error, and sometimes even an explicit PNG CRC error, when cv2.imdecode() returns None. I then found that the training PNGs had somehow become corrupted, even though I verified all images before training. Have you encountered this problem before? Is it related to multi-process data loading? It happens every time, especially when I turn off TSA and set the frame count to 1.


Traceback (most recent call last):
  File "basicsr/train.py", line 252, in <module>
    main()
  File "basicsr/train.py", line 234, in main
    train_data = prefetcher.next()
  File "/scratch_net/biwidl216/huangsha/BasicSR_1/basicsr/data/prefetch_dataloader.py", line 76, in next
    return next(self.loader)
  File "/itet-stor/huangsha/net_scratch/conda_envs/test/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 345, in __next__
    data = self._next_data()
  File "/itet-stor/huangsha/net_scratch/conda_envs/test/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 838, in _next_data
    return self._process_data(data)
  File "/itet-stor/huangsha/net_scratch/conda_envs/test/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 881, in _process_data
    data.reraise()
  File "/itet-stor/huangsha/net_scratch/conda_envs/test/lib/python3.7/site-packages/torch/_utils.py", line 394, in reraise
    raise self.exc_type(msg)
AttributeError: Caught AttributeError in DataLoader worker process 1.
Original Traceback (most recent call last):
  File "/itet-stor/huangsha/net_scratch/conda_envs/test/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 178, in _worker_loop
    data = fetcher.fetch(index)
  File "/itet-stor/huangsha/net_scratch/conda_envs/test/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/itet-stor/huangsha/net_scratch/conda_envs/test/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/scratch_net/biwidl216/huangsha/BasicSR_1/basicsr/data/moving_cityscape_dataset.py", line 147, in __getitem__
    img_gt = imfrombytes(img_bytes, float32=True)
  File "/scratch_net/biwidl216/huangsha/BasicSR_1/basicsr/utils/img_util.py", line 125, in imfrombytes
    img = img.astype(np.float32) / 255.
AttributeError: 'NoneType' object has no attribute 'astype'

/scratch/slurm/spool/job219938/slurm_script: line 31: 16770 Bus error python -u basicsr/train.py -opt options/train/EDVR/train_EDVR_DARK_20_frame_window_1_patch_64.yml
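To make the "verified all images before training" step concrete, here is a rough sketch of the kind of check I mean, using only the Python standard library. It is an illustration, not the exact script I ran. It walks the PNG chunk structure and compares each stored CRC with one recomputed by zlib.crc32. It does not decompress the pixel data, so it catches damaged or truncated chunks but not every possible decoding problem. The function name png_is_intact is hypothetical.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_is_intact(data):
    """Return True if all chunks up to and including IEND have valid CRCs."""
    if not data.startswith(PNG_SIGNATURE):
        return False
    pos = len(PNG_SIGNATURE)
    # Each chunk is: 4-byte length, 4-byte type, payload, 4-byte CRC.
    while pos + 12 <= len(data):
        length, ctype = struct.unpack(">I4s", data[pos:pos + 8])
        end = pos + 8 + length  # index just past the payload
        if end + 4 > len(data):
            return False  # chunk is truncated
        (stored_crc,) = struct.unpack(">I", data[end:end + 4])
        # The CRC covers the chunk type and payload, not the length field.
        if zlib.crc32(data[pos + 4:end]) & 0xFFFFFFFF != stored_crc:
            return False
        if ctype == b"IEND":
            return True
        pos = end + 4
    return False  # ran out of data before reaching IEND
```

Calling png_is_intact(open(path, "rb").read()) on every training image before a run, and again on the same files after a failure, should show whether the files on disk actually changed or whether the bytes were damaged somewhere between storage and the DataLoader worker.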


Thank you very much for your help :) and best wishes!
