xingyizhou / CenterNet

Object detection, 3D detection, and pose estimation using center point detection:
MIT License

Dataloader crashes if num_worker>0 #566

Open qingnanli opened 4 years ago

qingnanli commented 4 years ago

Hi xingyizhou, thanks for sharing the code! I've run into a problem. With num_workers = 0, training on the KITTI dataset works fine. With num_workers > 0, however, training crashes:

Environment: Ubuntu 16.04, PyTorch 1.0.1.post2, Python 3.6

~/Downloads/qingqing_disk/p4600_disk/CenterNet/src/lib/trains/base_trainer.py(63)run_epoch()
 58         num_iters = len(data_loader) if opt.num_iters < 0 else opt.num_iters
 59         bar = Bar('{}/{}'.format(opt.task, opt.exp_id), max=num_iters)
 60         end = time.time()
 61         import pdb
 62         pdb.set_trace()
 63  ->     for iter_id, batch in enumerate(data_loader):
 64           if iter_id >= num_iters:

~/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py(819)__iter__()
818         def __iter__(self):
819  ->         return _DataLoaderIter(self)

~/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py(560)__init__()
557                 # it started, so that we do not call .join() if program dies
558                 # before it starts, and __del__ tries to join but will get:
559                 #     AssertionError: can only join a started process.
560  ->             w.start()
561                 self.index_queues.append(index_queue)
562                 self.workers.append(w)

~/anaconda3/lib/python3.6/multiprocessing/process.py(105)start()
102             assert not _current_process._config.get('daemon'), \
103                    'daemonic processes are not allowed to have children'
104             _cleanup()
105  ->         self._popen = self._Popen(self)
106             self._sentinel = self._popen.sentinel

~/anaconda3/lib/python3.6/multiprocessing/context.py(223)_Popen()
219     class Process(process.BaseProcess):
220         _start_method = None
221         @staticmethod
222         def _Popen(process_obj):
223  ->         return _default_context.get_context().Process._Popen(process_obj)

~/anaconda3/lib/python3.6/multiprocessing/context.py(277)_Popen()
272     class ForkProcess(process.BaseProcess):
273         _start_method = 'fork'
274         @staticmethod
275         def _Popen(process_obj):
276             from .popen_fork import Popen
277  ->         return Popen(process_obj)

~/anaconda3/lib/python3.6/multiprocessing/popen_fork.py(19)__init__()
 16         def __init__(self, process_obj):
 17             util._flush_std_streams()
 18             self.returncode = None
 19  ->         self._launch(process_obj)

~/anaconda3/lib/python3.6/multiprocessing/popen_fork.py(66)_launch()
 63         def _launch(self, process_obj):
 64             code = 1
 65             parent_r, child_w = os.pipe()
 66  ->         self.pid = os.fork()
 67             if self.pid == 0:
 68                 try:
 69                     os.close(parent_r)
 70                     if 'random' in sys.modules:
 71                         import random

Here, at self.pid = os.fork(), I can't step into os.fork() or press n to continue training. However, os.fork() seems to work fine in a terminal:

qingqing@qingqing-PowerEdge-T630:~$ python
Python 3.6.8 |Anaconda custom (64-bit)| (default, Dec 30 2018, 01:22:34)
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> os.fork()
23346
0
>>> >>>
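The two values printed after os.fork() are consistent with the call returning once in each process: the child's PID in the parent and 0 in the child. A minimal sketch of the same check as a script, assuming a Unix system; `fork_once` is a hypothetical helper name used only for illustration:

```python
import os


def fork_once():
    """Fork, let the child exit at once, and return (child_pid, exit_status)."""
    pid = os.fork()
    if pid == 0:
        # Child branch: os.fork() returned 0 here.
        # Exit immediately without running the parent's cleanup handlers.
        os._exit(0)
    # Parent branch: os.fork() returned the child's PID. Reap the child.
    _, status = os.waitpid(pid, 0)
    return pid, status


if __name__ == "__main__":
    child_pid, status = fork_once()
    print("child pid:", child_pid, "exit status:", status)
```

Unlike the interactive session, the child here exits right away, so only the parent continues and prints.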

My problem looks similar to https://github.com/pytorch/pytorch/issues/25302 (although that reporter is on Windows 10).

I'm stuck on this. Could you help me? Thanks!
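The traceback above ends inside multiprocessing's fork start method, so one way to narrow this down might be to exercise that same path without PyTorch or pdb. A minimal sketch, assuming Linux and the standard library only; `check_fork_workers` and `_worker` are hypothetical names, not part of CenterNet or PyTorch, and the worker loop only loosely imitates a DataLoader worker:

```python
import multiprocessing as mp
import queue


def _worker(index_queue, result_queue):
    # Echo doubled indices back until the None sentinel arrives.
    while True:
        idx = index_queue.get()
        if idx is None:
            break
        result_queue.put(idx * 2)


def check_fork_workers(num_workers=2, timeout=10.0):
    """Return True if every forked worker starts, answers, and exits in time."""
    ctx = mp.get_context("fork")  # same start method as in the traceback
    result_queue = ctx.Queue()
    index_queues, workers = [], []
    for i in range(num_workers):
        index_queue = ctx.Queue()
        w = ctx.Process(target=_worker, args=(index_queue, result_queue))
        w.daemon = True  # do not block interpreter exit if a worker hangs
        w.start()
        index_queue.put(i)     # one unit of work
        index_queue.put(None)  # sentinel: tell the worker to stop
        index_queues.append(index_queue)
        workers.append(w)
    try:
        results = sorted(result_queue.get(timeout=timeout) for _ in workers)
    except queue.Empty:
        return False  # at least one worker never answered
    for w in workers:
        w.join(timeout)
    all_exited = all(not w.is_alive() for w in workers)
    return all_exited and results == [2 * i for i in range(num_workers)]


if __name__ == "__main__":
    print("fork workers OK:", check_fork_workers())
```

If this runs cleanly outside the debugger, the forked-worker mechanism itself is probably working, and the remaining difference would be the pdb session or the training code.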

qingnanli commented 4 years ago

ddd/3dop |############################### | train: [5][305/309]|Tot: 0:11:30 |ETA: 0:00:10 |loss 2.1862 |hm_loss 0.5696 |dep_loss 0.1432 |dim_
ddd/3dop |############################### | train: [5][306/309]|Tot: 0:11:32 |ETA: 0:00:07 |loss 2.1857 |hm_loss 0.5695 |dep_loss 0.1429 |dim_
ddd/3dop |############################### | train: [5][307/309]|Tot: 0:11:35 |ETA: 0:00:05 |loss 2.1850 |hm_loss 0.5687 |dep_loss 0.1429 |dim_
ddd/3dop |################################| train: [5][308/309]|Tot: 0:11:37 |ETA: 0:00:03 |loss 2.1855 |hm_loss 0.5692 |dep_loss 0.1427 |dim_loss 0.0091 |rot_loss 1.4146 |wh_loss 0.3049 |off_loss 0.0194 |Data 0.352s(0.367s) |Net 2.257s
ddd/3dop

If num_workers = 0, training stops when the next epoch should begin.