DataLoader crashes if num_workers > 0

Hi xingyizhou, thanks for sharing the code! I've run into a problem. With num_workers = 0, we can train the network on the KITTI dataset without issues. However, with num_workers > 0, our training crashes:

Environment: Ubuntu 16.04, PyTorch 1.0.1.post2, Python 3.6

~/Downloads/qingqing_disk/p4600_disk/CenterNet/src/lib/trains/base_trainer.py(63)run_epoch()
     58     num_iters = len(data_loader) if opt.num_iters < 0 else opt.num_iters
     59     bar = Bar('{}/{}'.format(opt.task, opt.exp_id), max=num_iters)
     60     end = time.time()
     61     import pdb
     62     pdb.set_trace()
     63  -> for iter_id, batch in enumerate(data_loader):
     64       if iter_id >= num_iters:

~/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py(818)__iter__()
    818     def __iter__(self):
    819  ->     return _DataLoaderIter(self)

~/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py(560)__init__()
    557     # it started, so that we do not call .join() if program dies
    558     # before it starts, and __del__ tries to join but will get:
    559     #     AssertionError: can only join a started process.
    560  -> w.start()
    561     self.index_queues.append(index_queue)
    562     self.workers.append(w)

~/anaconda3/lib/python3.6/multiprocessing/process.py(105)start()
    102     assert not _current_process._config.get('daemon'), \
    103            'daemonic processes are not allowed to have children'
    104     _cleanup()
    105  -> self._popen = self._Popen(self)
    106     self._sentinel = self._popen.sentinel

~/anaconda3/lib/python3.6/multiprocessing/context.py(223)_Popen()
    219     class Process(process.BaseProcess):
    220         _start_method = None
    221         @staticmethod
    222         def _Popen(process_obj):
    223  ->         return _default_context.get_context().Process._Popen(process_obj)

~/anaconda3/lib/python3.6/multiprocessing/context.py(277)_Popen()
    272     class ForkProcess(process.BaseProcess):
    273         _start_method = 'fork'
    274         @staticmethod
    275         def _Popen(process_obj):
    276             from .popen_fork import Popen
    277  ->         return Popen(process_obj)

~/anaconda3/lib/python3.6/multiprocessing/popen_fork.py(19)__init__()
     16     def __init__(self, process_obj):
     17         util._flush_std_streams()
     18         self.returncode = None
     19  ->     self._launch(process_obj)

~/anaconda3/lib/python3.6/multiprocessing/popen_fork.py(66)_launch()
     63     def _launch(self, process_obj):
     64         code = 1
     65         parent_r, child_w = os.pipe()
     66  ->     self.pid = os.fork()
     67         if self.pid == 0:
     68             try:
     69                 os.close(parent_r)
     70                 if 'random' in sys.modules:
     71                     import random

At the line self.pid = os.fork(), I can't step into os.fork(), and pressing n does not continue the training either. However, os.fork() seems to work fine in a terminal:

qingqing@qingqing-PowerEdge-T630:~$ python
Python 3.6.8 |Anaconda custom (64-bit)| (default, Dec 30 2018, 01:22:34)
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> os.fork()
23346
0
>>>
>>>
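To check fork-based worker startup outside the training script, a standalone test along these lines might help. It is only a minimal sketch using the standard library, not the DataLoader code itself: it starts one child process with the 'fork' start method (the same ForkProcess path that appears in the trace) and waits for it to report back. The names fork_worker_starts and _worker are made up for this check, and the 10-second timeout is an arbitrary assumption.

```python
import multiprocessing as mp
import queue


def _worker(result_queue):
    # Runs in the child process: report back that the child actually started.
    result_queue.put("started")


def fork_worker_starts(timeout=10.0):
    """Return True if a forked child starts and reports back within `timeout` seconds."""
    # 'fork' is the start method seen in the trace (ForkProcess); Unix only.
    ctx = mp.get_context("fork")
    result_queue = ctx.Queue()
    proc = ctx.Process(target=_worker, args=(result_queue,))
    proc.daemon = True
    proc.start()
    try:
        message = result_queue.get(timeout=timeout)
    except queue.Empty:
        # The child never reported back within the timeout.
        message = None
    proc.join(timeout)
    if proc.is_alive():
        # Clean up a child that is still hanging around.
        proc.terminate()
    return message == "started"


if __name__ == "__main__":
    print("forked worker started:", fork_worker_starts())
```

If this returns True when run as a plain script but the training still stops at w.start(), the problem is probably specific to how the training process is launched (for example, running under pdb) rather than to fork in general. That is a guess, not something confirmed here.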

My problem looks similar to https://github.com/pytorch/pytorch/issues/25302, although that reporter is on Windows 10.

I'm stuck on this. Could you help me? Thanks!

xingyizhou / CenterNet

DataLoader crashes if num_workers > 0 #566