Open qingnanli opened 4 years ago
ddd/3dop |############################### | train: [5][305/309]|Tot: 0:11:30 |ETA: 0:00:10 |loss 2.1862 |hm_loss 0.5696 |dep_loss 0.1432 |dim_
ddd/3dop |############################### | train: [5][306/309]|Tot: 0:11:32 |ETA: 0:00:07 |loss 2.1857 |hm_loss 0.5695 |dep_loss 0.1429 |dim_
ddd/3dop |############################### | train: [5][307/309]|Tot: 0:11:35 |ETA: 0:00:05 |loss 2.1850 |hm_loss 0.5687 |dep_loss 0.1429 |dim_
ddd/3dop |################################| train: [5][308/309]|Tot: 0:11:37 |ETA: 0:00:03 |loss 2.1855 |hm_loss 0.5692 |dep_loss 0.1427 |dim_loss 0.0091 |rot_loss 1.4146 |wh_loss 0.3049 |off_loss 0.0194 |Data 0.352s(0.367s) |Net 2.257s
ddd/3dop
If num_workers > 0, next epoch stops
Hi xingyizhou, thanks for sharing the code! I have run into a problem. With num_workers = 0, the network trains fine on the KITTI dataset. With num_workers > 0, however, training stops when the next epoch begins (pdb trace below).
Environment: Ubuntu 16.04, PyTorch 1.0.1.post2, Python 3.6
~/Downloads/qingqing_disk/p4600_disk/CenterNet/src/lib/trains/base_trainer.py(63)run_epoch()
58 num_iters = len(data_loader) if opt.num_iters < 0 else opt.num_iters
59 bar = Bar('{}/{}'.format(opt.task, opt.exp_id), max=num_iters)
60 end = time.time()
61 import pdb
62 pdb.set_trace()
63 -> for iter_id, batch in enumerate(data_loader):
64 if iter_id >= num_iters:
~/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py(818)__iter__()
818 def __iter__(self):
819 -> return _DataLoaderIter(self)
~/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py(560)__init__()
557 # it started, so that we do not call .join() if program dies
558 # before it starts, and __del__ tries to join but will get:
559 # AssertionError: can only join a started process.
560 -> w.start()
561 self.index_queues.append(index_queue)
562 self.workers.append(w)
~/anaconda3/lib/python3.6/multiprocessing/process.py(105)start()
102 assert not _current_process._config.get('daemon'), \
103 'daemonic processes are not allowed to have children'
104 _cleanup()
105 -> self._popen = self._Popen(self)
106 self._sentinel = self._popen.sentinel
~/anaconda3/lib/python3.6/multiprocessing/context.py(223)_Popen()
219 class Process(process.BaseProcess):
220 _start_method = None
221 @staticmethod
222 def _Popen(process_obj):
223 -> return _default_context.get_context().Process._Popen(process_obj)
~/anaconda3/lib/python3.6/multiprocessing/context.py(277)_Popen()
272 class ForkProcess(process.BaseProcess):
273 _start_method = 'fork'
274 @staticmethod
275 def _Popen(process_obj):
276 from .popen_fork import Popen
277 -> return Popen(process_obj)
~/anaconda3/lib/python3.6/multiprocessing/popen_fork.py(19)__init__()
16 def __init__(self, process_obj):
17 util._flush_std_streams()
18 self.returncode = None
19 -> self._launch(process_obj)
~/anaconda3/lib/python3.6/multiprocessing/popen_fork.py(66)_launch()
63 def _launch(self, process_obj):
64 code = 1
65 parent_r, child_w = os.pipe()
66 -> self.pid = os.fork()
67 if self.pid == 0:
68 try:
69 os.close(parent_r)
70 if 'random' in sys.modules:
71 import random
At self.pid = os.fork(), the debugger gets stuck: I can neither step into os.fork() nor press n to move on and continue training. However, os.fork() itself seems to work in a plain terminal session:
qingqing@qingqing-PowerEdge-T630:~$ python
Python 3.6.8 |Anaconda custom (64-bit)| (default, Dec 30 2018, 01:22:34)
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> os.fork()
23346
0
>>> >>>
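The REPL check above only exercises os.fork() in a fresh interpreter. A closer check, sketched below, starts a child through multiprocessing.Process, which is the path shown in the trace (process.py, context.py, popen_fork.py). The helper name child_starts and the timeout value are assumptions made for illustration; they are not part of CenterNet or PyTorch.

```python
import multiprocessing as mp
import queue


def _worker(q):
    # Runs in the child process: report back that the child came up.
    q.put("alive")


def child_starts(method="fork", timeout=10.0):
    """Return True if a child started with `method` replies within `timeout` seconds.

    Hypothetical helper. If Process.start() itself blocks (as os.fork()
    appears to in the trace), this call never returns, which is also a signal.
    """
    ctx = mp.get_context(method)
    q = ctx.Queue()
    proc = ctx.Process(target=_worker, args=(q,))
    proc.daemon = True
    proc.start()
    try:
        reply = q.get(timeout=timeout)
    except queue.Empty:
        reply = None
    proc.join(timeout)
    if proc.is_alive():
        # The child never finished; clean it up so the check can exit.
        proc.terminate()
    return reply == "alive"


if __name__ == "__main__":
    print("fork ok:", child_starts("fork"))
```

Running this once in a bare interpreter and once right before the training loop (after the same imports as the training script) may help narrow down whether the stall depends on what the parent process has already loaded.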
My problem looks similar to https://github.com/pytorch/pytorch/issues/25302 (though that reporter is on Windows 10).
I'm stuck. Could you help me? Thanks!
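When pdb cannot step past a blocking call, one option is to let the standard-library faulthandler module print the stack of every thread after a timeout. The sketch below wraps that in a context manager; the name hang_watchdog and the default of 60 seconds are assumptions for illustration. It only reports on the process it runs in, so it would show where the parent is stuck, not what a worker process is doing.

```python
import faulthandler
import sys
from contextlib import contextmanager


@contextmanager
def hang_watchdog(seconds=60, file=sys.stderr):
    """Dump all thread tracebacks to `file` if the block runs longer than `seconds`.

    Hypothetical helper; `file` must be backed by a real file descriptor.
    """
    faulthandler.dump_traceback_later(seconds, repeat=False, file=file, exit=False)
    try:
        yield
    finally:
        # Disarm the timer once the block finishes (or raises).
        faulthandler.cancel_dump_traceback_later()


# Possible usage around the loop from the trace above (names taken from the trace):
# with hang_watchdog(seconds=120):
#     for iter_id, batch in enumerate(data_loader):
#         ...
```

If the loop stalls, the dump begins with a "Timeout" header followed by the current call stack, without needing an interactive debugger.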