SeongwoongCho opened 4 years ago
My machine has 5 vCPUs and 32 GB of memory.
When I run `lscpu` in the terminal, I get:
```
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              40
On-line CPU(s) list: 0-39
Thread(s) per core:  2
Core(s) per socket:  10
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) Silver 4210 CPU @ 2.20GHz
Stepping:            7
CPU MHz:             1000.649
CPU max MHz:         3200.0000
CPU min MHz:         1000.0000
BogoMIPS:            4400.00
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            1024K
L3 cache:            14080K
NUMA node0 CPU(s):   0-9,20-29
NUMA node1 CPU(s):   10-19,30-39
```
@zylo117 I think there may be a shared-memory leak somewhere in your code. My shared memory size is 126 GB, and my custom dataset's annotations are 1.4 GB for the train set and 0.6 GB for the validation set.
- num_gpus = 1, workers = 8: ok
- num_gpus = 2, workers = 8: ok (but slower than 1 GPU, with low GPU utilization)
- num_gpus = 2, workers >= 16: fail
- num_gpus >= 3, workers = 8: fail
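For reference, a quick way to see how much shared memory the DataLoader workers actually have available is a plain `df` check (assuming the shared memory is backed by `/dev/shm`, which is also the mount that tends to be tiny by default inside Docker containers):

```
# Show the size and usage of the shared-memory mount that worker processes
# use to pass batches back to the main process.
df -h /dev/shm
```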
No, not on mine. It's a common issue with PyTorch's DataLoader. For now, I would suggest using a smaller num_workers. Also, I'm not sure how you set shared memory to 126 GB when you only have 32 GB of physical memory; in the end you can only use at most 32 GB.
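For context, `num_workers` is just the argument passed to PyTorch's `DataLoader` (which, judging by the replies below, is what the `-n` flag of `train.py` feeds). A minimal sketch with a stand-in dataset, not this repo's actual dataset class:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in dataset; the real dataset here is built from COCO-style annotations.
dataset = TensorDataset(torch.randn(64, 3, 64, 64), torch.zeros(64, dtype=torch.long))

# num_workers=0 loads batches in the main process: slower, but it sidesteps the
# shared-memory / "worker killed" failure mode, since with num_workers > 0 each
# batch is handed between processes through shared memory.
loader = DataLoader(dataset, batch_size=8, shuffle=True, num_workers=0)

for images, labels in loader:
    pass  # training step would go here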
I am facing a similar problem! Unfortunately, I don't know where or how to set the number of workers to 0.
Could you kindly tell me how to do that? I am using Colab.
Train with `-n 0`.
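For example, using the same flags as the training command shared later in this thread, but with `-n 0` so the DataLoader runs in the main process (the project name, paths, and hyperparameters here are placeholders):

```
python train.py -c 0 -p myproject -n 0 --data_path ../datasets/ --batch_size 8 \
    --lr 1e-4 --num_epochs 10 --load_weights ./weights/efficientdet-d0.pth
```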
> Train with `-n 0`.
Thanks for your reply.
I am training on grayscale images. Do I need to make any changes to the mean and std in the project file (`<project-name>.yml`), shown below?

```yaml
mean: [0.485, 0.456, 0.406]
std: [0.229, 0.224, 0.225]
```
Try setting `mean` and `std` to `[0.5, 0.5, 0.5]`.
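That would make the normalization block of the project `.yml` look like the sketch below (only these two lines; the rest of the file is assumed unchanged):

```yaml
# [0.5, 0.5, 0.5] maps pixel values roughly into [-1, 1], which is a common
# choice for grayscale images replicated across 3 channels.
mean: [0.5, 0.5, 0.5]
std: [0.5, 0.5, 0.5]
```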
I run my code on a system with 6x RTX 2080 Ti:
```
python train.py -c 0 -p mydata -n 16 --data_path ../datasets/ --batch_size 64 --lr 1e-5 --num_epochs 20 --load_weights ./weights/efficientdet-d0.pth --head_only True --optim sgd
```
```
Traceback (most recent call last):
  File "/home/jovyan/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 761, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/usr/lib/python3.6/queue.py", line 173, in get
    self.not_empty.wait(remaining)
  File "/usr/lib/python3.6/threading.py", line 299, in wait
    gotit = waiter.acquire(True, timeout)
  File "/home/jovyan/.local/lib/python3.6/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 14293) is killed by signal: Killed.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "train.py", line 333, in <module>
    train(opt)
  File "train.py", line 218, in train
    for iter, data in enumerate(progress_bar):
  File "/home/jovyan/.local/lib/python3.6/site-packages/tqdm/std.py", line 1129, in __iter__
    for obj in iterable:
  File "/home/jovyan/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 345, in __next__
    data = self._next_data()
  File "/home/jovyan/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 841, in _next_data
    idx, data = self._get_data()
  File "/home/jovyan/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 798, in _get_data
    success, data = self._try_get_data()
  File "/home/jovyan/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 774, in _try_get_data
    raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str))
RuntimeError: DataLoader worker (pid(s) 14293) exited unexpectedly
```