Closed: Mandylove1993 closed this issue 2 years ago.
I hit the same issue when doing multi-GPU training. You need to change the `-1` to the actual size of that dimension to fix it (see the sketch below). There are probably other places in the code with the same problem.
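For concreteness, a minimal sketch of the kind of change meant here. The failing line is `ry = i_temp.repeat(N, 1).view(N, -1, 3)` in `model/anno_encoder.py`; the values in `i_temp` below are illustrative placeholders, not necessarily the ones MonoFlex uses:

```python
import torch

N = 0  # number of ground-truth boxes on this GPU; can be 0 in multi-GPU training

# Illustrative 3x3 template (the real values live in model/anno_encoder.py).
i_temp = torch.tensor([[1.0, 0.0, 1.0],
                       [0.0, 1.0, 0.0],
                       [-1.0, 0.0, 1.0]])

# Original: fails when N == 0, because a 0-element tensor cannot
# disambiguate the inferred -1 dimension.
# ry = i_temp.repeat(N, 1).view(N, -1, 3)

# Fixed: spell out the dimension explicitly; works for any N, including 0.
ry = i_temp.repeat(N, 1).view(N, 3, 3)
print(ry.shape)  # torch.Size([0, 3, 3])
```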
Thanks, I will try that.
What can I do about this? I train on 4 GPUs with the argument `--num_gpus=4`. It trains and validates fine, but about an hour later this error occurred:

```
Traceback (most recent call last):
  File "tools/plain_train_net.py", line 161, in <module>
    args=(args,),
  File "/zft/code/MonoFlex/engine/launch.py", line 54, in launch
    daemon=False,
  File "/root/anaconda3/envs/monoflex/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 199, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/root/anaconda3/envs/monoflex/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
    while not context.join():
  File "/root/anaconda3/envs/monoflex/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 118, in join
    raise Exception(msg)
Exception:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/root/anaconda3/envs/monoflex/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/zft/code/MonoFlex/engine/launch.py", line 89, in _distributed_worker
    main_func(*args)
  File "/zft/code/MonoFlex/tools/plain_train_net.py", line 140, in main
    train(cfg, model, device, distributed)
  File "/zft/code/MonoFlex/tools/plain_train_net.py", line 84, in train
    arguments,
  File "/zft/code/MonoFlex/engine/trainer.py", line 109, in do_train
    loss_dict, log_loss_dict = model(images, targets)
  File "/root/anaconda3/envs/monoflex/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/root/anaconda3/envs/monoflex/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 619, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/root/anaconda3/envs/monoflex/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/zft/code/MonoFlex/model/detector.py", line 34, in forward
    loss_dict, log_loss_dict = self.heads(features, targets)
  File "/root/anaconda3/envs/monoflex/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/zft/code/MonoFlex/model/head/detector_head.py", line 21, in forward
    loss_dict, log_loss_dict = self.loss_evaluator(x, targets)
  File "/zft/code/MonoFlex/model/head/detector_loss.py", line 271, in __call__
    pred_targets, preds, reg_nums, weights = self.prepare_predictions(targets_variables, predictions)
  File "/zft/code/MonoFlex/model/head/detector_loss.py", line 153, in prepare_predictions
    target_corners_3D = self.anno_encoder.encode_box3d(target_rotys_3D, target_dimensions_3D, target_locations_3D)
  File "/zft/code/MonoFlex/model/anno_encoder.py", line 108, in encode_box3d
    ry = self.rad_to_matrix(rotys, N)
  File "/zft/code/MonoFlex/model/anno_encoder.py", line 60, in rad_to_matrix
    ry = i_temp.repeat(N, 1).view(N, -1, 3)
RuntimeError: cannot reshape tensor of 0 elements into shape [0, -1, 3] because the unspecified dimension size -1 can be any value and is ambiguous
```
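The failing reshape in `rad_to_matrix` gets `N = 0` when one rank receives a batch with no ground-truth objects, which is why this only surfaces mid-training on multi-GPU runs. Besides replacing the `-1` with an explicit size, the call site can be guarded against empty batches; a rough sketch, assuming (as the variable names suggest) that `encode_box3d` returns `(N, 8, 3)` box corners, which is an assumption about its output shape:

```python
# In prepare_predictions (model/head/detector_loss.py), before calling
# encode_box3d. The (0, 8, 3) shape is assumed, not taken from the source.
if target_rotys_3D.numel() == 0:
    # No objects on this rank: return an empty corner tensor instead of
    # letting view(N, -1, 3) fail on a 0-element tensor.
    target_corners_3D = target_rotys_3D.new_zeros((0, 8, 3))
else:
    target_corners_3D = self.anno_encoder.encode_box3d(
        target_rotys_3D, target_dimensions_3D, target_locations_3D)
```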