Closed: Mandylove1993 closed this issue 2 years ago.
I hit the same issue when doing multi-GPU training. You need to change the `-1` to the actual size of that dimension to fix it (see the sketch below). There are probably other places in the code with the same problem.
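For concreteness, a minimal sketch of the kind of change meant here. The failing line is `ry = i_temp.repeat(N, 1).view(N, -1, 3)` in `model/anno_encoder.py`; the values in `i_temp` below are illustrative placeholders, not necessarily the ones MonoFlex uses:

```python
import torch

N = 0  # number of ground-truth boxes on this GPU; can be 0 in multi-GPU training

# Illustrative 3x3 template (the real values live in model/anno_encoder.py).
i_temp = torch.tensor([[1.0, 0.0, 1.0],
                       [0.0, 1.0, 0.0],
                       [-1.0, 0.0, 1.0]])

# Original: fails when N == 0, because a 0-element tensor cannot
# disambiguate the inferred -1 dimension.
# ry = i_temp.repeat(N, 1).view(N, -1, 3)

# Fixed: spell out the dimension explicitly; works for any N, including 0.
ry = i_temp.repeat(N, 1).view(N, 3, 3)
print(ry.shape)  # torch.Size([0, 3, 3])
```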
Thanks, I will try that.
What can I do about this? I train on 4 GPUs with the argument `--num_gpus=4`. It trains and validates fine, but about an hour later this error occurred:

```
Traceback (most recent call last):
  File "tools/plain_train_net.py", line 161, in <module>
    args=(args,),
  File "/zft/code/MonoFlex/engine/launch.py", line 54, in launch
    daemon=False,
  File "/root/anaconda3/envs/monoflex/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 199, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/root/anaconda3/envs/monoflex/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
    while not context.join():
  File "/root/anaconda3/envs/monoflex/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 118, in join
    raise Exception(msg)
Exception:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/root/anaconda3/envs/monoflex/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/zft/code/MonoFlex/engine/launch.py", line 89, in _distributed_worker
    main_func(*args)
  File "/zft/code/MonoFlex/tools/plain_train_net.py", line 140, in main
    train(cfg, model, device, distributed)
  File "/zft/code/MonoFlex/tools/plain_train_net.py", line 84, in train
    arguments,
  File "/zft/code/MonoFlex/engine/trainer.py", line 109, in do_train
    loss_dict, log_loss_dict = model(images, targets)
  File "/root/anaconda3/envs/monoflex/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/root/anaconda3/envs/monoflex/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 619, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/root/anaconda3/envs/monoflex/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/zft/code/MonoFlex/model/detector.py", line 34, in forward
    loss_dict, log_loss_dict = self.heads(features, targets)
  File "/root/anaconda3/envs/monoflex/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/zft/code/MonoFlex/model/head/detector_head.py", line 21, in forward
    loss_dict, log_loss_dict = self.loss_evaluator(x, targets)
  File "/zft/code/MonoFlex/model/head/detector_loss.py", line 271, in __call__
    pred_targets, preds, reg_nums, weights = self.prepare_predictions(targets_variables, predictions)
  File "/zft/code/MonoFlex/model/head/detector_loss.py", line 153, in prepare_predictions
    target_corners_3D = self.anno_encoder.encode_box3d(target_rotys_3D, target_dimensions_3D, target_locations_3D)
  File "/zft/code/MonoFlex/model/anno_encoder.py", line 108, in encode_box3d
    ry = self.rad_to_matrix(rotys, N)
  File "/zft/code/MonoFlex/model/anno_encoder.py", line 60, in rad_to_matrix
    ry = i_temp.repeat(N, 1).view(N, -1, 3)
RuntimeError: cannot reshape tensor of 0 elements into shape [0, -1, 3] because the unspecified dimension size -1 can be any value and is ambiguous
```
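The failing reshape in `rad_to_matrix` gets `N = 0` when one rank receives a batch with no ground-truth objects, which is why this only surfaces mid-training on multi-GPU runs. Besides replacing the `-1` with an explicit size, the call site can be guarded against empty batches; a rough sketch, assuming (as the variable names suggest) that `encode_box3d` returns `(N, 8, 3)` box corners, which is an assumption about its output shape:

```python
# In prepare_predictions (model/head/detector_loss.py), before calling
# encode_box3d. The (0, 8, 3) shape is assumed, not taken from the source.
if target_rotys_3D.numel() == 0:
    # No objects on this rank: return an empty corner tensor instead of
    # letting view(N, -1, 3) fail on a 0-element tensor.
    target_corners_3D = target_rotys_3D.new_zeros((0, 8, 3))
else:
    target_corners_3D = self.anno_encoder.encode_box3d(
        target_rotys_3D, target_dimensions_3D, target_locations_3D)
```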