xingyizhou / ExtremeNet

Bottom-up Object Detection by Grouping Extreme and Center Points
BSD 3-Clause "New" or "Revised" License

Train with my dataset Runtime error: Device index must be -1 or non-negative #45

Closed EmilCreatePro closed 4 years ago

EmilCreatePro commented 4 years ago

Hello @xingyizhou, I am trying to train this network with my own dataset and I keep getting a `Device index must be -1 or non-negative` error (traceback below):

```
Traceback (most recent call last):
  File "train.py", line 225, in <module>
    train(training_dbs, None, args.start_iter, args.debug)
  File "train.py", line 159, in train
    training_loss = nnet.train(*training)
  File "/content/ExtremeNet/nnet/py_factory.py", line 83, in train
    loss = self.network(xs, ys)
  File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/content/ExtremeNet/models/py_utils/data_parallel.py", line 66, in forward
    inputs, kwargs = self.scatter(inputs, kwargs, self.device_ids, self.chunk_sizes)
  File "/content/ExtremeNet/models/py_utils/data_parallel.py", line 77, in scatter
    return scatter_kwargs(inputs, kwargs, device_ids, dim=self.dim, chunk_sizes=self.chunk_sizes)
  File "/content/ExtremeNet/models/py_utils/scatter_gather.py", line 30, in scatter_kwargs
    inputs = scatter(inputs, target_gpus, dim, chunk_sizes) if inputs else []
  File "/content/ExtremeNet/models/py_utils/scatter_gather.py", line 25, in scatter
    return scatter_map(inputs)
  File "/content/ExtremeNet/models/py_utils/scatter_gather.py", line 18, in scatter_map
    return list(zip(*map(scatter_map, obj)))
  File "/content/ExtremeNet/models/py_utils/scatter_gather.py", line 20, in scatter_map
    return list(map(list, zip(*map(scatter_map, obj))))
  File "/content/ExtremeNet/models/py_utils/scatter_gather.py", line 15, in scatter_map
    return Scatter.apply(target_gpus, chunk_sizes, dim, obj)
  File "/usr/local/lib/python3.7/site-packages/torch/nn/parallel/_functions.py", line 89, in forward
    outputs = comm.scatter(input, target_gpus, chunk_sizes, ctx.dim, streams)
  File "/usr/local/lib/python3.7/site-packages/torch/cuda/comm.py", line 148, in scatter
    return tuple(torch._C._scatter(tensor, devices, chunk_sizes, dim, streams))
RuntimeError: Device index must be -1 or non-negative, got -14913 (Device at /pytorch/c10/Device.h:40)
frame #0: std::function<std::string ()>::operator()() const + 0x11 (0x7f4cef83c021 in /usr/local/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x2a (0x7f4cef83b8ea in /usr/local/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #2: <unknown function> + 0x10ceca (0x7f4d29b74eca in /usr/local/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #3: torch::cuda::scatter(at::Tensor const&, c10::ArrayRef<long>, c10::optional<std::vector<long, std::allocator<long> > > const&, long, c10::optional<std::vector<c10::optional<at::cuda::CUDAStream>, std::allocator<c10::optional<at::cuda::CUDAStream> > > > const&) + 0x2dc (0x7f4d29f4faac in /usr/local/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #4: <unknown function> + 0x4ed28f (0x7f4d29f5528f in /usr/local/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #5: <unknown function> + 0x11663e (0x7f4d29b7e63e in /usr/local/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #13: THPFunction_apply(_object*, _object*) + 0x5a1 (0x7f4d29d7b961 in /usr/local/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
```

**Do you have some sort of explanation for why this happens?** If I run demo.py I get no errors, so I guess the environment is set up correctly :(
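A note for anyone debugging the same crash: the nonsense device index (-14913) comes out of the repo's custom scatter (models/py_utils/scatter_gather.py), which splits each batch across GPUs according to the chunk_sizes setting. A plausible first check is that the config values are consistent with the visible GPUs. The sketch below assumes a CornerNet-style config layout (`batch_size` and `chunk_sizes` under a `"system"` section, and the path `config/ExtremeNet.json`); your setup may differ:

```python
import json

import torch

# Hypothetical sanity check: the custom DataParallel expects
# sum(chunk_sizes) == batch_size and at most one chunk per visible GPU.
with open("config/ExtremeNet.json") as f:
    cfg = json.load(f)["system"]

print("visible GPUs:", torch.cuda.device_count())
print("batch_size  :", cfg["batch_size"])
print("chunk_sizes :", cfg["chunk_sizes"])

assert sum(cfg["chunk_sizes"]) == cfg["batch_size"]
assert len(cfg["chunk_sizes"]) <= torch.cuda.device_count()
```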
EmilCreatePro commented 4 years ago

I managed to fix this by going into nnet/py_factory.py and changing the imports:

Before: `from models.py_utils.data_parallel import DataParallel`

After: `from torch.nn import DataParallel`

I also had to remove the `chunk_sizes` parameter from the `DataParallel` call :)
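For reference, the whole change in nnet/py_factory.py presumably looks something like this (a sketch: the constructor line and the `system_configs` name are assumed from the CornerNet-derived code, not copied from this repo):

```python
# nnet/py_factory.py -- sketch of the edit

# Before: the repo's custom wrapper, which scatters each batch according
# to the per-GPU chunk_sizes from the config (assumed constructor line):
#   from models.py_utils.data_parallel import DataParallel
#   self.network = DataParallel(self.network, chunk_sizes=system_configs.chunk_sizes)

# After: PyTorch's stock wrapper, which takes no chunk_sizes argument
# and splits the batch evenly across the visible GPUs:
from torch.nn import DataParallel

self.network = DataParallel(self.network)
```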

I don't know how much this changes performance, though; I couldn't find another solution.
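For what it's worth, the practical difference is probably small on identical GPUs: `torch.nn.DataParallel` always splits the batch as evenly as `torch.chunk` does, whereas the custom wrapper let the config give each GPU a different share (e.g. a smaller chunk on the GPU that also accumulates gradients). A tiny illustration of the even split (the batch shape is hypothetical):

```python
import torch

batch = torch.randn(12, 3, 511, 511)  # hypothetical batch of 511x511 crops

# torch.nn.DataParallel scatters along dim 0 as evenly as torch.chunk;
# there is no way to request an uneven split such as [4, 8]:
even = torch.chunk(batch, 2, dim=0)
print([c.shape[0] for c in even])  # [6, 6]
```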