zhoubolei / TRN-pytorch

Temporal Relation Networks
http://relation.csail.mit.edu/

RuntimeError: CUDA error: misaligned address #88

Open xiongsiheng opened 4 years ago

xiongsiheng commented 4 years ago

Hi everyone! I ran into a strange bug that has confused me for several days. Sometimes training hits this error after dozens of epochs (for example 40, 80, or 100); sometimes it does not occur at all. When I resume from a checkpoint saved before the error, it may or may not appear again. Does anyone know what could cause this? Any reply will be appreciated.

When I use py36+torch1.4+cuda10.0, it shows:

Traceback (most recent call last):
  File "/xxx/xxx/xxx/anaconda3/envs/ASL3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/xxx/xxx/xxx/anaconda3/envs/ASL3/lib/python3.6/site-packages/torch/nn/modules/container.py", line 100, in forward
    input = module(input)
  File "/xxx/xxx/xxx/anaconda3/envs/ASL3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/xxx/xxx/xxx/anaconda3/envs/ASL3/lib/python3.6/site-packages/torch/nn/modules/linear.py", line 87, in forward
    return F.linear(input, self.weight, self.bias)
  File "/xxx/xxx/xxx/anaconda3/envs/ASL3/lib/python3.6/site-packages/torch/nn/functional.py", line 1370, in linear
    ret = torch.addmm(bias, input, weight.t())
RuntimeError: CUDA error: misaligned address

terminate called after throwing an instance of 'c10::Error'
  what(): CUDA error: misaligned address (insert_events at /opt/conda/conda-bld/pytorch_1579027003190/work/c10/cuda/CUDACachingAllocator.cpp:764)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x47 (0x7f172f1a1627 in /xxx/xxx/xxx/anaconda3/envs/ASL3/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: + 0x1ab04 (0x7f172f3e1b04 in /xxx/xxx/xxx/anaconda3/envs/ASL3/lib/python3.6/site-packages/torch/lib/libc10_cuda.so)
frame #2: + 0x1cbd1 (0x7f172f3e3bd1 in /xxx/xxx/xxx/anaconda3/envs/ASL3/lib/python3.6/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0x4d (0x7f172f18eb9d in /xxx/xxx/xxx/anaconda3/envs/ASL3/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #4: + 0x6871fa (0x7f17606161fa in /xxx/xxx/xxx/anaconda3/envs/ASL3/lib/python3.6/site-packages/torch/lib/libtorch_python.so)

frame #20: __libc_start_main + 0xe7 (0x7f1772122b97 in /lib/x86_64-linux-gnu/libc.so.6)

**When I use py35+torch0.4+cuda9.0, it shows:**

Traceback (most recent call last):
  File "main.py", line 329, in <module>
    main()
  File "main.py", line 128, in main
    train(train_loader, model, criterion, optimizer, epoch, log_training)
  File "main.py", line 170, in train
    output = model(input_var)
  File "/xxx/xxx/xxx/anaconda3/envs/TRN/lib/python3.5/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/xxx/xxx/xxx/anaconda3/envs/TRN/lib/python3.5/site-packages/torch/nn/parallel/data_parallel.py", line 123, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/xxx/xxx/xxx/anaconda3/envs/TRN/lib/python3.5/site-packages/torch/nn/parallel/data_parallel.py", line 133, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/xxx/xxx/xxx/anaconda3/envs/TRN/lib/python3.5/site-packages/torch/nn/parallel/parallel_apply.py", line 77, in parallel_apply
    raise output
  File "/xxx/xxx/xxx/anaconda3/envs/TRN/lib/python3.5/site-packages/torch/nn/parallel/parallel_apply.py", line 53, in _worker
    output = module(*input, **kwargs)
  File "/xxx/xxx/xxx/anaconda3/envs/TRN/lib/python3.5/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/xxx/xxx/xxx/project/TRN-pytorch/models.py", line 220, in forward
    base_out = self.base_model(input.view((-1, sample_len) + input.size()[-2:]))
  File "/xxx/xxx/xxx/anaconda3/envs/TRN/lib/python3.5/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/xxx/xxx/xxx/project/TRN-pytorch/model_zoo/bninception/pytorch_load.py", line 57, in forward
    data_dict[op[2]] = torch.cat(tuple(data_dict[x] for x in op[-1]), 1)
RuntimeError: cuda runtime error (74) : misaligned address at /opt/conda/conda-bld/pytorch_1533672544752/work/aten/src/THC/THCCachingHostAllocator.cpp:271

terminate called after throwing an instance of 'at::Error'
  what(): CUDA error: invalid device pointer (CudaCachingDeleter at /opt/conda/conda-bld/pytorch_1533672544752/work/aten/src/THC/THCCachingAllocator.cpp:498)
frame #0: THStorage_free + 0x44 (0x7fc7bba51a04 in /xxx/xxx/xxx/anaconda3/envs/TRN/lib/python3.5/site-packages/torch/lib/libcaffe2.so)
frame #1: THTensor_free + 0x2f (0x7fc7bbaff66f in /xxx/xxx/xxx/anaconda3/envs/TRN/lib/python3.5/site-packages/torch/lib/libcaffe2.so)
frame #2: at::CUDAFloatTensor::~CUDAFloatTensor() + 0x9 (0x7fc7a64ac609 in /xxx/xxx/xxx/anaconda3/envs/TRN/lib/python3.5/site-packages/torch/lib/libcaffe2_gpu.so)
frame #3: torch::autograd::Variable::Impl::~Impl() + 0x1f7 (0x7fc7bd6c62d7 in /xxx/xxx/xxx/anaconda3/envs/TRN/lib/python3.5/site-packages/torch/_C.cpython-35m-x86_64-linux-gnu.so)
frame #4: torch::autograd::Variable::Impl::~Impl() + 0x9 (0x7fc7bd6c6429 in /xxx/xxx/xxx/anaconda3/envs/TRN/lib/python3.5/site-packages/torch/_C.cpython-35m-x86_64-linux-gnu.so)
frame #5: + 0x6e8a44 (0x7fc7bd6dda44 in /xxx/xxx/xxx/anaconda3/envs/TRN/lib/python3.5/site-packages/torch/_C.cpython-35m-x86_64-linux-gnu.so)
frame #6: + 0x6e8b24 (0x7fc7bd6ddb24 in /xxx/xxx/xxx/anaconda3/envs/TRN/lib/python3.5/site-packages/torch/_C.cpython-35m-x86_64-linux-gnu.so)
frame #23: __libc_start_main + 0xe7 (0x7fc7cec3bb97 in /lib/x86_64-linux-gnu/libc.so.6)
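
A note on reading the two traces: CUDA kernels are launched asynchronously, so the Python line that reports the error (`torch.addmm` in one run, `torch.cat` in the other) is not necessarily the operation that caused it. A possible way to narrow this down is to rerun with `CUDA_LAUNCH_BLOCKING=1`, which forces synchronous launches. The sketch below is only an illustration, not something from this repository: the helper names `blocking_env` and `run_blocking` are hypothetical, and the command passed in is whatever you normally use to start training.

```python
import os
import subprocess
import sys


def blocking_env(base=None):
    """Return a copy of an environment with synchronous CUDA launches enabled.

    `base` defaults to the current process environment; it is never mutated.
    """
    env = dict(os.environ if base is None else base)
    # With CUDA_LAUNCH_BLOCKING=1 each kernel launch waits for completion,
    # so a failure should surface at the call that triggered it.
    # Training runs noticeably slower in this mode.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def run_blocking(argv):
    """Run a command (hypothetical argv list) under the debug environment.

    Returns the exit code of the child process.
    """
    return subprocess.run(argv, env=blocking_env()).returncode
```

For example, `run_blocking([sys.executable, "main.py", ...])` with the usual training arguments in place of `...`. Since the error only shows up after many epochs, this may take a long time to reproduce, and it may not reproduce at all.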