zhulf0804 / GCNet

Leveraging Inlier Correspondences Proportion for Point Cloud Registration. https://arxiv.org/abs/2201.12094.
MIT License

ERROR: Unexpected segmentation fault encountered in worker. #5

Closed ttsesm closed 2 years ago

ttsesm commented 2 years ago

Any idea how to fix the following error:

eval_mvp_rg.py --data_root datasets/mvp_rg --checkpoint weights/mvp_rg.pth --vis
[20 31 34]
  0%|          | 0/1200 [00:00<?, ?it/s]ERROR: Unexpected segmentation fault encountered in worker.
 ERROR: Unexpected segmentation fault encountered in worker.
 ERROR: Unexpected segmentation fault encountered in worker.
 ERROR: Unexpected segmentation fault encountered in worker.
  0%|          | 0/1200 [00:17<?, ?it/s]
Traceback (most recent call last):
  File "/home/ttsesm/Development/NgeNet/venv_ngenet/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 986, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/usr/lib/python3.9/multiprocessing/queues.py", line 113, in get
    if not self._poll(timeout):
  File "/usr/lib/python3.9/multiprocessing/connection.py", line 262, in poll
    return self._poll(timeout)
  File "/usr/lib/python3.9/multiprocessing/connection.py", line 429, in _poll
    r = wait([self], timeout)
  File "/usr/lib/python3.9/multiprocessing/connection.py", line 936, in wait
    ready = selector.select(timeout)
  File "/usr/lib/python3.9/selectors.py", line 416, in select
    fd_event_list = self._selector.poll(timeout)
  File "/home/ttsesm/Development/NgeNet/venv_ngenet/lib/python3.9/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 2845706) is killed by signal: Segmentation fault. 

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/ttsesm/Development/NgeNet/eval_mvp_rg.py", line 174, in <module>
    main(args)
  File "/home/ttsesm/Development/NgeNet/eval_mvp_rg.py", line 53, in main
    for pair_ind, inputs in enumerate(tqdm(test_dataloader)):
  File "/home/ttsesm/Development/NgeNet/venv_ngenet/lib/python3.9/site-packages/tqdm/std.py", line 1180, in __iter__
    for obj in iterable:
  File "/home/ttsesm/Development/NgeNet/venv_ngenet/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 517, in __next__
    data = self._next_data()
  File "/home/ttsesm/Development/NgeNet/venv_ngenet/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1182, in _next_data
    idx, data = self._get_data()
  File "/home/ttsesm/Development/NgeNet/venv_ngenet/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1148, in _get_data
    success, data = self._try_get_data()
  File "/home/ttsesm/Development/NgeNet/venv_ngenet/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 999, in _try_get_data
    raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
RuntimeError: DataLoader worker (pid(s) 2845706) exited unexpectedly

Process finished with exit code 1

Thanks.

zhulf0804 commented 2 years ago

Thanks for your interest.

I guess it's a problem related to the Open3D version.

Which version of Open3D are you using? Open3D v0.10.0.0 is recommended, so maybe you can try that version.
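
As a quick way to confirm which versions the virtualenv actually resolves to (a generic check, not specific to this repo):

```python
import open3d as o3d
import torch

# Mismatched Open3D / PyTorch builds are a common cause of DataLoader
# worker segfaults, so check what is actually installed in the environment.
print("open3d:", o3d.__version__)
print("torch:", torch.__version__, "| CUDA:", torch.version.cuda)
```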

Another thing you can try is to set `config.num_workers = 0` in https://github.com/zhulf0804/NgeNet/blob/d4917f22e55195132ec6fc602554102d321ce4b5/eval_mvp_rg.py#L26
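
For reference, a minimal sketch of what that looks like on the PyTorch side (`test_dataset` and `collate_fn` are placeholders for whatever eval_mvp_rg.py builds; the point is only the `num_workers=0` argument):

```python
from torch.utils.data import DataLoader

# num_workers=0 loads batches in the main process, which avoids the worker
# segfault at the cost of slower iteration over the 1200 test pairs.
test_dataloader = DataLoader(test_dataset,
                             batch_size=1,
                             shuffle=False,
                             num_workers=0,
                             collate_fn=collate_fn)
```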

Best regards.

ttsesm commented 2 years ago

Yes, you were right. I installed the recommended package versions and the problem was resolved. The only difference was that I had to upgrade torch to 1.9.0, since with 1.8.1 I was getting the following error: `RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasSgemmStridedBatched`, which apparently was resolved in the newer version.
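
For anyone hitting the same CUBLAS error, a quick sanity check after upgrading torch is to run a small batched matmul on the GPU, since that exercises the same cuBLAS batched GEMM path named in the error (a generic check, not part of this repo):

```python
import torch

# torch.bmm on CUDA dispatches to cuBLAS batched GEMM, the routine named
# in the CUBLAS_STATUS_EXECUTION_FAILED error, so this confirms the
# upgraded build works on the current GPU.
a = torch.randn(4, 8, 8, device="cuda")
b = torch.randn(4, 8, 8, device="cuda")
print(torch.bmm(a, b).shape)  # expected: torch.Size([4, 8, 8])
```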

Thanks.