zhanggang001 / HEDNet

HEDNet (NeurIPS 2023) & SAFDNet (CVPR 2024 Oral)
Apache License 2.0

About Exceptions In Training #18

Closed JD0316 closed 1 week ago

JD0316 commented 3 weeks ago

When I train the model with a single GPU, I find that GPU utilization is only about 40%, and it is only about 50% with 2 GPUs. Is there any problem?

JD0316 commented 3 weeks ago

Another problem: when I train the model with 2 GPUs I get an error, and when I try to train again it cannot run. Here is the error information:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/jijiandu/.conda/envs/safd/lib/python3.10/multiprocessing/spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "/home/jijiandu/.conda/envs/safd/lib/python3.10/multiprocessing/spawn.py", line 126, in _main
    self = reduction.pickle.load(from_parent)
_pickle.UnpicklingError: pickle data was truncated
/home/jijiandu/.conda/envs/safd/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 134 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
/home/jijiandu/.conda/envs/safd/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 134 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

I then deleted all checkpoints and restarted training with 2 GPUs, but an exception appeared at epoch 1, iter 14112/15448. Here is the exception information:

Traceback (most recent call last):
  File "/home/jijiandu/keyan/HEDNet-main/tools/train.py", line 239, in <module>
    main()
  File "/home/jijiandu/keyan/HEDNet-main/tools/train.py", line 184, in main
    train_model(
  File "/home/jijiandu/keyan/HEDNet-main/tools/train_utils/train_utils.py", line 236, in train_model
    accumulated_iter = train_one_epoch(
  File "/home/jijiandu/keyan/HEDNet-main/tools/train_utils/train_utils.py", line 47, in train_one_epoch
    batch = next(dataloader_iter)
  File "/home/jijiandu/.conda/envs/safd/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 628, in __next__
    data = self._next_data()
  File "/home/jijiandu/.conda/envs/safd/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1333, in _next_data
    return self._process_data(data)
  File "/home/jijiandu/.conda/envs/safd/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1359, in _process_data
    data.reraise()
  File "/home/jijiandu/.conda/envs/safd/lib/python3.10/site-packages/torch/_utils.py", line 543, in reraise
    raise exception
FileNotFoundError: Caught FileNotFoundError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/jijiandu/.conda/envs/safd/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 302, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/jijiandu/.conda/envs/safd/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 58, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/jijiandu/.conda/envs/safd/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 58, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/jijiandu/keyan/HEDNet-main/tools/../pcdet/datasets/nuscenes/nuscenes_dataset.py", line 264, in __getitem__
    data_dict = self.prepare_data(data_dict=input_dict)
  File "/home/jijiandu/keyan/HEDNet-main/tools/../pcdet/datasets/dataset.py", line 224, in prepare_data
    data_dict = self.data_augmentor.forward(
  File "/home/jijiandu/keyan/HEDNet-main/tools/../pcdet/datasets/augmentor/data_augmentor.py", line 302, in forward
    data_dict = cur_augmentor(data_dict=data_dict)
  File "/home/jijiandu/keyan/HEDNet-main/tools/../pcdet/datasets/augmentor/database_sampler.py", line 627, in __call__
    data_dict = self.add_sampled_boxes_to_scene(
  File "/home/jijiandu/keyan/HEDNet-main/tools/../pcdet/datasets/augmentor/database_sampler.py", line 501, in add_sampled_boxes_to_scene
    gt_database_data = SharedArray.attach(f"shm://{self.gt_database_data_key}")
FileNotFoundError: [Errno 2] No such file or directory: 'shm://nuscenes_10sweeps_withvelo_lidar.npy'

Exception ignored in: <Finalize object, dead>
Traceback (most recent call last):
  File "/home/jijiandu/.conda/envs/safd/lib/python3.10/multiprocessing/util.py", line 224, in __call__
    res = self._callback(*self._args, **self._kwargs)
  File "/home/jijiandu/.conda/envs/safd/lib/python3.10/multiprocessing/synchronize.py", line 87, in _cleanup
    sem_unlink(name)
FileNotFoundError: [Errno 2] No such file or directory

If I change USE_SHARED_MEMORY from true to false, it works fine. Will this configuration affect the result?
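
For context, the failing call in the traceback, SharedArray.attach(f"shm://{self.gt_database_data_key}"), expects the ground-truth database to already be resident under /dev/shm. The sketch below is not code from this repository; it only uses the SharedArray package and the array name taken from the error message to check whether that array exists and to remove a stale copy left behind by a crashed run:

```python
import os

import SharedArray as sa

# Array name taken from the FileNotFoundError above.
SA_KEY = "nuscenes_10sweeps_withvelo_lidar.npy"

if os.path.exists(f"/dev/shm/{SA_KEY}"):
    # A previous (possibly crashed) run left the array in shared memory.
    arr = sa.attach(f"shm://{SA_KEY}")
    print(f"found {SA_KEY} in shared memory: shape={arr.shape}, dtype={arr.dtype}")
    # Remove it so the next training run can repopulate shared memory cleanly.
    sa.delete(f"shm://{SA_KEY}")
else:
    # Nothing resident: with USE_SHARED_MEMORY enabled, training expects the
    # gt database to be loaded here before the workers call SharedArray.attach(),
    # which is why attach() raises FileNotFoundError.
    print(f"{SA_KEY} is not in shared memory")
```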

Looking forward to your help, thanks!

zhanggang001 commented 1 week ago

When I train the model with a single GPU, I find that GPU utilization is only about 40%, and it is only about 50% with 2 GPUs. Is there any problem?

We have not tried training the model with 1 or 2 GPUs; all the models in the paper were trained with 8 GPUs.

zhanggang001 commented 1 week ago

If I change USE_SHARED_MEMORY from true to false, it works fine. Will this configuration affect the result?

If setting USE_SHARED_MEMORY to False works for you, you can keep it that way. In this case, do not set USE_SHARED_MEMORY to True when you generate the training labels, otherwise it will affect the generation of the gt_database.
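
To illustrate what the flag controls: with USE_SHARED_MEMORY enabled, the ground-truth database is loaded into shared memory once and each dataloader worker attaches to it; with it disabled, the data is read from disk instead. The sketch below is a simplified, hypothetical version of that create-once/attach-in-worker pattern (publish_gt_database and get_gt_database are placeholder names, not functions from this repository):

```python
import numpy as np
import SharedArray as sa

# Array name from the error message above.
SA_KEY = "nuscenes_10sweeps_withvelo_lidar.npy"


def publish_gt_database(points: np.ndarray) -> None:
    """Main process, USE_SHARED_MEMORY=True: copy the gt database into shared memory once."""
    shared = sa.create(f"shm://{SA_KEY}", points.shape, dtype=points.dtype)
    shared[...] = points
    shared.flags.writeable = False


def get_gt_database(use_shared_memory: bool, disk_path: str) -> np.ndarray:
    """Dataloader worker: attach to the shared array, or fall back to reading from disk."""
    if use_shared_memory:
        # Raises FileNotFoundError if the array was never published or was already cleaned up.
        return sa.attach(f"shm://{SA_KEY}")
    return np.load(disk_path)
```

Read this only as a sketch of the pattern; the actual database_sampler.py adds its own bookkeeping on top of the raw point array.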

JD0316 commented 1 week ago


Thank you for your help!