Closed JD0316 closed 1 week ago
Another problem: when I train the model with 2 GPUs I get an error, and when I try to train again it can't run. Here is the error information:
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/home/jijiandu/.conda/envs/safd/lib/python3.10/multiprocessing/spawn.py", line 116, in spawn_main
exitcode = _main(fd, parent_sentinel)
File "/home/jijiandu/.conda/envs/safd/lib/python3.10/multiprocessing/spawn.py", line 126, in _main
self = reduction.pickle.load(from_parent)
_pickle.UnpicklingError: pickle data was truncated
/home/jijiandu/.conda/envs/safd/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 134 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
/home/jijiandu/.conda/envs/safd/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 134 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
Then I deleted all checkpoints and restarted training with 2 GPUs. An exception appeared at epoch: 1 iter: 14112/15448. Here is the exception information:
Traceback (most recent call last):
File "/home/jijiandu/keyan/HEDNet-main/tools/train.py", line 239, in <module>
main()
File "/home/jijiandu/keyan/HEDNet-main/tools/train.py", line 184, in main
train_model(
File "/home/jijiandu/keyan/HEDNet-main/tools/train_utils/train_utils.py", line 236, in train_model
accumulated_iter = train_one_epoch(
File "/home/jijiandu/keyan/HEDNet-main/tools/train_utils/train_utils.py", line 47, in train_one_epoch
batch = next(dataloader_iter)
File "/home/jijiandu/.conda/envs/safd/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 628, in __next__
data = self._next_data()
File "/home/jijiandu/.conda/envs/safd/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1333, in _next_data
return self._process_data(data)
File "/home/jijiandu/.conda/envs/safd/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1359, in _process_data
data.reraise()
File "/home/jijiandu/.conda/envs/safd/lib/python3.10/site-packages/torch/_utils.py", line 543, in reraise
raise exception
FileNotFoundError: Caught FileNotFoundError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/home/jijiandu/.conda/envs/safd/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 302, in _worker_loop
data = fetcher.fetch(index)
File "/home/jijiandu/.conda/envs/safd/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 58, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/jijiandu/.conda/envs/safd/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 58, in <listcomp>
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/jijiandu/keyan/HEDNet-main/tools/../pcdet/datasets/nuscenes/nuscenes_dataset.py", line 264, in __getitem__
data_dict = self.prepare_data(data_dict=input_dict)
File "/home/jijiandu/keyan/HEDNet-main/tools/../pcdet/datasets/dataset.py", line 224, in prepare_data
data_dict = self.data_augmentor.forward(
File "/home/jijiandu/keyan/HEDNet-main/tools/../pcdet/datasets/augmentor/data_augmentor.py", line 302, in forward
data_dict = cur_augmentor(data_dict=data_dict)
File "/home/jijiandu/keyan/HEDNet-main/tools/../pcdet/datasets/augmentor/database_sampler.py", line 627, in __call__
data_dict = self.add_sampled_boxes_to_scene(
File "/home/jijiandu/keyan/HEDNet-main/tools/../pcdet/datasets/augmentor/database_sampler.py", line 501, in add_sampled_boxes_to_scene
gt_database_data = SharedArray.attach(f"shm://{self.gt_database_data_key}")
FileNotFoundError: [Errno 2] No such file or directory: 'shm://nuscenes_10sweeps_withvelo_lidar.npy'
Exception ignored in: <Finalize object, dead>
Traceback (most recent call last):
File "/home/jijiandu/.conda/envs/safd/lib/python3.10/multiprocessing/util.py", line 224, in __call__
res = self._callback(*self._args, **self._kwargs)
File "/home/jijiandu/.conda/envs/safd/lib/python3.10/multiprocessing/synchronize.py", line 87, in _cleanup
sem_unlink(name)
FileNotFoundError: [Errno 2] No such file or directory
If I change USE_SHARED_MEMORY from true to false, it works fine. Will this configuration affect the result?
Looking forward to your help, thanks!
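For reference, the FileNotFoundError above means SharedArray could not find the gt-database array in shared memory (on Linux, the "shm://" name is typically backed by an entry under /dev/shm, which disappears after a crash cleanup or a reboot). Below is a minimal, hypothetical diagnostic sketch, not HEDNet code; the key is taken from the error message, while the on-disk path and the attach_or_load helper are assumptions:

```python
# Hypothetical diagnostic (not part of HEDNet): check whether the gt-database
# array that database_sampler.py tries to attach is actually present in shared
# memory, and fall back to an on-disk .npy if it is not.
import os
import numpy as np
import SharedArray as sa

SHM_KEY = "nuscenes_10sweeps_withvelo_lidar.npy"   # key from the error message
NPY_PATH = "data/nuscenes/nuscenes_10sweeps_withvelo_lidar.npy"  # assumed on-disk copy

def attach_or_load(shm_key: str, npy_path: str) -> np.ndarray:
    """Attach to the shared-memory copy if it exists, otherwise load from disk."""
    # On Linux, SharedArray typically backs "shm://<key>" with /dev/shm/<key>,
    # so a missing entry there is what raises the FileNotFoundError in the traceback.
    if os.path.exists(os.path.join("/dev/shm", shm_key)):
        return sa.attach(f"shm://{shm_key}")
    print(f"shm://{shm_key} not found; loading {npy_path} from disk instead")
    return np.load(npy_path)

if __name__ == "__main__":
    gt_database_data = attach_or_load(SHM_KEY, NPY_PATH)
    print(gt_database_data.shape, gt_database_data.dtype)
```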
When I train the model with one GPU, I find that GPU utilization is only about 40%, and it is only about 50% with 2 GPUs. Is there a problem?
We have not tried to train the model with 1 or 2 GPUs. All the models in the paper were trained with 8 GPUs.
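As a generic diagnostic (an assumption, not HEDNet code), low GPU utilization usually means each step spends most of its time waiting on the input pipeline rather than on the model. A minimal sketch for separating the two, assuming a model that returns a scalar loss:

```python
# Hypothetical profiling loop: measure time spent waiting for the dataloader
# versus time spent in forward/backward on the GPU. A large data-wait share
# suggests raising the dataloader's num_workers rather than changing the model.
import time
import torch

def profile_loop(dataloader, model, optimizer, num_steps=50):
    data_time, compute_time = 0.0, 0.0
    it = iter(dataloader)                      # assumes at least num_steps batches
    for _ in range(num_steps):
        t0 = time.time()
        batch = next(it)                       # time spent waiting for data
        data_time += time.time() - t0

        t1 = time.time()
        loss = model(batch)                    # assumed: model returns a scalar loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        torch.cuda.synchronize()               # make GPU work show up in wall-clock time
        compute_time += time.time() - t1

    total = data_time + compute_time
    print(f"data wait: {100 * data_time / total:.1f}% of step time")
    print(f"compute:   {100 * compute_time / total:.1f}% of step time")
```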
If setting USE_SHARED_MEMORY to False works for you, you can do that. In that case, you should not set USE_SHARED_MEMORY to True when you generate the training labels; otherwise it will affect the generation of the gt_database.
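For illustration, here is a simplified sketch of what USE_SHARED_MEMORY appears to control in an OpenPCDet-style sampler (the global_data_offset and path keys and the 5-feature point layout are assumptions, not verified against HEDNet's code):

```python
# Simplified sketch (assumed OpenPCDet-style behaviour, not the actual
# database_sampler.py code): with shared memory the object's points are sliced
# out of one big preloaded array; without it they are read from disk each time.
import numpy as np

def load_gt_points(info, use_shared_memory, gt_database_data=None, root_path="data/nuscenes"):
    if use_shared_memory:
        # Slice this object's points out of the array the traceback failed to attach.
        start, end = info["global_data_offset"]
        return gt_database_data[start:end].copy()
    # USE_SHARED_MEMORY: False -> read the per-object point file from disk.
    # 5 features per point (x, y, z, intensity, timestamp) is assumed for
    # the nuScenes 10-sweep database.
    return np.fromfile(f"{root_path}/{info['path']}", dtype=np.float32).reshape(-1, 5)
```

Under that assumption, either path is meant to yield the same points for the same sampled box, which is consistent with the answer above that only the gt_database generation step is sensitive to the flag.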
Thank you for your help!