xiuqhou / Salience-DETR

[CVPR 2024] Official implementation of the paper "Salience DETR: Enhancing Detection Transformer with Hierarchical Salience Filtering Refinement"
https://arxiv.org/abs/2403.16131
Apache License 2.0

dist._broadcast_coalesced( #26

Open ssli23 opened 3 months ago

ssli23 commented 3 months ago

Question

Modify resume_from_checkpoint to 'checkpoints/salience_detr_resnet50_800_1333_coco_2x.pth'

When I use the dual-card training command CUDA_VISIBLE_DEVICES=0,1 accelerate launch main.py, I get:

RuntimeError: Tensors must be CUDA and dense
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 96188 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 96189) of binary: /home/kb535/anaconda3/envs/salience_detr/bin/python

But when I use the single-card training command CUDA_VISIBLE_DEVICES=0 accelerate launch main.py, no errors occur and training runs normally.

Additional

No response

ssli23 commented 3 months ago

When resume_from_checkpoint=None and I run CUDA_VISIBLE_DEVICES=0,1 accelerate launch main.py, it's okay and training is normal.

xiuqhou commented 3 months ago

Hi @ssli23, can you provide the full error output from the script?

This may happen when you use multiple GPUs while some tensors are still on the CPU, but we don't know which operation caused the error. You can trace back to where the error occurred and print the tensor to see which device it is on.
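For reference, a minimal sketch of such a check (a hypothetical helper, not part of the repository), assuming `model` is the detection model built in main.py before it is wrapped for distributed training:

import torch

def report_non_cuda_or_sparse(module: torch.nn.Module) -> None:
    # dist._broadcast_coalesced only accepts dense CUDA tensors, so list
    # every parameter/buffer that is still on the CPU or stored sparsely.
    for name, tensor in list(module.named_parameters()) + list(module.named_buffers()):
        if (not tensor.is_cuda) or tensor.is_sparse:
            print(f"{name}: device={tensor.device}, layout={tensor.layout}")

# e.g. call it right before the training loop:
# report_non_cuda_or_sparse(model)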

ssli23 commented 3 months ago

> Hi @ssli23, can you provide the full error output from the script?
>
> This may happen when you use multiple GPUs while some tensors are still on the CPU, but we don't know which operation caused the error. You can trace back to where the error occurred and print the tensor to see which device it is on.

The following values were not passed to `accelerate launch` and had defaults used instead:
        --num_processes was set to a value of 2
                More than one GPU was found, enabling multi-GPU training.
                If this was unintended please pass in `--num_processes=1`.
        --num_machines was set to a value of 1
        --mixed_precision was set to a value of 'no'
        --dynamo_backend was set to a value of 'no'
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
loading annotations into memory...
loading annotations into memory...
Done (t=0.00s)
creating index...
Done (t=0.00s)
creating index...
index created!
index created!
loading annotations into memory...
loading annotations into memory...
Done (t=0.00s)
creating index...
index created!
Done (t=0.00s)
creating index...
index created!
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 1
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 0
INFO:torch.distributed.distributed_c10d:Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
INFO:torch.distributed.distributed_c10d:Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
[2024-07-22 15:56:03 det.util.misc]: Rank of current process: 0, World size: 2
Using [0, 0.5, 0.6299605249474366, 0.7937005259840997, 1.0, 1.2599210498948732, 1.5874010519681994, 2.0, inf] as bins for aspect ratio quantization
Count of instances per bin: [866]
Using /home/kb535/.cache/torch_extensions/py38_cu113 as PyTorch extensions root...
[2024-07-22 15:56:04 det.util.misc]: Environment info:


sys.platform                      linux
Python                            3.8.19 (default, Mar 20 2024, 19:58:24) [GCC 11.2.0]
numpy                             1.24.3
PyTorch                           1.11.0 @/home/kb535/anaconda3/envs/salience_detr/lib/python3.8/site-packages/torch
PyTorch debug build               False
torch._C._GLIBCXX_USE_CXX11_ABI   False
GPU available                     Yes
GPU 0,1                           NVIDIA GeForce RTX 3090 (arch=8.6)
Driver version                    535.183.01
CUDA_HOME                         /usr/local/cuda
Pillow                            10.4.0
torchvision                       0.12.0 @/home/kb535/anaconda3/envs/salience_detr/lib/python3.8/site-packages/torchvision
torchvision arch flags            3.5, 5.0, 6.0, 7.0, 7.5, 8.0, 8.6
fvcore                            0.1.5.post20220414
iopath                            0.1.9
cv2                               4.10.0


PyTorch built with:

[2024-07-22 15:56:04 det.util.misc]: Command line arguments: Namespace(accumulate_steps=1, config_file='configs/train_config.py', dynamo_backend='no', mixed_precision=None, seed=None, use_deterministic_algorithms=False)
[2024-07-22 15:56:04 det.util.misc]: Contents of args.config_file=configs/train_config.py:

from torch import optim

from datasets.coco import CocoDetection
from transforms import presets
from optimizer import param_dict

# Commonly changed training configurations

num_epochs = 24     # train epochs
batch_size = 4      # total_batch_size = #GPU x batch_size
num_workers = 4     # workers for pytorch DataLoader
pin_memory = True   # whether pin_memory for pytorch DataLoader
print_freq = 50     # frequency to print logs
starting_epoch = 0
max_norm = 0.1      # clip gradient norm

output_dir = None               # path to save checkpoints, default for None: checkpoints/{model_name}
find_unused_parameters = False  # useful for debugging distributed training

# define dataset for train

coco_path = "/home/kb535/lss/data/200png_T2/purui_coco"  # /PATH/TO/YOUR/COCODIR
train_transform = presets.detr  # see transforms/presets to choose a transform
train_dataset = CocoDetection(
    img_folder=f"{coco_path}/train2014",
    ann_file=f"{coco_path}/annotations/instances_train2014.json",
    transforms=train_transform,
    train=True,
)
test_dataset = CocoDetection(
    img_folder=f"{coco_path}/val2014",
    ann_file=f"{coco_path}/annotations/instances_val2014.json",
    transforms=None,  # the eval_transform is integrated in the model
)

# model config to train

model_path = "configs/salience_detr/salience_detr_resnet50_800_1333.py"

# specify a checkpoint folder to resume, or a pretrained ".pth" to finetune, for example:

# checkpoints/salience_detr_resnet50_800_1333/train/2024-03-22-09_38_50

# checkpoints/salience_detr_resnet50_800_1333/train/2024-03-22-09_38_50/best_ap.pth

# resume_from_checkpoint = None

resume_from_checkpoint = '/home/kb535/lss/codes/objection/Salience-DETR/checkpoints/salience_detr_resnet50_800_1333_coco_2x.pth'

learning_rate = 1e-4  # initial learning rate
optimizer = optim.AdamW(lr=learning_rate, weight_decay=1e-4, betas=(0.9, 0.999))
lr_scheduler = optim.lr_scheduler.MultiStepLR(milestones=[10], gamma=0.1)

# This define parameter groups with different learning rate

param_dicts = param_dict.finetune_backbone_and_linear_projection(lr=learning_rate)

[2024-07-22 15:56:04 det.util.misc]: Using the random seed: 4521984
Using [0, 0.5, 0.6299605249474366, 0.7937005259840997, 1.0, 1.2599210498948732, 1.5874010519681994, 2.0, inf] as bins for aspect ratio quantization
Count of instances per bin: [866]
Using /home/kb535/.cache/torch_extensions/py38_cu113 as PyTorch extensions root...
Loading extension module MultiScaleDeformableAttention...
WARNING [2024-07-22 15:56:04 py.warnings]: /home/kb535/lss/codes/objection/Salience-DETR/models/bricks/ms_deform_attn.py:24: UserWarning: Failed to load MultiScaleDeformableAttention C++ extension: /home/kb535/.cache/torch_extensions/py38_cu113/MultiScaleDeformableAttention/MultiScaleDeformableAttention.so: cannot open shared object file: No such file or directory
  warnings.warn(f"Failed to load MultiScaleDeformableAttention C++ extension: {e}")

[2024-07-22 15:56:04 det.models.backbones.base_backbone]: Backbone architecture: resnet50
[2024-07-22 15:56:04 det.util.utils]:
WARNING [2024-07-22 15:56:05 det.util.utils]: The model and loaded state dict do not match exactly
WARNING [2024-07-22 15:56:05 det.util.utils]: Size mismatch keys: transformer.encoder.enhance_mcsp.weight, transformer.encoder.enhance_mcsp.bias, transformer.decoder.class_head.0.weight, transformer.decoder.class_head.0.bias, transformer.decoder.class_head.1.weight, transformer.decoder.class_head.1.bias, transformer.decoder.class_head.2.weight, transformer.decoder.class_head.2.bias, transformer.decoder.class_head.3.weight, transformer.decoder.class_head.3.bias, transformer.decoder.class_head.4.weight, transformer.decoder.class_head.4.bias, transformer.decoder.class_head.5.weight, transformer.decoder.class_head.5.bias, transformer.encoder_class_head.weight, transformer.encoder_class_head.bias, denoising_generator.label_encoder.weight
+------------------------------------------+----------------+---------------------+
|                 key name                 | shape in model | shape in state dict |
+------------------------------------------+----------------+---------------------+
| transformer.encoder.enhance_mcsp.weight  | (2, 256)       | (91, 256)           |
| transformer.encoder.enhance_mcsp.bias    | (2,)           | (91,)               |
| transformer.decoder.class_head.0.weight  | (2, 256)       | (91, 256)           |
| transformer.decoder.class_head.0.bias    | (2,)           | (91,)               |
| transformer.decoder.class_head.1.weight  | (2, 256)       | (91, 256)           |
| transformer.decoder.class_head.1.bias    | (2,)           | (91,)               |
| transformer.decoder.class_head.2.weight  | (2, 256)       | (91, 256)           |
| transformer.decoder.class_head.2.bias    | (2,)           | (91,)               |
| transformer.decoder.class_head.3.weight  | (2, 256)       | (91, 256)           |
| transformer.decoder.class_head.3.bias    | (2,)           | (91,)               |
| transformer.decoder.class_head.4.weight  | (2, 256)       | (91, 256)           |
| transformer.decoder.class_head.4.bias    | (2,)           | (91,)               |
| transformer.decoder.class_head.5.weight  | (2, 256)       | (91, 256)           |
| transformer.decoder.class_head.5.bias    | (2,)           | (91,)               |
| transformer.encoder_class_head.weight    | (2, 256)       | (91, 256)           |
| transformer.encoder_class_head.bias      | (2,)           | (91,)               |
+------------------------------------------+----------------+---------------------+
| denoising_generator.label_encoder.weight | (2, 256)       | (91, 256)           |
+------------------------------------------+----------------+---------------------+

[2024-07-22 15:56:05 det.main]: load pretrained from /home/kb535/lss/codes/objection/Salience-DETR/checkpoints/salience_detr_resnet50_800_1333_coco_2x.pth, output_dir is checkpoints/salience_detr_resnet50_800_1333/train/2024-07-22-15_56_03
[2024-07-22 15:56:05 det.main]: Label names is saved to checkpoints/salience_detr_resnet50_800_1333/train/2024-07-22-15_56_03/label_names.txt
[2024-07-22 15:56:05 det.main]: Start training
Traceback (most recent call last):
  File "main.py", line 205, in <module>
    train()
  File "main.py", line 178, in train
    train_one_epoch_acc(
  File "/home/kb535/lss/codes/objection/Salience-DETR/util/engine.py", line 46, in train_one_epoch_acc
    loss_dict = model(images, targets)
  File "/home/kb535/anaconda3/envs/salience_detr/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/kb535/anaconda3/envs/salience_detr/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 955, in forward
    self._sync_buffers()
  File "/home/kb535/anaconda3/envs/salience_detr/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1602, in _sync_buffers
    self._sync_module_buffers(authoritative_rank)
  File "/home/kb535/anaconda3/envs/salience_detr/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1606, in _sync_module_buffers
    self._default_broadcast_coalesced(authoritative_rank=authoritative_rank)
  File "/home/kb535/anaconda3/envs/salience_detr/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1627, in _default_broadcast_coalesced
    self._distributed_broadcast_coalesced(
  File "/home/kb535/anaconda3/envs/salience_detr/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1543, in _distributed_broadcast_coalesced
    dist._broadcast_coalesced(
RuntimeError: Tensors must be CUDA and dense
Traceback (most recent call last):
  File "main.py", line 205, in <module>
    train()
  File "main.py", line 178, in train
    train_one_epoch_acc(
  File "/home/kb535/lss/codes/objection/Salience-DETR/util/engine.py", line 46, in train_one_epoch_acc
    loss_dict = model(images, targets)
  File "/home/kb535/anaconda3/envs/salience_detr/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/kb535/anaconda3/envs/salience_detr/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 955, in forward
    self._sync_buffers()
  File "/home/kb535/anaconda3/envs/salience_detr/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1602, in _sync_buffers
    self._sync_module_buffers(authoritative_rank)
  File "/home/kb535/anaconda3/envs/salience_detr/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1606, in _sync_module_buffers
    self._default_broadcast_coalesced(authoritative_rank=authoritative_rank)
  File "/home/kb535/anaconda3/envs/salience_detr/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1627, in _default_broadcast_coalesced
    self._distributed_broadcast_coalesced(
  File "/home/kb535/anaconda3/envs/salience_detr/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1543, in _distributed_broadcast_coalesced
    dist._broadcast_coalesced(
RuntimeError: Tensors must be CUDA and dense
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 117231) of binary: /home/kb535/anaconda3/envs/salience_detr/bin/python
Traceback (most recent call last):
  File "/home/kb535/anaconda3/envs/salience_detr/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/kb535/anaconda3/envs/salience_detr/lib/python3.8/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/home/kb535/anaconda3/envs/salience_detr/lib/python3.8/site-packages/accelerate/commands/launch.py", line 1088, in launch_command
    multi_gpu_launcher(args)
  File "/home/kb535/anaconda3/envs/salience_detr/lib/python3.8/site-packages/accelerate/commands/launch.py", line 733, in multi_gpu_launcher
    distrib_run.run(args)
  File "/home/kb535/anaconda3/envs/salience_detr/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run
    elastic_launch(
  File "/home/kb535/anaconda3/envs/salience_detr/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/kb535/anaconda3/envs/salience_detr/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

main.py FAILED

Failures:
[1]:
  time      : 2024-07-22_15:56:26
  host      : KB535
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 117232)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
  time      : 2024-07-22_15:56:26
  host      : KB535
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 117231)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

xiuqhou commented 3 months ago

> When resume_from_checkpoint=None and I run CUDA_VISIBLE_DEVICES=0,1 accelerate launch main.py, it's okay and training is normal.

Thank you. It seems that resuming from the checkpoint weights via resume_from_checkpoint caused this error. I will test the code and find out why this happens.

ssli23 commented 3 months ago

> When resume_from_checkpoint=None and I run CUDA_VISIBLE_DEVICES=0,1 accelerate launch main.py, it's okay and training is normal.

> Thank you. It seems that resuming from the checkpoint weights via resume_from_checkpoint caused this error. I will test the code and find out why this happens.

Looking forward to your reply.

xiuqhou commented 3 months ago

I found that the error is caused by loading the model checkpoint after accelerator.prepare. I have fixed the error and tested that it works correctly. You can try the updated code!
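For anyone hitting this before pulling the update, a minimal sketch of the load-before-prepare ordering described above (not the repository's exact fix; `model`, `optimizer`, and `lr_scheduler` are assumed to be built as in train_config.py, and the "model" checkpoint key is hypothetical):

import torch
from accelerate import Accelerator

accelerator = Accelerator()

# Load the pretrained weights into the plain (unwrapped) model first ...
checkpoint = torch.load(
    "checkpoints/salience_detr_resnet50_800_1333_coco_2x.pth", map_location="cpu"
)
model.load_state_dict(checkpoint.get("model", checkpoint), strict=False)

# ... and only afterwards let accelerate wrap the model in DDP and move it to
# the proper CUDA device on each rank, so every buffer it later broadcasts is
# already a dense CUDA tensor.
model, optimizer, lr_scheduler = accelerator.prepare(model, optimizer, lr_scheduler)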