Open ssli23 opened 3 months ago
When resume_from_checkpoint=None and I run CUDA_VISIBLE_DEVICES=0,1 accelerate launch main.py, it's okay; the training is normal.
Hi @ssli23, can you provide more of the error output from the script?
This may happen when you use multiple GPUs and some tensors end up on the CPU, but we don't know which operation caused the error. You can trace back to where the last error occurred and print the tensor to see which device it is on.
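For example, a quick way to print which device a suspicious tensor lives on (a minimal debugging sketch, not code from this repository):

```python
import torch

def report_device(name, tensor):
    # Print where a tensor lives so CPU/GPU mismatches are easy to spot.
    print(f"{name}: device={tensor.device}, dtype={tensor.dtype}, shape={tuple(tensor.shape)}")

x = torch.randn(2, 3)                    # created on the CPU by default
report_device("x", x)                    # -> x: device=cpu, ...
if torch.cuda.is_available():
    report_device("x_cuda", x.cuda())    # -> x_cuda: device=cuda:0, ...
```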
The following values were not passed to accelerate launch and had defaults used instead:
    --num_processes was set to a value of 2
        More than one GPU was found, enabling multi-GPU training.
        If this was unintended please pass in --num_processes=1.
    --num_machines was set to a value of 1
    --mixed_precision was set to a value of 'no'
    --dynamo_backend was set to a value of 'no'
To avoid this warning pass in values for each of the problematic parameters or run accelerate config.
loading annotations into memory...
loading annotations into memory...
Done (t=0.00s)
creating index...
Done (t=0.00s)
creating index...
index created!
index created!
loading annotations into memory...
loading annotations into memory...
Done (t=0.00s)
creating index...
index created!
Done (t=0.00s)
creating index...
index created!
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 1
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 0
INFO:torch.distributed.distributed_c10d:Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
INFO:torch.distributed.distributed_c10d:Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
[2024-07-22 15:56:03 det.util.misc]: Rank of current process: 0, World size: 2
Using [0, 0.5, 0.6299605249474366, 0.7937005259840997, 1.0, 1.2599210498948732, 1.5874010519681994, 2.0, inf] as bins for aspect ratio quantization
Count of instances per bin: [866]
Using /home/kb535/.cache/torch_extensions/py38_cu113 as PyTorch extensions root...
[2024-07-22 15:56:04 det.util.misc]: Environment info:
sys.platform                     linux
Python                           3.8.19 (default, Mar 20 2024, 19:58:24) [GCC 11.2.0]
numpy                            1.24.3
PyTorch                          1.11.0 @/home/kb535/anaconda3/envs/salience_detr/lib/python3.8/site-packages/torch
PyTorch debug build              False
torch._C._GLIBCXX_USE_CXX11_ABI  False
GPU available                    Yes
GPU 0,1                          NVIDIA GeForce RTX 3090 (arch=8.6)
Driver version                   535.183.01
CUDA_HOME                        /usr/local/cuda
Pillow                           10.4.0
torchvision                      0.12.0 @/home/kb535/anaconda3/envs/salience_detr/lib/python3.8/site-packages/torchvision
torchvision arch flags           3.5, 5.0, 6.0, 7.0, 7.5, 8.0, 8.6
fvcore                           0.1.5.post20220414
iopath                           0.1.9
cv2                              4.10.0
PyTorch built with:
[2024-07-22 15:56:04 det.util.misc]: Command line arguments: Namespace(accumulate_steps=1, config_file='configs/train_config.py', dynamo_backend='no', mixed_precision=None, seed=None, use_deterministic_algorithms=False)
[2024-07-22 15:56:04 det.util.misc]: Contents of args.config_file=configs/train_config.py:

from torch import optim

from datasets.coco import CocoDetection
from transforms import presets
from optimizer import param_dict

num_epochs = 24      # train epochs
batch_size = 4       # total_batch_size = #GPU x batch_size
num_workers = 4      # workers for pytorch DataLoader
pin_memory = True    # whether pin_memory for pytorch DataLoader
print_freq = 50      # frequency to print logs
starting_epoch = 0
max_norm = 0.1       # clip gradient norm

output_dir = None    # path to save checkpoints, default for None: checkpoints/{model_name}
find_unused_parameters = False  # useful for debugging distributed training

coco_path = "/home/kb535/lss/data/200png_T2/purui_coco"  # /PATH/TO/YOUR/COCODIR
train_transform = presets.detr  # see transforms/presets to choose a transform
train_dataset = CocoDetection(
    img_folder=f"{coco_path}/train2014",
    ann_file=f"{coco_path}/annotations/instances_train2014.json",
    transforms=train_transform,
    train=True,
)
test_dataset = CocoDetection(
    img_folder=f"{coco_path}/val2014",
    ann_file=f"{coco_path}/annotations/instances_val2014.json",
    transforms=None,  # the eval_transform is integrated in the model
)

model_path = "configs/salience_detr/salience_detr_resnet50_800_1333.py"
resume_from_checkpoint = '/home/kb535/lss/codes/objection/Salience-DETR/checkpoints/salience_detr_resnet50_800_1333_coco_2x.pth'

learning_rate = 1e-4  # initial learning rate
optimizer = optim.AdamW(lr=learning_rate, weight_decay=1e-4, betas=(0.9, 0.999))
lr_scheduler = optim.lr_scheduler.MultiStepLR(milestones=[10], gamma=0.1)

param_dicts = param_dict.finetune_backbone_and_linear_projection(lr=learning_rate)
[2024-07-22 15:56:04 det.util.misc]: Using the random seed: 4521984
Using [0, 0.5, 0.6299605249474366, 0.7937005259840997, 1.0, 1.2599210498948732, 1.5874010519681994, 2.0, inf] as bins for aspect ratio quantization
Count of instances per bin: [866]
Using /home/kb535/.cache/torch_extensions/py38_cu113 as PyTorch extensions root...
Loading extension module MultiScaleDeformableAttention...
WARNING [2024-07-22 15:56:04 py.warnings]: /home/kb535/lss/codes/objection/Salience-DETR/models/bricks/ms_deform_attn.py:24: UserWarning: Failed to load MultiScaleDeformableAttention C++ extension: /home/kb535/.cache/torch_extensions/py38_cu113/MultiScaleDeformableAttention/MultiScaleDeformableAttention.so: cannot open shared object file: No such file or directory
  warnings.warn(f"Failed to load MultiScaleDeformableAttention C++ extension: {e}")
[2024-07-22 15:56:04 det.models.backbones.base_backbone]: Backbone architecture: resnet50
[2024-07-22 15:56:04 det.util.utils]:
Root Cause (first observed failure):
[0]:
  time      : 2024-07-22_15:56:26
  host      : KB535
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 117231)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
When resume_from_checkpoint=None and I run CUDA_VISIBLE_DEVICES=0,1 accelerate launch main.py, it's okay; the training is normal.
Thank you. It seems that loading the weights via resume_from_checkpoint caused this error. I will test the code and find out why this happens.
Looking forward to your reply.
I found that the error is caused by loading the model checkpoint after accelerate.prepare. I have fixed the error and tested that it now runs correctly. You can try the updated code!
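For anyone hitting the same issue, the general pattern is to load the checkpoint into the plain nn.Module before handing it to accelerator.prepare. A minimal sketch assuming a standard Accelerate training loop (the Linear stand-in, the "checkpoint.pth" path, and the "model" key are placeholders, not the repository's actual API):

```python
import torch
from accelerate import Accelerator

accelerator = Accelerator()
model = torch.nn.Linear(8, 8)   # stand-in for the real detector
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Load the checkpoint while the model is still an unwrapped nn.Module.
# map_location="cpu" keeps the weights off the GPU; prepare() moves the
# parameters to the correct device for each process afterwards.
checkpoint = torch.load("checkpoint.pth", map_location="cpu")  # hypothetical path
state_dict = checkpoint.get("model", checkpoint)  # the "model" key is an assumption
model.load_state_dict(state_dict, strict=False)

# Only after the weights are loaded is everything wrapped for (multi-GPU) training.
model, optimizer = accelerator.prepare(model, optimizer)
```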
Question
Modify resume_from_checkpoint to 'checkpoints/salience_detr_resnet50_800_1333_coco_2x.pth'.
When I use the dual-GPU training command CUDA_VISIBLE_DEVICES=0,1 accelerate launch main.py, I get:
RuntimeError: Tensors must be CUDA and dense
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 96188 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 96189) of binary: /home/kb535/anaconda3/envs/salience_detr/bin/python
But when I use the single-GPU training command CUDA_VISIBLE_DEVICES=0 accelerate launch main.py, no errors occur and training proceeds normally.
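For reference, "Tensors must be CUDA and dense" is the error NCCL raises when it is asked to communicate a CPU tensor. A minimal sketch that reproduces it under the environment above (PyTorch 1.11, NCCL backend, two GPUs); the file name is hypothetical and it would be launched with torchrun --nproc_per_node=2 repro.py:

```python
# repro.py -- hypothetical file name, run with: torchrun --nproc_per_node=2 repro.py
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)

cpu_tensor = torch.zeros(3)                  # stays on the CPU
cuda_tensor = torch.zeros(3, device="cuda")  # lives on this rank's GPU

dist.broadcast(cuda_tensor, src=0)  # fine: NCCL communicates CUDA tensors
dist.broadcast(cpu_tensor, src=0)   # raises "Tensors must be CUDA and dense" with NCCL

dist.destroy_process_group()
```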
Additional
No response