zd11024 / NaviLLM

[CVPR 2024] The code for paper 'Towards Learning a Generalist Model for Embodied Navigation'
MIT License
101 stars, 7 forks

8*A100 out of memory #13

Closed Jzian closed 1 month ago

Jzian commented 1 month ago

Hi, thanks for open-sourcing this! I have a problem with the training setup. I first trained on 8*A100 with gradient_accumulation_step set to 8, the same as yours, but it ran out of memory immediately. Then I set gradient_accumulation_step to 6 (an effective batch size of 48), and it ran out of memory at epoch 11. Do you have any ideas about this issue?
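For reference, the batch size of 48 mentioned above can be reproduced from the values in the log (world_size 8, batch_size 1, gradient_accumulation_step 6). The sketch below assumes the usual data-parallel convention that the effective batch size is the product of these three numbers; the helper name `effective_batch_size` is hypothetical and not part of NaviLLM.

```python
def effective_batch_size(world_size: int, batch_size: int, grad_accum_steps: int) -> int:
    """Samples contributing to one optimizer step under data parallelism.

    Assumes each of `world_size` GPUs processes `batch_size` samples per
    forward/backward pass and gradients are accumulated over
    `grad_accum_steps` passes before the optimizer steps.
    """
    return world_size * batch_size * grad_accum_steps


# Values taken from the log in this issue: world_size 8, batch_size 1.
print(effective_batch_size(8, 1, 6))  # accumulation of 6, as in the failing run
print(effective_batch_size(8, 1, 8))  # accumulation of 8, the original setting
```

Under this convention, the accumulation step count changes how often the optimizer steps, not the size of each forward/backward pass, so lowering it from 8 to 6 would not normally be expected to reduce peak GPU memory by much. Whether that holds for this codebase is an assumption.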

My log:
2024-07-11 11:10:59,424 INFO **Start logging**
2024-07-11 11:10:59,425 INFO CUDA_VISIBLE_DEVICES=ALL
2024-07-11 11:10:59,425 INFO data_dir /data/ws/NaviLLM/data
2024-07-11 11:10:59,425 INFO cfg_file configs/multi.yaml
2024-07-11 11:10:59,425 INFO pretrained_model_name_or_path /data/ws/NaviLLM/data/models/Vicuna-7B
2024-07-11 11:10:59,425 INFO off_batch_task False
2024-07-11 11:10:59,425 INFO debug False
2024-07-11 11:10:59,425 INFO seed 0
2024-07-11 11:10:59,425 INFO num_epochs 30
2024-07-11 11:10:59,425 INFO resume_from_checkpoint None
2024-07-11 11:10:59,426 INFO from_scratch False
2024-07-11 11:10:59,426 INFO batch_size 1
2024-07-11 11:10:59,426 INFO val_batch_size 2
2024-07-11 11:10:59,426 INFO lr 3e-05
2024-07-11 11:10:59,426 INFO feat_dropout 0.4
2024-07-11 11:10:59,426 INFO num_warmup_steps 0
2024-07-11 11:10:59,426 INFO num_steps_per_epoch 2000
2024-07-11 11:10:59,426 INFO gradient_accumulation_step 6
2024-07-11 11:10:59,427 INFO precision amp_bf16
2024-07-11 11:10:59,427 INFO workers 0
2024-07-11 11:10:59,427 INFO world_size 8
2024-07-11 11:10:59,427 INFO local_rank 0
2024-07-11 11:10:59,427 INFO dist_url env://
2024-07-11 11:10:59,427 INFO dist_backend nccl
2024-07-11 11:10:59,427 INFO horovod False
2024-07-11 11:10:59,427 INFO no_set_device_rank False
2024-07-11 11:10:59,428 INFO output_dir output/multi_wo_pretrain
2024-07-11 11:10:59,428 INFO max_saved_checkpoints 1
2024-07-11 11:10:59,428 INFO save_ckpt_per_epochs 10
2024-07-11 11:10:59,428 INFO save_latest_states False
2024-07-11 11:10:59,428 INFO save_pred_results False
2024-07-11 11:10:59,428 INFO save_detail_results False
2024-07-11 11:10:59,428 INFO mode train
2024-07-11 11:10:59,428 INFO stage multi
2024-07-11 11:10:59,429 INFO ignoreid -100
2024-07-11 11:10:59,429 INFO enable_og True
2024-07-11 11:10:59,429 INFO enable_summarize True
2024-07-11 11:10:59,429 INFO enable_fgr2r True
2024-07-11 11:10:59,429 INFO gen_loss_coef 1.0
2024-07-11 11:10:59,429 INFO obj_loss_coef 1.0
2024-07-11 11:10:59,429 INFO teacher_forcing_coef 1.0
2024-07-11 11:10:59,429 INFO fuse_obj False
2024-07-11 11:10:59,430 INFO multi_endpoints 1
2024-07-11 11:10:59,430 INFO path_type trusted_path
2024-07-11 11:10:59,430 INFO test_datasets ['CVDN', 'R2R']
2024-07-11 11:10:59,430 INFO validation_split val_unseen
2024-07-11 11:10:59,430 INFO do_sample False
2024-07-11 11:10:59,430 INFO temperature 1.0
2024-07-11 11:10:59,430 INFO max_datapoints None
2024-07-11 11:10:59,430 INFO rank 0
2024-07-11 11:10:59,430 INFO distributed True
2024-07-11 11:10:59,430 INFO device cuda:0
2024-07-11 11:10:59,431 INFO image_feat_size 1024
2024-07-11 11:10:59,431 INFO obj_feat_size 768
2024-07-11 11:10:59,431 INFO angle_feat_size 4
2024-07-11 11:10:59,431 INFO enc_full_graph True
2024-07-11 11:10:59,431 INFO expert_policy spl
2024-07-11 11:10:59,431 INFO num_pano_layers 2
2024-07-11 11:10:59,431 INFO ----------- Feature -----------
2024-07-11 11:10:59,431 INFO cfg.Feature.object_feature_type:
2024-07-11 11:10:59,432 INFO cfg.Feature.angle_feat_size: 4
2024-07-11 11:10:59,432 INFO cfg.Feature.max_objects: 70
2024-07-11 11:10:59,432 INFO cfg.Feature.image_feat_size: 1024
2024-07-11 11:10:59,432 INFO ----------- feature_database -----------
2024-07-11 11:10:59,432 INFO cfg.Feature.feature_database.mp3d: eva_features/mp3d_EVA02-CLIP-L-14-336.hdf5
2024-07-11 11:10:59,432 INFO cfg.Feature.feature_database.scan_qa: eva_features/scanqa_EVA02-CLIP-L-14-336.hdf5
2024-07-11 11:10:59,432 INFO cfg.Feature.feature_database.coco: eva_features/coco_EVA02-CLIP-L-14-336.hdf5
2024-07-11 11:10:59,432 INFO cfg.Feature.obj_feat_size: 768
2024-07-11 11:10:59,433 INFO ----------- object_database -----------
2024-07-11 11:10:59,433 INFO cfg.Feature.object_database.reverie: obj_features/reverie_obj_feat
2024-07-11 11:10:59,433 INFO cfg.Feature.object_database.soon: obj_features/soon_obj_feat
2024-07-11 11:10:59,433 INFO ----------- Dataset -----------
2024-07-11 11:10:59,433 INFO ----------- R2R -----------
2024-07-11 11:10:59,433 INFO cfg.Dataset.R2R.DIR: R2R
2024-07-11 11:10:59,433 INFO ----------- SPLIT -----------
2024-07-11 11:10:59,434 INFO cfg.Dataset.R2R.SPLIT.train: FGR2R_train.json
2024-07-11 11:10:59,434 INFO cfg.Dataset.R2R.SPLIT.val_seen: R2R_val_seen_enc.json
2024-07-11 11:10:59,434 INFO cfg.Dataset.R2R.SPLIT.val_unseen: R2R_val_unseen_enc.json
2024-07-11 11:10:59,434 INFO cfg.Dataset.R2R.SPLIT.test: R2R_test_enc.json
2024-07-11 11:10:59,434 INFO ----------- REVERIE -----------
2024-07-11 11:10:59,434 INFO cfg.Dataset.REVERIE.DIR: REVERIE
2024-07-11 11:10:59,434 INFO cfg.Dataset.REVERIE.bbox_file: BBoxes.json
2024-07-11 11:10:59,434 INFO ----------- SPLIT -----------
2024-07-11 11:10:59,435 INFO cfg.Dataset.REVERIE.SPLIT.train: REVERIE_train_enc.json
2024-07-11 11:10:59,435 INFO cfg.Dataset.REVERIE.SPLIT.val_seen: REVERIE_val_seen_enc.json
2024-07-11 11:10:59,435 INFO cfg.Dataset.REVERIE.SPLIT.val_unseen: REVERIE_val_unseen_enc.json
2024-07-11 11:10:59,435 INFO cfg.Dataset.REVERIE.SPLIT.test: REVERIE_test_enc.json
2024-07-11 11:10:59,435 INFO ----------- CVDN -----------
2024-07-11 11:10:59,435 INFO cfg.Dataset.CVDN.DIR: CVDN
2024-07-11 11:10:59,435 INFO ----------- SPLIT -----------
2024-07-11 11:10:59,435 INFO cfg.Dataset.CVDN.SPLIT.train: train.json
2024-07-11 11:10:59,436 INFO cfg.Dataset.CVDN.SPLIT.val_seen: val_seen.json
2024-07-11 11:10:59,436 INFO cfg.Dataset.CVDN.SPLIT.val_unseen: val_unseen.json
2024-07-11 11:10:59,436 INFO cfg.Dataset.CVDN.SPLIT.test: test_cleaned.json
2024-07-11 11:10:59,436 INFO ----------- SOON -----------
2024-07-11 11:10:59,436 INFO cfg.Dataset.SOON.DIR: SOON
2024-07-11 11:10:59,436 INFO ----------- SPLIT -----------
2024-07-11 11:10:59,436 INFO cfg.Dataset.SOON.SPLIT.train: train_enc_pseudo_obj_ade30k_label.jsonl
2024-07-11 11:10:59,436 INFO cfg.Dataset.SOON.SPLIT.val_seen: val_unseen_instrs_enc_pseudo_obj_ade30k_label.jsonl
2024-07-11 11:10:59,436 INFO cfg.Dataset.SOON.SPLIT.val_unseen: val_unseen_house_enc_pseudo_obj_ade30k_label.jsonl
2024-07-11 11:10:59,436 INFO cfg.Dataset.SOON.SPLIT.test: test_v2_enc.jsonl
2024-07-11 11:10:59,437 INFO ----------- ScanQA -----------
2024-07-11 11:10:59,437 INFO cfg.Dataset.ScanQA.DIR: ScanQA
2024-07-11 11:10:59,437 INFO ----------- SPLIT -----------
2024-07-11 11:10:59,437 INFO cfg.Dataset.ScanQA.SPLIT.train: ScanQA_v1.0_train_reformat.json
2024-07-11 11:10:59,437 INFO cfg.Dataset.ScanQA.SPLIT.val_unseen: ScanQA_v1.0_val_reformat.json
2024-07-11 11:10:59,437 INFO cfg.Dataset.ScanQA.SPLIT.test_wo_obj: ScanQA_v1.0_test_wo_obj_reformat.json
2024-07-11 11:10:59,437 INFO cfg.Dataset.ScanQA.SPLIT.test_w_obj: ScanQA_v1.0_test_w_obj_reformat.json
2024-07-11 11:10:59,437 INFO ----------- EQA -----------
2024-07-11 11:10:59,438 INFO cfg.Dataset.EQA.DIR: EQA_MP3D
2024-07-11 11:10:59,438 INFO ----------- SPLIT -----------
2024-07-11 11:10:59,438 INFO cfg.Dataset.EQA.SPLIT.val_unseen: eqa_val_enc.json
2024-07-11 11:10:59,438 INFO cfg.Dataset.EQA.ANSWER_VOCAB: eqa_answer_vocab.json
2024-07-11 11:10:59,438 INFO ----------- R2R_AUG -----------
2024-07-11 11:10:59,438 INFO cfg.Dataset.R2R_AUG.DIR: R2R
2024-07-11 11:10:59,438 INFO ----------- SPLIT -----------
2024-07-11 11:10:59,438 INFO cfg.Dataset.R2R_AUG.SPLIT.train: R2R_prevalent_aug_train_enc.jsonl
2024-07-11 11:10:59,438 INFO ----------- REVERIE_AUG -----------
2024-07-11 11:10:59,439 INFO cfg.Dataset.REVERIE_AUG.DIR: REVERIE
2024-07-11 11:10:59,439 INFO cfg.Dataset.REVERIE_AUG.bbox_file: BBoxes.json
2024-07-11 11:10:59,439 INFO ----------- SPLIT -----------
2024-07-11 11:10:59,439 INFO cfg.Dataset.REVERIE_AUG.SPLIT.train: REVERIE_speaker_aug_enc.jsonl
2024-07-11 11:10:59,439 INFO ----------- LLaVA -----------
2024-07-11 11:10:59,439 INFO cfg.Dataset.LLaVA.DIR: LLaVA
2024-07-11 11:10:59,439 INFO ----------- SPLIT -----------
2024-07-11 11:10:59,439 INFO cfg.Dataset.LLaVA.SPLIT.train: detail_23k.json
2024-07-11 11:10:59,440 INFO ----------- Pretrain -----------
2024-07-11 11:10:59,440 INFO cfg.Pretrain.SOURCE: ['R2R_AUG', 'REVERIE_AUG', 'R2R', 'REVERIE', 'SOON', 'CVDN', 'ScanQA']
2024-07-11 11:10:59,440 INFO cfg.Pretrain.Ratio: [20, 2, 1, 1, 1, 1, 1]
2024-07-11 11:10:59,440 INFO ----------- LOSS_COEF -----------
2024-07-11 11:10:59,440 INFO cfg.Pretrain.LOSS_COEF.R2R_AUG: 1
2024-07-11 11:10:59,440 INFO cfg.Pretrain.LOSS_COEF.REVERIE_AUG: 1
2024-07-11 11:10:59,440 INFO ----------- Multi -----------
2024-07-11 11:10:59,440 INFO cfg.Multi.SOURCE: ['R2R', 'REVERIE', 'CVDN', 'SOON', 'ScanQA', 'LLaVA']
2024-07-11 11:10:59,440 INFO cfg.Multi.Ratio: [20, 5, 1, 5, 5, 5]
2024-07-11 11:10:59,441 INFO ----------- LOSS_COEF -----------
2024-07-11 11:10:59,441 INFO ----------- Model -----------
2024-07-11 11:10:59,441 INFO cfg.Model.num_l_layers: 9
2024-07-11 11:10:59,441 INFO cfg.Model.num_pano_layers: 2
2024-07-11 11:10:59,441 INFO cfg.Model.num_x_layers: 4
2024-07-11 11:10:59,441 INFO cfg.Model.graph_sprels: True
2024-07-11 11:10:59,441 INFO cfg.Model.fusion: dynamic
2024-07-11 11:10:59,441 INFO cfg.Model.enc_full_graph: True
2024-07-11 11:10:59,441 INFO cfg.Model.expert_policy: spl
2024-07-11 11:10:59,442 INFO ----------- Optim -----------
2024-07-11 11:10:59,442 INFO ----------- val_max_action_len -----------
2024-07-11 11:10:59,442 INFO cfg.Optim.val_max_action_len.R2R: 15
2024-07-11 11:10:59,442 INFO cfg.Optim.val_max_action_len.REVERIE: 15
2024-07-11 11:10:59,442 INFO cfg.Optim.val_max_action_len.CVDN: 30
2024-07-11 11:10:59,442 INFO cfg.Optim.val_max_action_len.SOON: 20
2024-07-11 11:10:59,442 INFO cfg.Optim.val_max_action_len.EQA: 15
2024-07-11 11:10:59,442 INFO ----------- train_max_action_len -----------
2024-07-11 11:10:59,443 INFO cfg.Optim.train_max_action_len.R2R: 15
2024-07-11 11:10:59,443 INFO cfg.Optim.train_max_action_len.REVERIE: 15
2024-07-11 11:10:59,443 INFO cfg.Optim.train_max_action_len.CVDN: 15
2024-07-11 11:10:59,443 INFO cfg.Optim.train_max_action_len.SOON: 15
2024-07-11 11:10:59,443 INFO cfg.Optim.train_max_action_len.EQA: 15
2024-07-11 11:10:59,443 INFO cfg.Optim.train_max_action_len.R2R_AUG: 15
2024-07-11 11:10:59,443 INFO cfg.Optim.train_max_action_len.REVERIE_AUG: 15
2024-07-11 11:11:13,099 INFO [INFO] R2RDataset loaded with 14039 instructions, using splits: train
2024-07-11 11:11:13,100 INFO

2024-07-11 12:28:51,933 INFO validate val_unseen split on CVDN task
2024-07-11 12:31:26,332 INFO eval 912 predictions
2024-07-11 12:31:26,398 INFO validate val_unseen split on R2R task
2024-07-11 12:33:15,297 INFO eval 2352 predictions
2024-07-11 12:33:15,355 INFO
[Eval] val_unseen epoch 0

[Eval] dataset=[CVDN] , lengths: 65.57, nav_error: 17.43, oracle_sr: 40.57
[Eval] ||| sr: 7.02, spl: 3.56, oracle path_success_rate: 72.92, dist_to_end_reduction: 2.08
[Eval] dataset=[R2R] , action_steps: 7.12, steps: 9.46, lengths: 18.99, nav_error: 9.59, oracle_error: 4.74
[Eval] ||| sr: 19.09, oracle_sr: 39.16, spl: 14.97
2024-07-11 12:33:15,357 INFO Current Score: 0.24947403834097154
2024-07-11 12:33:15,357 INFO Best Score: 0.24947403834097154

...

2024-07-12 01:54:48,425 INFO train [11] epoch
2024-07-12 01:54:48,426 INFO Loss: 7.54 Instr_pred: 1.29 R2R: 9.14 REVERIE: 8.96 CVDN: 8.30 SOON: 12.73 ScanQA: 1.25 LLaVA: 1.36

2024-07-12 01:54:48,429 INFO validate val_unseen split on CVDN task
2024-07-12 01:56:46,665 INFO eval 912 predictions
2024-07-12 01:56:46,711 INFO validate val_unseen split on R2R task
2024-07-12 01:58:22,814 INFO eval 2352 predictions
2024-07-12 01:58:22,867 INFO
[Eval] val_unseen epoch 11

[Eval] dataset=[CVDN] , lengths: 39.21, nav_error: 15.38, oracle_sr: 51.21
[Eval] ||| sr: 11.62, spl: 7.76, oracle path_success_rate: 79.28, dist_to_end_reduction: 4.23
[Eval] dataset=[R2R] , action_steps: 6.48, steps: 7.32, lengths: 14.29, nav_error: 4.44, oracle_error: 2.07
[Eval] ||| sr: 60.97, oracle_sr: 75.17, spl: 52.68
2024-07-12 01:58:22,869 INFO Current Score: 0.8780779540421723
2024-07-12 01:58:22,869 INFO Best Score: 0.8797513951790916

zd11024 commented 1 month ago

That is quite weird. In my experiments, training ran successfully when it had all 8 A100s to itself. If you still cannot resolve the problem, you could try adjusting max_action_len in the configs.
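As a rough illustration of that suggestion, the sketch below caps the per-task training limits. It assumes that max_action_len refers to the `Optim.train_max_action_len` entries printed in the log above and that the config loads into a nested dictionary with the same key names; the helper `cap_action_len` and the new limit of 10 are hypothetical, chosen only for illustration.

```python
import copy


def cap_action_len(cfg: dict, new_max: int) -> dict:
    """Return a copy of cfg with each train_max_action_len entry capped at new_max.

    Hypothetical helper: the key layout mirrors the
    cfg.Optim.train_max_action_len.* lines in the log, not a confirmed API.
    """
    out = copy.deepcopy(cfg)  # leave the caller's config untouched
    limits = out["Optim"]["train_max_action_len"]
    for task, value in limits.items():
        limits[task] = min(value, new_max)
    return out


# Values copied from the log; the cap of 10 is an arbitrary example.
cfg = {
    "Optim": {
        "train_max_action_len": {
            "R2R": 15, "REVERIE": 15, "CVDN": 15, "SOON": 15,
            "EQA": 15, "R2R_AUG": 15, "REVERIE_AUG": 15,
        }
    }
}
capped = cap_action_len(cfg, 10)
print(capped["Optim"]["train_max_action_len"])
```

In practice the same change would be made by editing the corresponding values in configs/multi.yaml directly. Shorter training trajectories mean fewer steps held in memory per sample, which is presumably why the setting helps, though the actual savings are not measured here.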

Jzian commented 1 month ago

Thanks for your reply!