zengyan-97 / X2-VLM

All-In-One VLM: Image + Video + Transfer to Other Languages / Domains (TPAMI 2023)

Pretrain issue #17

Closed: abueidvchow closed this issue 3 months ago

abueidvchow commented 3 months ago

Hi, I got an error while pretraining the model: "RuntimeError: CUDA out of memory." I tried setting the batch size to 2 (2 × 1), but the error still occurred. I only changed the pretraining dataset to my custom dataset; all other configurations are the same as in your code.

Here is the config:

```yaml
## Data
train_file: ["./RSTPReid/pretrain/train_image.json"]
train_dataset_size: 18505  # for IterableDataset
images: {image_key: "file_path", image_root: "./RSTPReid/imgs/",
         is_image_rpath: True,  # read path or base64 encoding
         caption_key: "captions",
         tokenized: False,  # whether texts have been tokenized
         batch_size: 2,  # 128 x 8 = 1024
         num_workers: 1}  # better -> the total number of training files % (world_size * num_workers) == 0

train_file_regions: ['./RSTPReid/pretrain/train_bb.json']
regions: {image_key: "image", image_root: "/mnt/csip-101/reid-datasets/RSTPReid/imgs/",
          is_image_rpath: True, caption_key: "caption", tokenized: False,
          iter_perc: 1, batch_size: 128, max_images: 50, max_regions: 5,
          min_perc_in_image: 0.5, num_workers: 1}

## Vision Encoder
use_beit_v2: True
vision_config: 'configs/config_beit2_base.json'
image_res: 224
patch_size: 16
local_attn_depth: -1

## Text Encoder (& Cross Encoder)
text_encoder: './weight/bert-base-uncased'
text_num_hidden_layers: 18  # include cross
text_fusion_start_at: 12

## Training
mixed_in_batch: True
calc_image_bbox_loss: False
embed_dim: 256
temp: 0.07

max_words: 40
max_tokens: 40
mask_prob: 0.5
max_masks: 12
mask_whole_word: True
skipgram_prb: 0.2
skipgram_size: 3

## Other Settings
ckpt_frequent_step: 50000
ckpt_frequent: 1000000000  # epoch
optimizer: {opt: adamW, lr: 1e-4, weight_decay: 0.01, lr_mult: 2}
schedular: {sched: linear, lr: 1e-4, epochs: 101, num_warmup_steps: 2500}  # previously ran 200k steps; now it seems 500k steps are needed
accelerator: {SYNCBN: false, FP16_OPT_LEVEL: O1, FP16_LOSS_SCALE: dynamic, RNG_SEED: 42, GRAD_ACCUMULATE_STEPS: 1, CLIP_GRAD_NORM: 1.0}
```
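Two independent dataloaders are configured here, each with its own `batch_size`. The keys that matter for memory, copied from the config above:

```yaml
images:  {batch_size: 2}     # image-caption loader: 2 samples per GPU per step
regions: {batch_size: 128}   # region loader: still 128 samples per GPU per step
```

Lowering only `images.batch_size` leaves the region loader's memory footprint untouched, which is what the follow-up below identifies as the cause.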

abueidvchow commented 3 months ago

Figured it out: the region batch size was not set; the regions loader was still running with batch_size: 128.
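For anyone hitting the same OOM: the fix is to shrink the regions loader's batch size alongside the images loader, for example (illustrative values, not an official recommendation):

```yaml
regions: {image_key: "image", image_root: "/mnt/csip-101/reid-datasets/RSTPReid/imgs/",
          is_image_rpath: True, caption_key: "caption", tokenized: False, iter_perc: 1,
          batch_size: 2,   # reduced from 128; illustrative value matching images.batch_size
          max_images: 50, max_regions: 5, min_perc_in_image: 0.5, num_workers: 1}
```

If the smaller effective batch hurts pretraining quality, raising GRAD_ACCUMULATE_STEPS in the accelerator block is one way to compensate (assuming the trainer honors it), at the cost of slower wall-clock steps.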