Hi, I got a "RuntimeError: CUDA out of memory" error while pretraining the model.
I tried reducing the batch size to 2 (2 * 1), but the error still occurred.
The only change I made was swapping the pretraining dataset for my custom dataset; all other configurations are the same as in your code.
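To narrow down where the memory goes, I log CUDA usage right before the step that fails. This is just a minimal debugging sketch in plain PyTorch (the function name `report_cuda_memory` is my own, not from this repo):

```python
import torch

def report_cuda_memory(tag: str = "") -> dict:
    """Return currently allocated/reserved CUDA memory in MiB.

    Returns zeros when no GPU is available, so it is safe to call anywhere.
    """
    if not torch.cuda.is_available():
        return {"allocated_mib": 0.0, "reserved_mib": 0.0}
    stats = {
        "allocated_mib": torch.cuda.memory_allocated() / 2**20,
        "reserved_mib": torch.cuda.memory_reserved() / 2**20,
    }
    print(tag, stats)
    return stats
```

Calling this before the forward pass and before `loss.backward()` shows whether the OOM comes from activations or from the optimizer state.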
Here is the config:
```yaml
## Data
train_file: ["./RSTPReid/pretrain/train_image.json"]
train_dataset_size: 18505  # for IterableDataset
images: {image_key: "file_path",
         image_root: "./RSTPReid/imgs/",
         is_image_rpath: True,  # read path or base64 encoding
         caption_key: "captions",
         tokenized: False,  # whether texts have been tokenized
         batch_size: 2,  # 128 x 8 = 1024
         num_workers: 1}  # better -> the total number of training files % (world_size * num_workers) == 0

train_file_regions: ["./RSTPReid/pretrain/train_bb.json"]
regions: {image_key: "image", image_root: "/mnt/csip-101/reid-datasets/RSTPReid/imgs/", is_image_rpath: True,
          caption_key: "caption", tokenized: False, iter_perc: 1, batch_size: 128, max_images: 50,
          max_regions: 5, min_perc_in_image: 0.5, num_workers: 1}

## Vision Encoder
use_beit_v2: True
vision_config: 'configs/config_beit2_base.json'
image_res: 224
patch_size: 16
local_attn_depth: -1

## Text Encoder (& Cross Encoder)
text_encoder: './weight/bert-base-uncased'
text_num_hidden_layers: 18  # include cross
text_fusion_start_at: 12

## Training
mixed_in_batch: True
calc_image_bbox_loss: False
embed_dim: 256
temp: 0.07

max_words: 40
max_tokens: 40
mask_prob: 0.5
max_masks: 12
mask_whole_word: True
skipgram_prb: 0.2
skipgram_size: 3

## Other Settings
ckpt_frequent_step: 50000
ckpt_frequent: 1000000000  # epoch
optimizer: {opt: adamW, lr: 1e-4, weight_decay: 0.01, lr_mult: 2}
schedular: {sched: linear, lr: 1e-4, epochs: 101, num_warmup_steps: 2500}  # previously ran 200k steps; now it looks like 500k steps are needed
accelerator: {SYNCBN: false, FP16_OPT_LEVEL: O1, FP16_LOSS_SCALE: dynamic, RNG_SEED: 42, GRAD_ACCUMULATE_STEPS: 1, CLIP_GRAD_NORM: 1.0}
```
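One thing I noticed while checking the numbers: the `regions` loader still uses `batch_size: 128`, while I only lowered the `images` batch size to 2. A back-of-envelope estimate (my own rough arithmetic for a ViT-B/16 encoder at 224px, not this repo's actual memory accounting) suggests the regions branch dominates peak memory:

```python
# Rough per-branch activation count for a ViT-B/16 encoder at 224x224.
# Hypothetical estimate; constants (hidden=768, layers=12) are standard
# ViT-Base values, not read from this repo's code.

def vit_tokens(image_res: int, patch_size: int) -> int:
    """Number of patch tokens plus the [CLS] token."""
    return (image_res // patch_size) ** 2 + 1

def rough_activation_floats(batch_size: int, image_res: int = 224,
                            patch_size: int = 16, hidden: int = 768,
                            layers: int = 12) -> int:
    """Very rough count of hidden-state floats kept for backprop."""
    return batch_size * vit_tokens(image_res, patch_size) * hidden * layers

images_cost = rough_activation_floats(batch_size=2)     # images branch
regions_cost = rough_activation_floats(batch_size=128)  # regions branch

# The regions branch holds 64x the activations of the images branch here,
# so lowering only the images batch_size barely reduces peak memory.
print(regions_cost // images_cost)  # -> 64
```

So perhaps I should be reducing `regions.batch_size` (or raising `GRAD_ACCUMULATE_STEPS`) rather than `images.batch_size`?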