PiggyJerry opened this issue 5 days ago
Could you run the below example and see if any error pops up?
```bash
NUM_GPUS=1
DISTRIBUTED_ARGS="
--nnodes=1 \
--nproc_per_node ${NUM_GPUS} \
--rdzv_backend c10d \
--rdzv_endpoint localhost:0
"
# arguments that are very likely to be changed
# according to your own case
MODEL_ID=llava-interleave-qwen-7b # model id; pick one by running `python supported_models.py`
TRAIN_DATA_PATH=./example_data/multi_images.json # path to the training data json file
EVAL_DATA_PATH=./example_data/multi_images.json # path to the evaluation data json file (optional)
IMAGE_FOLDER=./example_data/images # path to the image root folder; if provided, the image paths in the json should be relative
VIDEO_FOLDER=./example_data/videos # path to the video root folder; if provided, the video paths in the json should be relative
NUM_FRAMES=8 # how many frames are sampled from each video
TRAIN_VISION_ENCODER=False # whether to train the vision encoder
USE_VISION_LORA=False # whether to use lora for the vision encoder (only effective when `TRAIN_VISION_ENCODER` is True)
TRAIN_VISION_PROJECTOR=False # whether to train the vision projector (only full finetuning is supported)
USE_LORA=True # whether to use lora for the llm
Q_LORA=False # whether to use q-lora for the llm; only effective when `USE_LORA` is True
LORA_R=4 # the lora rank (both llm and vision encoder)
LORA_ALPHA=8 # the lora alpha (both llm and vision encoder)
RUN_ID=${MODEL_ID}_lora-${USE_LORA}_qlora-${Q_LORA} # a custom run id that determines the checkpoint folder and wandb run name
DS_STAGE=zero3 # deepspeed stage; < zero2 | zero3 >
PER_DEVICE_BATCH_SIZE=1 # batch size per GPU
GRAD_ACCUM=1 # gradient accumulation steps
NUM_EPOCHS=5 # number of training epochs
LR=2e-5 # learning rate
MODEL_MAX_LEN=512 # maximum input length of the model
torchrun $DISTRIBUTED_ARGS train.py \
--model_id $MODEL_ID \
--data_path $TRAIN_DATA_PATH \
--eval_data_path $EVAL_DATA_PATH \
--image_folder $IMAGE_FOLDER \
--video_folder $VIDEO_FOLDER \
--num_frames $NUM_FRAMES \
--output_dir ./checkpoints/$RUN_ID \
--report_to wandb \
--run_name $RUN_ID \
--deepspeed ./ds_configs/${DS_STAGE}.json \
--bf16 True \
--num_train_epochs $NUM_EPOCHS \
--per_device_train_batch_size $PER_DEVICE_BATCH_SIZE \
--per_device_eval_batch_size $PER_DEVICE_BATCH_SIZE \
--gradient_accumulation_steps $GRAD_ACCUM \
--eval_strategy "epoch" \
--save_strategy "epoch" \
--save_total_limit 1 \
--learning_rate ${LR} \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--tf32 True \
--model_max_length $MODEL_MAX_LEN \
--gradient_checkpointing True \
--dataloader_num_workers 4 \
--train_vision_encoder $TRAIN_VISION_ENCODER \
--use_vision_lora $USE_VISION_LORA \
--train_vision_projector $TRAIN_VISION_PROJECTOR \
--use_lora $USE_LORA \
--q_lora $Q_LORA \
--lora_r $LORA_R \
--lora_alpha $LORA_ALPHA
```
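As a purely optional pre-flight step (a minimal sketch reusing the variables defined above, not part of the original script), you can confirm that the data file and media folders the command points at actually exist before launching:

```bash
# Optional sanity check: verify the paths referenced by the launch command.
# Run this in the same shell, after the variables above have been set.
for p in "$TRAIN_DATA_PATH" "$EVAL_DATA_PATH"; do
  [ -f "$p" ] || echo "missing data file: $p"
done
for d in "$IMAGE_FOLDER" "$VIDEO_FOLDER"; do
  [ -d "$d" ] || echo "missing folder: $d"
done
```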
Yes, the same error.
I cannot reproduce the error on my end. I can successfully run the above script.
Can you tell me your cuda version and torch version? Mine is cuda 12.2, torch 2.5.1
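For reference, one way to read both versions from inside the active environment is shown below. Note that `torch.version.cuda` is the CUDA version the PyTorch wheel was built against, which can differ from the driver-side CUDA version that `nvidia-smi` reports.

```bash
# Print the PyTorch version, the CUDA version the wheel was built with,
# and whether CUDA is visible from this environment.
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"
nvidia-smi | head -n 5   # driver-side CUDA version, for comparison
```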
```
Traceback (most recent call last):
  File "/home/jiayi/lmms-finetune-main/train.py", line 248, in <module>
    train()
  File "/home/jiayi/lmms-finetune-main/train.py", line 240, in train
    trainer.train()
  File "/home/jiayi/.conda/envs/lmms-finetune/lib/python3.10/site-packages/transformers/trainer.py", line 2052, in train
    return inner_training_loop(
  File "/home/jiayi/.conda/envs/lmms-finetune/lib/python3.10/site-packages/transformers/trainer.py", line 2388, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/jiayi/.conda/envs/lmms-finetune/lib/python3.10/site-packages/transformers/trainer.py", line 3485, in training_step
    loss = self.compute_loss(model, inputs)
  File "/home/jiayi/.conda/envs/lmms-finetune/lib/python3.10/site-packages/transformers/trainer.py", line 3532, in compute_loss
    outputs = model(**inputs)
  File "/home/jiayi/.conda/envs/lmms-finetune/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/jiayi/.conda/envs/lmms-finetune/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/jiayi/.conda/envs/lmms-finetune/lib/python3.10/site-packages/accelerate/utils/operations.py", line 823, in forward
    return model_forward(*args, **kwargs)
  File "/home/jiayi/.conda/envs/lmms-finetune/lib/python3.10/site-packages/accelerate/utils/operations.py", line 811, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "/home/jiayi/.conda/envs/lmms-finetune/lib/python3.10/site-packages/torch/amp/autocast_mode.py", line 44, in decorate_autocast
    return func(*args, **kwargs)
  File "/home/jiayi/.conda/envs/lmms-finetune/lib/python3.10/site-packages/peft/peft_model.py", line 1644, in forward
    return self.base_model(
  File "/home/jiayi/.conda/envs/lmms-finetune/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/jiayi/.conda/envs/lmms-finetune/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/jiayi/.conda/envs/lmms-finetune/lib/python3.10/site-packages/peft/tuners/tuners_utils.py", line 197, in forward
    return self.model.forward(*args, **kwargs)
  File "/home/jiayi/.conda/envs/lmms-finetune/lib/python3.10/site-packages/transformers/models/llava/modeling_llava.py", line 522, in forward
    outputs = self.language_model(
  File "/home/jiayi/.conda/envs/lmms-finetune/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/jiayi/.conda/envs/lmms-finetune/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/jiayi/.conda/envs/lmms-finetune/lib/python3.10/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 1167, in forward
    outputs = self.model(
  File "/home/jiayi/.conda/envs/lmms-finetune/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/jiayi/.conda/envs/lmms-finetune/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/jiayi/.conda/envs/lmms-finetune/lib/python3.10/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 945, in forward
    causal_mask = self._update_causal_mask(
  File "/home/jiayi/.conda/envs/lmms-finetune/lib/python3.10/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 1036, in _update_causal_mask
    if AttentionMaskConverter._ignore_causal_mask_sdpa(
  File "/home/jiayi/.conda/envs/lmms-finetune/lib/python3.10/site-packages/transformers/modeling_attn_mask_utils.py", line 284, in _ignore_causal_mask_sdpa
    elif (is_training or not is_tracing) and torch.all(attention_mask == 1):
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```
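Because a device-side assert is reported asynchronously, the stack trace above may point at the wrong call site. A minimal way to get a more precise location, as the error message itself suggests, is to rerun the same launch with synchronous CUDA kernel launches (`finetune.sh` below is just a placeholder name for the launch script above):

```bash
# Debug-only rerun: synchronous kernel launches are slower, but the assert
# is then raised at the operation that actually triggered it.
CUDA_LAUNCH_BLOCKING=1 bash finetune.sh   # placeholder for the launch script shown above
```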
Why is this happening? My dataset looks like the following (I have a multi-image case):