PiggyJerry opened this issue 5 days ago
Could you run the below example and see if any error pops up?
```bash
NUM_GPUS=1
DISTRIBUTED_ARGS="
--nnodes=1 \
--nproc_per_node ${NUM_GPUS} \
--rdzv_backend c10d \
--rdzv_endpoint localhost:0
"
# arguments that are very likely to be changed
# according to your own case
MODEL_ID=llava-interleave-qwen-7b # model id; pick one by running `python supported_models.py`
TRAIN_DATA_PATH=./example_data/multi_images.json # path to the training data json file
EVAL_DATA_PATH=./example_data/multi_images.json # path to the evaluation data json file (optional)
IMAGE_FOLDER=./example_data/images # path to the image root folder; if provided, the image paths in the json should be relative
VIDEO_FOLDER=./example_data/videos # path to the video root folder; if provided, the video paths in the json should be relative
NUM_FRAMES=8 # how many frames are sampled from each video
TRAIN_VISION_ENCODER=False # whether to train the vision encoder
USE_VISION_LORA=False # whether to use lora for the vision encoder (only effective when `TRAIN_VISION_ENCODER` is True)
TRAIN_VISION_PROJECTOR=False # whether to train the vision projector (only full finetuning is supported)
USE_LORA=True # whether to use lora for the llm
Q_LORA=False # whether to use q-lora for the llm; only effective when `USE_LORA` is True
LORA_R=4 # the lora rank (both llm and vision encoder)
LORA_ALPHA=8 # the lora alpha (both llm and vision encoder)
RUN_ID=${MODEL_ID}_lora-${USE_LORA}_qlora-${Q_LORA} # a custom run id that determines the checkpoint folder and wandb run name
DS_STAGE=zero3 # deepspeed stage; < zero2 | zero3 >
PER_DEVICE_BATCH_SIZE=1 # batch size per GPU
GRAD_ACCUM=1 # gradient accumulation steps
NUM_EPOCHS=5 # number of training epochs
LR=2e-5 # learning rate
MODEL_MAX_LEN=512 # maximum input length of the model
torchrun $DISTRIBUTED_ARGS train.py \
--model_id $MODEL_ID \
--data_path $TRAIN_DATA_PATH \
--eval_data_path $EVAL_DATA_PATH \
--image_folder $IMAGE_FOLDER \
--video_folder $VIDEO_FOLDER \
--num_frames $NUM_FRAMES \
--output_dir ./checkpoints/$RUN_ID \
--report_to wandb \
--run_name $RUN_ID \
--deepspeed ./ds_configs/${DS_STAGE}.json \
--bf16 True \
--num_train_epochs $NUM_EPOCHS \
--per_device_train_batch_size $PER_DEVICE_BATCH_SIZE \
--per_device_eval_batch_size $PER_DEVICE_BATCH_SIZE \
--gradient_accumulation_steps $GRAD_ACCUM \
--eval_strategy "epoch" \
--save_strategy "epoch" \
--save_total_limit 1 \
--learning_rate ${LR} \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--tf32 True \
--model_max_length $MODEL_MAX_LEN \
--gradient_checkpointing True \
--dataloader_num_workers 4 \
--train_vision_encoder $TRAIN_VISION_ENCODER \
--use_vision_lora $USE_VISION_LORA \
--train_vision_projector $TRAIN_VISION_PROJECTOR \
--use_lora $USE_LORA \
--q_lora $Q_LORA \
--lora_r $LORA_R \
--lora_alpha $LORA_ALPHA
```
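As a purely optional pre-flight step (a minimal sketch reusing the variables defined above, not part of the original script), you can confirm that the data file and media folders the command points at actually exist before launching:

```bash
# Optional sanity check: verify the paths referenced by the launch command.
# Run this in the same shell, after the variables above have been set.
for p in "$TRAIN_DATA_PATH" "$EVAL_DATA_PATH"; do
  [ -f "$p" ] || echo "missing data file: $p"
done
for d in "$IMAGE_FOLDER" "$VIDEO_FOLDER"; do
  [ -d "$d" ] || echo "missing folder: $d"
done
```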
Yes, the same error.
I cannot reproduce the error on my end. I can successfully run the above script.
Can you tell me your cuda version and torch version? Mine is cuda 12.2, torch 2.5.1
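For reference, one way to read both versions from inside the active environment is shown below. Note that `torch.version.cuda` is the CUDA version the PyTorch wheel was built against, which can differ from the driver-side CUDA version that `nvidia-smi` reports.

```bash
# Print the PyTorch version, the CUDA version the wheel was built with,
# and whether CUDA is visible from this environment.
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"
nvidia-smi | head -n 5   # driver-side CUDA version, for comparison
```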
```
Traceback (most recent call last):
  File "/home/jiayi/lmms-finetune-main/train.py", line 248, in <module>
    train()
  File "/home/jiayi/lmms-finetune-main/train.py", line 240, in train
    trainer.train()
  File "/home/jiayi/.conda/envs/lmms-finetune/lib/python3.10/site-packages/transformers/trainer.py", line 2052, in train
    return inner_training_loop(
  File "/home/jiayi/.conda/envs/lmms-finetune/lib/python3.10/site-packages/transformers/trainer.py", line 2388, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/jiayi/.conda/envs/lmms-finetune/lib/python3.10/site-packages/transformers/trainer.py", line 3485, in training_step
    loss = self.compute_loss(model, inputs)
  File "/home/jiayi/.conda/envs/lmms-finetune/lib/python3.10/site-packages/transformers/trainer.py", line 3532, in compute_loss
    outputs = model(**inputs)
  File "/home/jiayi/.conda/envs/lmms-finetune/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/jiayi/.conda/envs/lmms-finetune/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/jiayi/.conda/envs/lmms-finetune/lib/python3.10/site-packages/accelerate/utils/operations.py", line 823, in forward
    return model_forward(*args, **kwargs)
  File "/home/jiayi/.conda/envs/lmms-finetune/lib/python3.10/site-packages/accelerate/utils/operations.py", line 811, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "/home/jiayi/.conda/envs/lmms-finetune/lib/python3.10/site-packages/torch/amp/autocast_mode.py", line 44, in decorate_autocast
    return func(*args, **kwargs)
  File "/home/jiayi/.conda/envs/lmms-finetune/lib/python3.10/site-packages/peft/peft_model.py", line 1644, in forward
    return self.base_model(
  File "/home/jiayi/.conda/envs/lmms-finetune/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/jiayi/.conda/envs/lmms-finetune/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/jiayi/.conda/envs/lmms-finetune/lib/python3.10/site-packages/peft/tuners/tuners_utils.py", line 197, in forward
    return self.model.forward(*args, **kwargs)
  File "/home/jiayi/.conda/envs/lmms-finetune/lib/python3.10/site-packages/transformers/models/llava/modeling_llava.py", line 522, in forward
    outputs = self.language_model(
  File "/home/jiayi/.conda/envs/lmms-finetune/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/jiayi/.conda/envs/lmms-finetune/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/jiayi/.conda/envs/lmms-finetune/lib/python3.10/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 1167, in forward
    outputs = self.model(
  File "/home/jiayi/.conda/envs/lmms-finetune/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/jiayi/.conda/envs/lmms-finetune/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/jiayi/.conda/envs/lmms-finetune/lib/python3.10/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 945, in forward
    causal_mask = self._update_causal_mask(
  File "/home/jiayi/.conda/envs/lmms-finetune/lib/python3.10/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 1036, in _update_causal_mask
    if AttentionMaskConverter._ignore_causal_mask_sdpa(
  File "/home/jiayi/.conda/envs/lmms-finetune/lib/python3.10/site-packages/transformers/modeling_attn_mask_utils.py", line 284, in _ignore_causal_mask_sdpa
    elif (is_training or not is_tracing) and torch.all(attention_mask == 1):
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```
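Because a device-side assert is reported asynchronously, the stack trace above may point at the wrong call site. A minimal way to get a more precise location, as the error message itself suggests, is to rerun the same launch with synchronous CUDA kernel launches (`finetune.sh` below is just a placeholder name for the launch script above):

```bash
# Debug-only rerun: synchronous kernel launches are slower, but the assert
# is then raised at the operation that actually triggered it.
CUDA_LAUNCH_BLOCKING=1 bash finetune.sh   # placeholder for the launch script shown above
```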
Why is this happening? My dataset looks like the following (I have a multi-image case):