[Usage] shape error raised by apply_rary_pos_emb

Describe the issue

Issue:

Hello, big thx for your great works. I was trying to run on 910B npus and has followed the installation, but I encountered the below erros. Please let me know if you need more information.

Command:

bash scripts/v1_5/finetune_npu.sh

Log:

q.shape: torch.Size([16, 32, 1417, 128])q.shape: torch.Size([16, 32, 1125, 128])
cos.shape: torch.Size([1, 1, 1125, 1125, 1, 128])
q.shape: torch.Size([1, 1, 1125, 1125, 1, 128])

cos.shape: torch.Size([1, 1, 1417, 1417, 1, 128])
q.shape: torch.Size([1, 1, 1417, 1417, 1, 128])
q.shape: torch.Size([16, 32, 1110, 128])
cos.shape: torch.Size([1, 1, 1110, 1110, 1, 128])
q.shape: torch.Size([1, 1, 1110, 1110, 1, 128])
q.shape: torch.Size([16, 32, 1284, 128])
cos.shape: torch.Size([1, 1, 1284, 1284, 1, 128])
q.shape: torch.Size([1, 1, 1284, 1284, 1, 128])
Traceback (most recent call last):
  File "/home/ma-user/work/LLaVA-NPU/llava/train/train_npu.py", line 14, in <module>
    train()
  File "/home/ma-user/work/LLaVA-NPU/llava/train/train.py", line 976, in train
    trainer.train()
  File "/home/ma-user/anaconda3/envs/llava/lib/python3.9/site-packages/transformers/trainer.py", line 1539, in train
    return inner_training_loop(
  File "/home/ma-user/anaconda3/envs/llava/lib/python3.9/site-packages/transformers/trainer.py", line 1869, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/ma-user/anaconda3/envs/llava/lib/python3.9/site-packages/transformers/trainer.py", line 2772, in training_step
    loss = self.compute_loss(model, inputs)
  File "/home/ma-user/anaconda3/envs/llava/lib/python3.9/site-packages/transformers/trainer.py", line 2795, in compute_loss
    outputs = model(**inputs)
  File "/home/ma-user/anaconda3/envs/llava/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ma-user/anaconda3/envs/llava/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ma-user/anaconda3/envs/llava/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/ma-user/anaconda3/envs/llava/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1833, in forward
    loss = self.module(*inputs, **kwargs)
  File "/home/ma-user/anaconda3/envs/llava/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ma-user/anaconda3/envs/llava/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1568, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/home/ma-user/work/LLaVA-NPU/llava/model/language_model/llava_llama.py", line 91, in forward
    return super().forward(
  File "/home/ma-user/anaconda3/envs/llava/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 1186, in forward
    outputs = self.model(
  File "/home/ma-user/anaconda3/envs/llava/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ma-user/anaconda3/envs/llava/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1568, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/home/ma-user/anaconda3/envs/llava/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 1063, in forward
    layer_outputs = self._gradient_checkpointing_func(
  File "/home/ma-user/anaconda3/envs/llava/lib/python3.9/site-packages/torch/_compile.py", line 24, in inner
    return torch._dynamo.disable(fn, recursive)(*args, **kwargs)
  File "/home/ma-user/anaconda3/envs/llava/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py", line 328, in _fn
    return fn(*args, **kwargs)
  File "/home/ma-user/anaconda3/envs/llava/lib/python3.9/site-packages/torch/_dynamo/external_utils.py", line 17, in inner
    return fn(*args, **kwargs)
  File "/home/ma-user/anaconda3/envs/llava/lib/python3.9/site-packages/torch/utils/checkpoint.py", line 451, in checkpoint
    return CheckpointFunction.apply(function, preserve, *args)
  File "/home/ma-user/anaconda3/envs/llava/lib/python3.9/site-packages/torch/autograd/function.py", line 539, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/home/ma-user/anaconda3/envs/llava/lib/python3.9/site-packages/torch/utils/checkpoint.py", line 230, in forward
    outputs = run_function(*args)
  File "/home/ma-user/anaconda3/envs/llava/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ma-user/anaconda3/envs/llava/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1568, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/home/ma-user/anaconda3/envs/llava/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 801, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/home/ma-user/anaconda3/envs/llava/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ma-user/anaconda3/envs/llava/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1568, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/home/ma-user/anaconda3/envs/llava/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 709, in forward
    query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)
  File "/home/ma-user/anaconda3/envs/llava/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 237, in apply_rotary_pos_emb
    q_embed = (q * cos) + (rotate_half(q) * sin)
RuntimeError: The size of tensor a (32) must match the size of tensor b (1417) at non-singleton dimension 3

RuntimeError: The size of tensor a (32) must match the size of tensor b (1284) at non-singleton dimension 3

RuntimeError: The size of tensor a (32) must match the size of tensor b (1110) at non-singleton dimension 3

@zkyntu Hi. Still the same shape mismatch problem during pretraining. I run the pretraining scipt by:

source /home/ma-user/Ascend/ascend-toolkit/set_env.sh

deepspeed llava/train/train_mem_npu.py \
    --deepspeed ./scripts/zero2.json \
    --model_name_or_path lmsys/vicuna-7b-v1.5 \
    --version plain \
    --data_path ./playground/data/LLaVA-Pretrain/blip_laion_cc_sbu_558k.json \
    --image_folder ./playground/data/LLaVA-Pretrain/images \
    --vision_tower openai/clip-vit-large-patch14-336 \
    --mm_projector_type mlp2x_gelu \
    --tune_mm_mlp_adapter True \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --bf16 True \
    --output_dir ./checkpoints/llava-v1.5-7b-pretrain \
    --num_train_epochs 1 \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 24000 \
    --save_total_limit 1 \
    --learning_rate 1e-3 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 False \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --dataloader_num_workers 4 \
    --lazy_preprocess True \
    --report_to wandb

and the train_mem_npu.py follow your train_npu.py:

from llava.train.llama_npu_monkey_patch import (
    replace_with_torch_npu_flash_attention,
    replace_with_torch_npu_rmsnorm
)

replace_with_torch_npu_flash_attention()
replace_with_torch_npu_rmsnorm()

from llava.train.train import train
import torch_npu
from torch_npu.contrib import transfer_to_npu

if __name__ == "__main__":
    train()

Also, I did not download openai/clip-vit-large-patch14-336 and lmsys/vicuna-7b-v1.5, but I notice that the code is loading some checkpoint shards, which makes me very confused

Loading checkpoint shards:   0%|                                                                                                         | 0/2 [00:00<?, ?it/s]/home/ma-user/anaconda3/envs/llava/lib/python3.9/site-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  return self.fget.__get__(instance, owner)()
Loading checkpoint shards:   0%|                                                                                                         | 0/2 [00:00<?, ?it/s]/home/ma-user/anaconda3/envs/llava/lib/python3.9/site-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  return self.fget.__get__(instance, owner)()
Loading checkpoint shards:   0%|                                                                                                         | 0/2 [00:00<?, ?it/s]/home/ma-user/anaconda3/envs/llava/lib/python3.9/site-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  return self.fget.__get__(instance, owner)()
Loading checkpoint shards:   0%|                                                                                                         | 0/2 [00:00<?, ?it/s]/home/ma-user/anaconda3/envs/llava/lib/python3.9/site-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  return self.fget.__get__(instance, owner)()
Loading checkpoint shards:   0%|                                                                                                         | 0/2 [00:00<?, ?it/s]/home/ma-user/anaconda3/envs/llava/lib/python3.9/site-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  return self.fget.__get__(instance, owner)()
Loading checkpoint shards:   0%|                                                                                                         | 0/2 [00:00<?, ?it/s]/home/ma-user/anaconda3/envs/llava/lib/python3.9/site-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  return self.fget.__get__(instance, owner)()
Loading checkpoint shards:   0%|                                                                                                         | 0/2 [00:00<?, ?it/s]/home/ma-user/anaconda3/envs/llava/lib/python3.9/site-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  return self.fget.__get__(instance, owner)()
Loading checkpoint shards:   0%|                                                                                                         | 0/2 [00:00<?, ?it/s]/home/ma-user/anaconda3/envs/llava/lib/python3.9/site-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  return self.fget.__get__(instance, owner)()
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:13<00:00,  6.93s/it]
/home/ma-user/anaconda3/envs/llava/lib/python3.9/site-packages/transformers/generation/configuration_utils.py:392: UserWarning: `do_sample` is set to `False`. However, `temperature` is set to `0.9` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`. This was detected when initializing the generation config instance, which means the corresponding file may hold incorrect parameterization and should be fixed.
  warnings.warn(
/home/ma-user/anaconda3/envs/llava/lib/python3.9/site-packages/transformers/generation/configuration_utils.py:397: UserWarning: `do_sample` is set to `False`. However, `top_p` is set to `0.6` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `top_p`. This was detected when initializing the generation config instance, which means the corresponding file may hold incorrect parameterization and should be fixed.
  warnings.warn(
model_args.model_name_or_path: lmsys/vicuna-7b-v1.5
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:16<00:00,  8.43s/it]
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:16<00:00,  8.37s/it]
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:15<00:00,  7.63s/it]
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:15<00:00,  7.57s/it]
model_args.vision_tower: openai/clip-vit-large-patch14-336
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:16<00:00,  8.33s/it]
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:16<00:00,  8.02s/it]

zkyredstart / LLaVA-NPU

[Usage] shape error raised by apply_rary_pos_emb #1

Describe the issue