Open liuruijin17 opened 3 days ago
@liuruijin17 Hi, this is my second account. I suggest that you first run the pretraining and then run the finetuning. I find that directly using the pretrained weights from the official repo is wrong for Ascend devices.
@zkyntu Hi. Still the same shape-mismatch problem during pretraining. I run the pretraining script with:
source /home/ma-user/Ascend/ascend-toolkit/set_env.sh
deepspeed llava/train/train_mem_npu.py \
--deepspeed ./scripts/zero2.json \
--model_name_or_path lmsys/vicuna-7b-v1.5 \
--version plain \
--data_path ./playground/data/LLaVA-Pretrain/blip_laion_cc_sbu_558k.json \
--image_folder ./playground/data/LLaVA-Pretrain/images \
--vision_tower openai/clip-vit-large-patch14-336 \
--mm_projector_type mlp2x_gelu \
--tune_mm_mlp_adapter True \
--mm_vision_select_layer -2 \
--mm_use_im_start_end False \
--mm_use_im_patch_token False \
--bf16 True \
--output_dir ./checkpoints/llava-v1.5-7b-pretrain \
--num_train_epochs 1 \
--per_device_train_batch_size 8 \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps 4 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 24000 \
--save_total_limit 1 \
--learning_rate 1e-3 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--tf32 False \
--model_max_length 2048 \
--gradient_checkpointing True \
--dataloader_num_workers 4 \
--lazy_preprocess True \
--report_to wandb
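As a side note, the effective global batch size implied by these flags depends on the number of NPUs. A quick sanity check, assuming one full 8-device 910B node (the device count is my assumption, not stated in the thread):

```python
# Sanity-check the effective global batch size implied by the training flags.
# num_devices = 8 is an assumption (one 8-NPU node); adjust to your setup.
per_device_train_batch_size = 8
gradient_accumulation_steps = 4
num_devices = 8  # hypothetical: one full 910B node

global_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)
print(global_batch_size)  # LLaVA-1.5 pretraining normally targets a global batch of 256
```

If you run on fewer devices, scaling `gradient_accumulation_steps` up keeps the global batch the same.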
and the train_mem_npu.py follows your train_npu.py:
from llava.train.llama_npu_monkey_patch import (
    replace_with_torch_npu_flash_attention,
    replace_with_torch_npu_rmsnorm,
)

replace_with_torch_npu_flash_attention()
replace_with_torch_npu_rmsnorm()

from llava.train.train import train
import torch_npu
from torch_npu.contrib import transfer_to_npu

if __name__ == "__main__":
    train()
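Note that the ordering above matters: the NPU monkey patches must run before `llava.train.train` is imported, so that nothing binds a reference to the original attention/RMSNorm implementations first. A toy sketch of the patch-before-use pattern (illustrative names only, not the real LLaVA internals):

```python
# Toy illustration of monkey patching: replace an attribute on a class
# before anything calls it, so all later lookups see the replacement.
class Attention:
    def forward(self):
        return "eager attention"

def npu_fused_forward(self):
    return "npu fused attention"

# Patch first ...
Attention.forward = npu_fused_forward

# ... so every instance, old or new, now dispatches to the fused kernel stub.
layer = Attention()
print(layer.forward())  # -> npu fused attention
```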
Also, I did not download openai/clip-vit-large-patch14-336 or lmsys/vicuna-7b-v1.5, but I notice that the code is loading some checkpoint shards, which confuses me:
Loading checkpoint shards: 0%| | 0/2 [00:00<?, ?it/s]/home/ma-user/anaconda3/envs/llava/lib/python3.9/site-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
return self.fget.__get__(instance, owner)()
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:13<00:00, 6.93s/it]
/home/ma-user/anaconda3/envs/llava/lib/python3.9/site-packages/transformers/generation/configuration_utils.py:392: UserWarning: `do_sample` is set to `False`. However, `temperature` is set to `0.9` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`. This was detected when initializing the generation config instance, which means the corresponding file may hold incorrect parameterization and should be fixed.
warnings.warn(
/home/ma-user/anaconda3/envs/llava/lib/python3.9/site-packages/transformers/generation/configuration_utils.py:397: UserWarning: `do_sample` is set to `False`. However, `top_p` is set to `0.6` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `top_p`. This was detected when initializing the generation config instance, which means the corresponding file may hold incorrect parameterization and should be fixed.
warnings.warn(
model_args.model_name_or_path: lmsys/vicuna-7b-v1.5
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:16<00:00, 8.43s/it]
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:16<00:00, 8.37s/it]
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:15<00:00, 7.63s/it]
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:15<00:00, 7.57s/it]
model_args.vision_tower: openai/clip-vit-large-patch14-336
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:16<00:00, 8.33s/it]
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:16<00:00, 8.02s/it]
@liuruijin17 Hi, the code will download the CLIP and Vicuna weights automatically, but I suggest downloading the weights manually.
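On the automatic download: `from_pretrained` treats the argument as a local directory if one exists at that path, and otherwise resolves it as a Hugging Face Hub repo id and downloads the shards into the local cache — which is why the log shows "Loading checkpoint shards" even though nothing was fetched manually. A minimal sketch of that dispatch logic (my own simplification, not the actual transformers code):

```python
import os

def resolve_model_source(name_or_path: str) -> str:
    """Rough sketch of how `from_pretrained` decides where weights come from."""
    if os.path.isdir(name_or_path):
        # A local checkout, e.g. ./checkpoints/vicuna-7b-v1.5, is used as-is.
        return "local directory"
    # Anything else that looks like "org/name" is treated as a Hub repo id
    # and fetched into the cache (~/.cache/huggingface by default).
    return "hub download"

print(resolve_model_source("lmsys/vicuna-7b-v1.5"))  # hub download, unless that dir exists locally
```

Downloading the weights once and passing the local directory path to `--model_name_or_path` and `--vision_tower` avoids the implicit download entirely.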
I have a way to test the code: 1) download the official LLaVA-1.5 weights and then run the evaluation or inference. I find this problem is in the model architecture. If you have any questions, feel free to ask.
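For the shape mismatch itself, it can help to diff the checkpoint's parameter shapes against the freshly built model before loading. A stdlib-only sketch of that comparison (the helper name and the toy shapes are mine; in practice you would pass the two `state_dict()`s with `tensor.shape` values):

```python
def find_shape_mismatches(ckpt_shapes, model_shapes):
    """Return {name: (ckpt_shape, model_shape)} for params whose shapes differ."""
    mismatches = {}
    for name, shape in model_shapes.items():
        if name in ckpt_shapes and ckpt_shapes[name] != shape:
            mismatches[name] = (ckpt_shapes[name], shape)
    return mismatches

# Toy example: the projector input dim in the checkpoint disagrees with the model.
ckpt = {"mm_projector.0.weight": (4096, 1024), "lm_head.weight": (32000, 4096)}
model = {"mm_projector.0.weight": (4096, 1152), "lm_head.weight": (32000, 4096)}
print(find_shape_mismatches(ckpt, model))
# -> {'mm_projector.0.weight': ((4096, 1024), (4096, 1152))}
```

Whichever parameter shows up here points at the config mismatch (e.g. wrong vision tower resolution or projector type) rather than at the Ascend port itself.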
Describe the issue
Issue:
Hello, big thanks for your great work. I was trying to run on 910B NPUs and have followed the installation instructions, but I encountered the errors below. Please let me know if you need more information.
Command:
Log: