zhangfaen / finetune-Qwen2-VL

MIT License

Isn't this code's fine-tuning LoRA fine-tuning? Which layers does it actually fine-tune? #1

Closed lonngxiang closed 2 months ago

zhangfaen commented 2 months ago

This code fine-tunes all of the model's parameters. If you only want to fine-tune part of the model, you can add a few lines of code to freeze some parameters so they are not updated, for example these lines from https://github.com/zhangfaen/finetune-InternVL2/blob/main/train.py:

model.vision_model.requires_grad_(False)    # freeze the parameters in this module; they will not be updated
model.language_model.requires_grad_(False)  # freeze the parameters in this module; they will not be updated

logger.info(f"total params for Lora training: {sum(p.numel() for p in model.parameters())}")
logger.info(f"total trainable params for Lora training: {sum(p.numel() for p in model.parameters() if p.requires_grad)}")

optimizer = AdamW(model.parameters(), lr=lr)
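
One small refinement of the snippet above (a sketch; model and lr are assumed to be defined as in train.py): after freezing modules, it is idiomatic to hand the optimizer only the still-trainable parameters.

from torch.optim import AdamW

# Only parameters with requires_grad=True will receive gradients,
# so pass just those to the optimizer.
trainable_params = [p for p in model.parameters() if p.requires_grad]
optimizer = AdamW(trainable_params, lr=lr)
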
lonngxiang commented 2 months ago

OK, thanks. This code throws an error when fine-tuning on a 4090. Can a 4090 really not fine-tune the 2B model? python finetune.py

[error screenshot]

zhangfaen commented 2 months ago

Not enough GPU memory... As I recall, with bf16 and batch size 1, it used around 40 GB of VRAM.

zhangfaen commented 2 months ago

You can try swapping AdamW for SGD; that should also use much less memory. You can also change this line:

processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct", min_pixels=256*28*28, max_pixels=512*28*28, padding_side="right")

and lower 256 and 512 to smaller numbers, i.e., images will be downscaled more aggressively.
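
A sketch of both suggestions together (the 128/256 pixel budgets below are illustrative values, not ones from the thread):

from torch.optim import SGD
from transformers import AutoProcessor

# Smaller pixel budgets -> images are downscaled harder -> fewer vision tokens per image.
processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct",
    min_pixels=128 * 28 * 28,  # was 256 * 28 * 28
    max_pixels=256 * 28 * 28,  # was 512 * 28 * 28
    padding_side="right",
)

# Plain SGD (no momentum) keeps no per-parameter optimizer state,
# unlike AdamW's two moment buffers.
optimizer = SGD(model.parameters(), lr=1e-5)
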

zhangfaen commented 2 months ago

The latest version of this repo can fine-tune with flash-attention-2, which saves memory. Try it. I'll close this issue now.
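
For reference, enabling FlashAttention-2 when loading the model usually looks like this in transformers (a sketch; it assumes the flash-attn package is installed and your transformers version supports the attn_implementation argument):

import torch
from transformers import Qwen2VLForConditionalGeneration

# bf16 weights plus FlashAttention-2 kernels cut attention memory during training.
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
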

lonngxiang commented 2 months ago

The Qwen2-VL interface doesn't seem to have vision_model or language_model.

[screenshot of the model object]

zhangfaen commented 2 months ago

(Pdb++) model

# Qwen2VLForConditionalGeneration(
#   (visual): Qwen2VisionTransformerPretrainedModel(
#     (patch_embed): PatchEmbed(
#       (proj): Conv3d(3, 1280, kernel_size=(2, 14, 14), stride=(2, 14, 14), bias=False)
#     )
#     (rotary_pos_emb): VisionRotaryEmbedding()
#     (blocks): ModuleList(
#       (0-31): 32 x Qwen2VLVisionBlock(
#         (norm1): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
#         (norm2): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
#         (attn): VisionSdpaAttention(
#           (qkv): Linear(in_features=1280, out_features=3840, bias=True)
#           (proj): Linear(in_features=1280, out_features=1280, bias=True)
#         )
#         (mlp): VisionMlp(
#           (fc1): Linear(in_features=1280, out_features=5120, bias=True)
#           (act): QuickGELUActivation()
#           (fc2): Linear(in_features=5120, out_features=1280, bias=True)
#         )
#       )
#     )
#     (merger): PatchMerger(
#       (ln_q): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
#       (mlp): Sequential(
#         (0): Linear(in_features=5120, out_features=5120, bias=True)
#         (1): GELU(approximate='none')
#         (2): Linear(in_features=5120, out_features=1536, bias=True)
#       )
#     )
#   )
#   (model): Qwen2VLModel(
#     (embed_tokens): Embedding(151936, 1536)
#     (layers): ModuleList(
#       (0-27): 28 x Qwen2VLDecoderLayer(
#         (self_attn): Qwen2VLSdpaAttention(
#           (q_proj): Linear(in_features=1536, out_features=1536, bias=True)
#           (k_proj): Linear(in_features=1536, out_features=256, bias=True)
#           (v_proj): Linear(in_features=1536, out_features=256, bias=True)
#           (o_proj): Linear(in_features=1536, out_features=1536, bias=False)
#           (rotary_emb): Qwen2RotaryEmbedding()
#         )
#         (mlp): Qwen2MLP(
#           (gate_proj): Linear(in_features=1536, out_features=8960, bias=False)
#           (up_proj): Linear(in_features=1536, out_features=8960, bias=False)
#           (down_proj): Linear(in_features=8960, out_features=1536, bias=False)
#           (act_fn): SiLU()
#         )
#         (input_layernorm): Qwen2RMSNorm((1536,), eps=1e-06)
#         (post_attention_layernorm): Qwen2RMSNorm((1536,), eps=1e-06)
#       )
#     )
#     (norm): Qwen2RMSNorm((1536,), eps=1e-06)
#   )
#   (lm_head): Linear(in_features=1536, out_features=151936, bias=False)
# )

Above is the printout of the Qwen2-VL model.

You can use model.visual or model.visual.patch_embed to access sub-modules of the model and set requires_grad.
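
A minimal sketch applying this to Qwen2-VL (module names are taken from the printout above):

# Freeze the whole vision tower; the language model and lm_head stay trainable.
model.visual.requires_grad_(False)

# Or freeze only the patch embedding:
# model.visual.patch_embed.requires_grad_(False)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable params after freezing: {trainable}")
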

lonngxiang commented 2 months ago

(quoting the model printout from the previous reply)

And which module corresponds to the language model?