zhangfaen / finetune-Qwen2-VL

MIT License

Error when running with flash_attn #2

Closed lonngxiang closed 2 months ago

lonngxiang commented 2 months ago

out, q, k, v, out_padded, softmax_lse, S_dmask, rng_state = flash_attn_cuda.fwd(
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [298,0,0], thread: [64,0,0] Assertion -sizes[i] <= index && index < sizes[i] && "index out of bounds" failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [298,0,0], thread: [65,0,0] Assertion -sizes[i] <= index && index < sizes[i] && "index out of bounds" failed.

lonngxiang commented 2 months ago

flash-attn 2.6.3

lonngxiang commented 2 months ago

(screenshot)

zhangfaen commented 2 months ago

I tried training with flash-attention 2 and got a similar error, so I didn't claim that this repo supports flash-attention 2. If you find out how to support it, a PR is welcome!

zhangfaen commented 2 months ago

I debugged what goes wrong when flash_attention_2 is enabled in finetune.py.

Conclusion: fixed, see my latest commit https://github.com/zhangfaen/finetune-Qwen2-VL/commit/ff383f7b03b3beb5e8c5ca83c1ddc004ca7a2eec

How:

  1. The root cause was a bug in src/transformers/models/qwen2_vl/modeling_qwen2_vl.py
  2. The Qwen team fixed that bug, see https://github.com/huggingface/transformers/commit/21fac7abba2a37fae86106f87fcf9974fd1e3830

Solution:

  1. Get the latest version of this repo: git pull https://github.com/zhangfaen/finetune-Qwen2-VL/
  2. pip uninstall transformers
  3. pip install -r requirements.txt
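
For reference, here is a minimal sketch of loading the model with flash_attention_2 enabled once the updated transformers is installed. The base model name, dtype, and device placement below are assumptions based on the standard transformers API and the Qwen2-VL-2B-Instruct model discussed later in this thread, not part of the fix itself:

```python
# Minimal sketch (assumption: Qwen/Qwen2-VL-2B-Instruct as the base model, bf16 weights).
import torch
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct",
    torch_dtype=torch.bfloat16,               # bf16 halves weight memory vs. fp32
    attn_implementation="flash_attention_2",  # requires the flash-attn package to be installed
    device_map="cuda",
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")
```
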
zhangfaen commented 2 months ago

I'm closing this issue. If you still have problems, feel free to re-open it.

lonngxiang commented 2 months ago

Got it, training works now, though GPU memory usage is still not low. Nice project, fine-tuning with native PyTorch.

Guangming92 commented 2 months ago

@lonngxiang May I ask which GPU you are using? I'm on a 4090 with 24 GB of VRAM and get an out-of-memory error after one iteration. How large is your dataset? Could we compare notes?

lonngxiang commented 2 months ago

After reinstalling the new transformers, training works. Also on a 4090; loading is just very slow.

Guangming92 commented 2 months ago

> After reinstalling the new transformers, training works. Also on a 4090; loading is just very slow.

I followed the instructions above and reinstalled the libraries from requirements.txt. I did previously hit the same error as you; after reinstalling, the index-out-of-bounds error is gone, but now I get an out-of-memory error and the program exits. The training data is exactly the demo data, and I added torch_dtype=torch.bfloat16, attn_implementation='flash_attention_2' when loading the model. (screenshot) Could you tell me what you changed when you ran it? Is the base model Qwen2-VL-2B-Instruct?

lonngxiang commented 2 months ago

Yes. I just set the batch size to 1. (screenshot)
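
For anyone else hitting the out-of-memory error on a 24 GB card, here is a hedged sketch of the usual memory-saving knobs, with batch size 1 as mentioned above. The dataset, collate_fn, and training-loop wiring are hypothetical placeholders for illustration, not the repo's actual finetune.py code:

```python
import torch
from torch.utils.data import DataLoader

# Hypothetical dataset and collate_fn names, for illustration only.
train_loader = DataLoader(train_dataset, batch_size=1, shuffle=True, collate_fn=collate_fn)

# Optional extra savings beyond batch_size=1 (standard techniques, not confirmed by this thread):
model.gradient_checkpointing_enable()    # trade recompute time for activation memory

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
accumulation_steps = 8                   # simulate a larger effective batch size
for step, batch in enumerate(train_loader):
    loss = model(**batch).loss / accumulation_steps
    loss.backward()
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```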

476258834lzx commented 2 months ago

(screenshot)

Hello, may I ask how you solved this?

476258834lzx commented 2 months ago

> I followed the instructions above and reinstalled the libraries from requirements.txt ... Is the base model Qwen2-VL-2B-Instruct?

Hello, may I ask, did you get it running?