zhangfaen / finetune-Qwen2-VL

MIT License

Error when running with flash_attn #2

Closed lonngxiang closed 2 months ago

lonngxiang commented 2 months ago

out, q, k, v, out_padded, softmax_lse, S_dmask, rng_state = flash_attn_cuda.fwd(
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [298,0,0], thread: [64,0,0] Assertion -sizes[i] <= index && index < sizes[i] && "index out of bounds" failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [298,0,0], thread: [65,0,0] Assertion -sizes[i] <= index && index < sizes[i] && "index out of bounds" failed.

lonngxiang commented 2 months ago

flash-attn 2.6.3

lonngxiang commented 2 months ago

(screenshot)

zhangfaen commented 2 months ago

I tried training with flash-attention 2 and got a similar error, so I didn't claim that this repo supports flash-attention 2. If you find out how to support it, a PR is welcome!

zhangfaen commented 2 months ago

I debugged what goes wrong when flash_attention_2 is enabled in finetune.py.

Conclusion: fixed, see my latest commit https://github.com/zhangfaen/finetune-Qwen2-VL/commit/ff383f7b03b3beb5e8c5ca83c1ddc004ca7a2eec

How:

  1. The root cause was a bug in src/transformers/models/qwen2_vl/modeling_qwen2_vl.py
  2. The Qwen team fixed that bug, see https://github.com/huggingface/transformers/commit/21fac7abba2a37fae86106f87fcf9974fd1e3830

Solution:

  1. Get the latest version of this repo: git pull https://github.com/zhangfaen/finetune-Qwen2-VL/
  2. pip uninstall transformers
  3. pip install -r requirements.txt
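
For reference, here is a minimal sketch of loading the model with flash_attention_2 enabled once the updated transformers is installed. The base model name, dtype, and device placement below are assumptions based on the standard transformers API and the Qwen2-VL-2B-Instruct model discussed later in this thread, not part of the fix itself:

```python
# Minimal sketch (assumption: Qwen/Qwen2-VL-2B-Instruct as the base model, bf16 weights).
import torch
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct",
    torch_dtype=torch.bfloat16,               # bf16 halves weight memory vs. fp32
    attn_implementation="flash_attention_2",  # requires the flash-attn package to be installed
    device_map="cuda",
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")
```
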
zhangfaen commented 2 months ago

I'm closing this issue. If you still have problems, feel free to re-open it.

lonngxiang commented 2 months ago

Got it, training works now, though GPU memory usage is still not low. Nice project, fine-tuning with native PyTorch.

Guangming92 commented 2 months ago

@lonngxiang May I ask which GPU you are using? I'm on a 4090 with 24 GB of VRAM and get an out-of-memory error after one iteration. How large is your dataset? Could we compare notes?

lonngxiang commented 2 months ago

After reinstalling the new transformers, training works. Also on a 4090; loading is just very slow.

Guangming92 commented 2 months ago

> After reinstalling the new transformers, training works. Also on a 4090; loading is just very slow.

I followed the instructions above and reinstalled the libraries from requirements.txt. I did previously hit the same error as you; after reinstalling, the index-out-of-bounds error is gone, but now I get an out-of-memory error and the program exits. The training data is exactly the demo data, and I added torch_dtype=torch.bfloat16, attn_implementation='flash_attention_2' when loading the model. (screenshot) Could you tell me what you changed when you ran it? Is the base model Qwen2-VL-2B-Instruct?

lonngxiang commented 2 months ago

Yes. I just set the batch size to 1. (screenshot)
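
For anyone else hitting the out-of-memory error on a 24 GB card, here is a hedged sketch of the usual memory-saving knobs, with batch size 1 as mentioned above. The dataset, collate_fn, and training-loop wiring are hypothetical placeholders for illustration, not the repo's actual finetune.py code:

```python
import torch
from torch.utils.data import DataLoader

# Hypothetical dataset and collate_fn names, for illustration only.
train_loader = DataLoader(train_dataset, batch_size=1, shuffle=True, collate_fn=collate_fn)

# Optional extra savings beyond batch_size=1 (standard techniques, not confirmed by this thread):
model.gradient_checkpointing_enable()    # trade recompute time for activation memory

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
accumulation_steps = 8                   # simulate a larger effective batch size
for step, batch in enumerate(train_loader):
    loss = model(**batch).loss / accumulation_steps
    loss.backward()
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```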

476258834lzx commented 2 months ago

(screenshot)

Hello, may I ask how you solved this?

476258834lzx commented 2 months ago

> I followed the instructions above and reinstalled the libraries from requirements.txt ... Is the base model Qwen2-VL-2B-Instruct?

Hello, may I ask, did you get it running?