Closed prattcmp closed 5 months ago
Hi @prattcmp, thanks for filing the issue.
It's because modeling_qwen.py doesn't handle a None attention_mask when using the SDPA backend: https://huggingface.co/Alibaba-NLP/gte-Qwen2-7B-instruct/blob/1efef6cb8e5b06824152b8fa2a42e762bd4a3571/modeling_qwen.py#L1038
One workaround would be to add a parameter here to specify the flash-attn backend: https://github.com/wejoncy/QLLM/blob/7ecb24b9c53b0ba7b46c140457170b44682e631a/qllm/modeling/base.py#L176
```python
llm = AutoModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path,
    torch_dtype=torch.float16,
    trust_remote_code=trust_remote_code,
    attn_implementation="flash_attention_2",
)
```
Or fix this function in modeling_qwen.py so it supports a None mask: https://huggingface.co/Alibaba-NLP/gte-Qwen2-7B-instruct/blob/1efef6cb8e5b06824152b8fa2a42e762bd4a3571/modeling_qwen.py#L1038

```python
attention_mask = _prepare_4d_attention_mask_for_sdpa(
    attention_mask, inputs_embeds.dtype
)
```
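A minimal sketch of what such a fix could look like, assuming the intent is simply to skip mask preparation when no mask is given (this is not the upstream patch):

```python
# Sketch only: build the 4D SDPA mask only when a mask is actually provided,
# so a None attention_mask falls through to SDPA's default behavior.
if attention_mask is not None:
    attention_mask = _prepare_4d_attention_mask_for_sdpa(
        attention_mask, inputs_embeds.dtype
    )
```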
Or fix the calibration call here so that it passes the attention_mask: https://github.com/wejoncy/QLLM/blob/7ecb24b9c53b0ba7b46c140457170b44682e631a/qllm/quantization/quant_frame_base.py#L90

```python
try:  # noqa:SIM105
    model(batch[0].to(dev), batch[1].to(dev))
except ValueError:
    pass
```
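For illustration, a hedged sketch of that option, assuming batch[1] holds the attention mask (the actual tensor layout in quant_frame_base.py may differ):

```python
# Sketch only: pass the mask by keyword so the model's forward() receives it
# as attention_mask regardless of positional-argument order.
try:  # noqa:SIM105
    model(batch[0].to(dev), attention_mask=batch[1].to(dev))
except ValueError:
    pass
```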
I created a fix PR that supports specifying use_flash_attn via the environment variable USE_FLASH_ATTN=1.
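A minimal sketch of the idea, assuming the variable simply switches attn_implementation at load time (the actual PR may wire this differently):

```python
import os

import torch
from transformers import AutoModelForCausalLM

# Placeholder values taken from this issue.
pretrained_model_name_or_path = "Alibaba-NLP/gte-Qwen2-7B-instruct"
trust_remote_code = True

# Sketch only: pick the attention backend from an environment variable.
use_flash_attn = os.environ.get("USE_FLASH_ATTN", "0") == "1"

llm = AutoModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path,
    torch_dtype=torch.float16,
    trust_remote_code=trust_remote_code,
    attn_implementation="flash_attention_2" if use_flash_attn else "sdpa",
)
```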
Wow, how did you untangle that so quickly?
The env variable is great. It would also be nice to have it as a CLI flag. Is there any regression risk if the flag defaults to true/enabled?
If use_flash_attention were enabled by default, it would require users to install the flash-attn package. However, flash-attn requires sm >= 8.0, which rules out GPUs like the V100 and older.
Besides, Transformers selects an appropriate attention backend automatically (eager/sdpa/flash-attn), and the first two work well on all GPUs.
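For context, a small illustrative check of that hardware constraint (not part of QLLM), assuming a single CUDA device:

```python
import torch

# flash-attn needs compute capability sm >= 8.0 (Ampere or newer);
# a V100 is sm 7.0, so it has to fall back to sdpa or eager.
def pick_attn_implementation() -> str:
    if torch.cuda.is_available() and torch.cuda.get_device_capability(0) >= (8, 0):
        return "flash_attention_2"  # also requires the flash-attn package
    return "sdpa"
```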
The library unfortunately isn't working with the Alibaba-NLP/gte-Qwen2-7B-instruct Transformers model.

```
python -m qllm --model Alibaba-NLP/gte-Qwen2-7B-instruct --method gptq --save ./gte-Qwen2-7B-4bit --export_onnx ./gte-Qwen2-7B-4bit_onnx --allow_mix_bits --true-sequential
```