ztxz16 / fastllm

A pure C++ cross-platform LLM acceleration library with Python bindings; ChatGLM-6B-class models can reach 10,000+ tokens/s on a single GPU. Supports GLM, LLaMA, and MOSS base models and runs smoothly on mobile devices.
Apache License 2.0

Request to Support FlashAttention and Multi-Query Attention Mechanisms #233

Open junior-zsy opened 1 year ago

junior-zsy commented 1 year ago

I hope this message finds you well. First off, thank you for providing such an incredible project on large model inference. I've been utilizing it extensively and it's been instrumental for many of my tasks.

However, I have recently been working with two attention mechanisms, namely FlashAttention and Multi-Query Attention. Both have proven to be highly efficient and effective across a range of tasks, further enhancing the capability of transformer models. It would be great if fastllm could support them.
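For reference, Multi-Query Attention shares a single K/V head across all query heads, which shrinks the KV cache by a factor of the head count. Below is a minimal, single-threaded C++ sketch of that idea for one decoding step; the buffer layout and function name are illustrative only, not fastllm's actual API.

```cpp
// Minimal Multi-Query Attention sketch for one decoding step.
// Assumption: row-major float buffers, caller pre-sizes `out`.
#include <algorithm>
#include <cmath>
#include <vector>

// q:              [numHeads * headDim]  one query vector per head
// kCache, vCache: [seqLen   * headDim]  a single shared K/V head (the MQA idea)
// out:            [numHeads * headDim]
void MultiQueryAttention(const std::vector<float>& q,
                         const std::vector<float>& kCache,
                         const std::vector<float>& vCache,
                         std::vector<float>& out,
                         int numHeads, int headDim, int seqLen) {
    const float scale = 1.0f / std::sqrt((float)headDim);
    std::vector<float> scores(seqLen);
    for (int h = 0; h < numHeads; h++) {
        const float* qh = &q[h * headDim];
        // 1. scaled dot-product scores against the *shared* key cache
        float maxScore = -1e30f;
        for (int t = 0; t < seqLen; t++) {
            float s = 0.0f;
            for (int d = 0; d < headDim; d++) s += qh[d] * kCache[t * headDim + d];
            scores[t] = s * scale;
            maxScore = std::max(maxScore, scores[t]);
        }
        // 2. softmax (subtract max for numerical stability)
        float sum = 0.0f;
        for (int t = 0; t < seqLen; t++) {
            scores[t] = std::exp(scores[t] - maxScore);
            sum += scores[t];
        }
        // 3. weighted sum over the *shared* value cache
        float* oh = &out[h * headDim];
        for (int d = 0; d < headDim; d++) oh[d] = 0.0f;
        for (int t = 0; t < seqLen; t++) {
            float w = scores[t] / sum;
            for (int d = 0; d < headDim; d++) oh[d] += w * vCache[t * headDim + d];
        }
    }
}
```

Because kCache/vCache hold a single head instead of numHeads heads, the per-token cache footprint drops from numHeads * headDim to headDim floats, which is the main memory win MQA brings to inference.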

eigen2017 commented 1 year ago

I have the same wishes: paged attention, fused multi-head attention, flash attention...

I discussed this with the author of fastllm; he said paged attention will be added in the future.
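For context, the paged-attention idea (as in vLLM's PagedAttention) stores the KV cache in fixed-size blocks and keeps a per-sequence block table mapping logical token positions to physical blocks, so sequences can grow without large contiguous reallocations. A rough C++ sketch of just the cache bookkeeping, with hypothetical names and layout (not fastllm's):

```cpp
// Sketch of a paged KV cache for one sequence and one K (or V) head.
// Assumption: blockSize tokens per block, headDim floats per token.
#include <vector>

struct PagedKVCache {
    int blockSize;                            // tokens per physical block, e.g. 16
    int headDim;                              // per-token K (or V) width
    std::vector<std::vector<float>> blocks;   // physical blocks
    std::vector<int> blockTable;              // logical block index -> physical block id

    PagedKVCache(int blockSize, int headDim) : blockSize(blockSize), headDim(headDim) {}

    // Returns a pointer to the slot for token `pos`, allocating a block on demand.
    float* SlotForToken(int pos) {
        int logicalBlock = pos / blockSize;
        while ((int)blockTable.size() <= logicalBlock) {
            blocks.emplace_back(blockSize * headDim, 0.0f);  // allocate one physical block
            blockTable.push_back((int)blocks.size() - 1);
        }
        int phys = blockTable[logicalBlock];
        return &blocks[phys][(pos % blockSize) * headDim];
    }
};
```

The attention kernel then gathers K/V through the block table instead of assuming one contiguous buffer, which is what lets the runtime pack many sequences into a fixed memory pool.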

eigen2017 commented 1 year ago

We can also track this thread: https://github.com/ztxz16/fastllm/issues/150

Or we can implement it ourselves.