ztxz16 / fastllm

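A pure C++ cross-platform LLM acceleration library with Python bindings. ChatGLM-6B-class models can reach 10000+ tokens/s on a single GPU. Supports GLM, LLaMA, and MOSS base models, and runs smoothly on mobile devices.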
Apache License 2.0
3.32k stars 340 forks

Possible improvement (Fastertransformer) #150

Open iamfaith opened 1 year ago

iamfaith commented 1 year ago

Thanks for your great project!

I was wondering whether some ideas from FasterTransformer could be incorporated into this project. For instance, following src/fastertransformer/layers/attention_layers/GptContextAttentionLayer.cc, you could consider combining the attention operations into a single layer, as GptContextAttentionLayer does, and wrapping each operation using cublas_wrapper_->SpGemm. This approach might improve the efficiency and speed of this project.
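To make concrete what "combining attention operations" could mean, here is a minimal CPU sketch for a single query vector. It is not how FasterTransformer or fastllm implement attention. The names `attention_unfused` and `attention_fused` are hypothetical, and the fused version uses an online-softmax formulation chosen purely for illustration. The unfused version mirrors the separate MatMulTransB, SoftMax, and MatMul steps and needs an intermediate score buffer. The fused version makes one pass over the keys and values.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

using Vec = std::vector<float>;
using Mat = std::vector<Vec>;  // row-major: one Vec per row

// Scaled dot product of the query with one key row.
static float scaled_dot(const Vec& q, const Vec& k) {
    float s = 0.0f;
    for (std::size_t j = 0; j < q.size(); ++j) s += q[j] * k[j];
    return s / std::sqrt(static_cast<float>(q.size()));
}

// Unfused reference: three separate steps with an intermediate score buffer.
Vec attention_unfused(const Vec& q, const Mat& K, const Mat& V) {
    assert(!K.empty() && K.size() == V.size());
    Vec scores(K.size());
    float max_s = -std::numeric_limits<float>::infinity();
    for (std::size_t i = 0; i < K.size(); ++i) {  // step 1: q * K^T
        scores[i] = scaled_dot(q, K[i]);
        max_s = std::max(max_s, scores[i]);
    }
    float sum = 0.0f;
    for (float& s : scores) {                     // step 2: softmax
        s = std::exp(s - max_s);
        sum += s;
    }
    Vec out(V[0].size(), 0.0f);
    for (std::size_t i = 0; i < V.size(); ++i)    // step 3: weights * V
        for (std::size_t j = 0; j < out.size(); ++j)
            out[j] += scores[i] / sum * V[i][j];
    return out;
}

// Fused sketch: one pass over K and V, no score buffer (online softmax).
Vec attention_fused(const Vec& q, const Mat& K, const Mat& V) {
    assert(!K.empty() && K.size() == V.size());
    float running_max = -std::numeric_limits<float>::infinity();
    float running_sum = 0.0f;
    Vec acc(V[0].size(), 0.0f);
    for (std::size_t i = 0; i < K.size(); ++i) {
        const float s = scaled_dot(q, K[i]);
        const float new_max = std::max(running_max, s);
        // Rescale what was accumulated under the old maximum.
        const float correction = std::exp(running_max - new_max);
        const float w = std::exp(s - new_max);
        running_sum = running_sum * correction + w;
        for (std::size_t j = 0; j < acc.size(); ++j)
            acc[j] = acc[j] * correction + w * V[i][j];
        running_max = new_max;
    }
    for (float& a : acc) a /= running_sum;        // final normalisation
    return acc;
}
```

Both functions should agree up to floating-point rounding. On a GPU, the potential gain from a fused kernel would come from fewer kernel launches and less intermediate memory traffic, which this CPU sketch does not measure.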

iamfaith commented 1 year ago

Here is a per-op benchmark on my device. Fusing some ops might improve the speed.

AVX: ON
AVX2: ON
AARCH64: OFF
Neon FP16: OFF
Neon DOT: OFF
------after warmup

Optype [Embedding] use 0.004840 s Optype [RMSNorm] use 0.000063 s Optype [Linear] use 0.000768 s Optype [Split] use 0.000073 s Optype [Split] use 0.000039 s Optype [PermuteSelf] use 0.000077 s Optype [PermuteSelf] use 0.000059 s Optype [PermuteSelf] use 0.000072 s Optype [MatMulTransB] use 0.023129 s Optype [MatMul] use 0.000048 s Optype [PermuteSelf] use 0.000077 s Optype [Linear] use 0.000060 s Optype [Linear] use 0.000491 s Optype [Swiglu] use 0.000452 s Optype [Linear] use 0.000092 s Optype [Linear] use 0.000054 s Optype [PermuteSelf] use 0.016285 s Optype [PermuteSelf] use 0.000043 s Optype [PermuteSelf] use 0.000049 s Optype [MatMulTransB] use 0.000029 s Optype [PermuteSelf] use 0.000031 s Optype [Linear] use 0.000060 s Optype [Linear] use 0.000055 s Optype [Linear] use 0.000052 s Optype [Linear] use 0.000049 s Optype [PermuteSelf] use 0.019234 s Optype [PermuteSelf] use 0.000049 s Optype [PermuteSelf] use 0.000031 s Optype [MatMulTransB] use 0.000042 s Optype [PermuteSelf] use 0.000033 s Optype [Linear] use 0.000048 s Optype [Linear] use 0.000054 s Optype [Linear] use 0.000052 s Optype [Linear] use 0.000061 s Optype [PermuteSelf] use 0.019022 s Optype [PermuteSelf] use 0.000048 s Optype [PermuteSelf] use 0.000037 s Optype [MatMulTransB] use 0.000029 s Optype [PermuteSelf] use 0.000026 s Optype [Linear] use 0.000061 s Optype [Linear] use 0.000044 s Optype [Linear] use 0.000057 s Optype [Linear] use 0.000048 s Optype [PermuteSelf] use 0.017333 s Optype [PermuteSelf] use 0.000186 s Optype [CatDirect] use 0.000042 s Optype [CatDirect] use 0.000033 s Optype [PermuteSelf] use 0.000060 s Optype [MatMulTransB] use 0.000034 s Optype [PermuteSelf] use 0.000034 s Optype [Linear] use 0.000047 s Optype [Linear] use 0.000057 s Optype [Linear] use 0.000049 s Optype [AddTo] use 0.000022 s Optype [Linear] use 0.000056 s Optype [PermuteSelf] use 0.018827 s Optype [PermuteSelf] use 0.000040 s Optype [PermuteSelf] use 0.000038 s Optype [MatMulTransB] use 0.000054 s Optype 
[MatMul] use 0.000021 s Optype [PermuteSelf] use 0.000045 s Optype [Linear] use 0.000049 s Optype [Linear] use 0.000062 s Optype [Linear] use 0.000055 s Optype [Linear] use 0.000065 s Optype [PermuteSelf] use 0.017346 s Optype [PermuteSelf] use 0.000062 s Optype [CatDirect] use 0.000031 s Optype [PermuteSelf] use 0.000044 s Optype [MatMulTransB] use 0.000056 s Optype [PermuteSelf] use 0.000071 s Optype [Linear] use 0.000054 s Optype [Linear] use 0.000063 s Optype [Linear] use 0.000059 s Optype [Linear] use 0.000052 s Optype [PermuteSelf] use 0.019089 s Optype [PermuteSelf] use 0.000043 s Optype [CatDirect] use 0.000032 s Optype [PermuteSelf] use 0.000041 s Optype [MatMulTransB] use 0.000032 s Optype [PermuteSelf] use 0.000029 s Optype [Linear] use 0.000075 s Optype [Linear] use 0.000052 s Optype [Linear] use 0.000054 s Optype [Linear] use 0.000053 s Optype [PermuteSelf] use 0.019014 s Optype [PermuteSelf] use 0.000093 s Optype [CatDirect] use 0.000037 s Optype [PermuteSelf] use 0.000067 s Optype [MatMulTransB] use 0.000039 s Optype [SoftMax] use 0.000209 s Optype [MatMul] use 0.000040 s Optype [PermuteSelf] use 0.000071 s Optype [Linear] use 0.000075 s Optype [Mul] use 0.000053 s Optype [RMSNorm] use 0.000021 s Optype [Linear] use 0.000066 s Optype [Linear] use 0.000054 s Optype [Linear] use 0.000053 s Optype [Split] use 0.000063 s Optype [Split] use 0.000025 s Optype [NearlyRotatePosition2D] use 0.000022 s Optype [PermuteSelf] use 0.016833 s Optype [PermuteSelf] use 0.000132 s Optype [PermuteSelf] use 0.000040 s Optype [MatMulTransB] use 0.000041 s Optype [PermuteSelf] use 0.000040 s Optype [Linear] use 0.000163 s Optype [AddTo] use 0.000046 s Optype [Linear] use 0.000087 s Optype [Linear] use 0.000080 s Optype [RMSNorm] use 0.000026 s Optype [Linear] use 0.000094 s Optype [Split] use 0.000030 s Optype [Split] use 0.000026 s Optype [Split] use 0.000024 s Optype [PermuteSelf] use 0.018522 s Optype [PermuteSelf] use 0.000106 s Optype [PermuteSelf] use 0.000102 s 
Optype [MatMulTransB] use 0.000032 s Optype [MatMul] use 0.000035 s Optype [PermuteSelf] use 0.000038 s Optype [Linear] use 0.000077 s Optype [Linear] use 0.000089 s Optype [Linear] use 0.000082 s Optype [Linear] use 0.000083 s Optype [NearlyRotatePosition2D] use 0.000028 s Optype [PermuteSelf] use 0.018652 s Optype [PermuteSelf] use 0.000056 s Optype [PermuteSelf] use 0.000049 s Optype [MatMulTransB] use 0.000060 s Optype [MatMul] use 0.000037 s Optype [PermuteSelf] use 0.000059 s Optype [Linear] use 0.000087 s Optype [RMSNorm] use 0.000032 s Optype [Linear] use 0.000080 s Optype [Linear] use 0.000079 s Optype [Linear] use 0.000080 s Optype [Split] use 0.000036 s Optype [PermuteSelf] use 0.016986 s Optype [PermuteSelf] use 0.000040 s Optype [PermuteSelf] use 0.000033 s Optype [MatMulTransB] use 0.000026 s Optype [PermuteSelf] use 0.000029 s Optype [Linear] use 0.000054 s Optype [Linear] use 0.000053 s Optype [Linear] use 0.000056 s Optype [Linear] use 0.000066 s Optype [PermuteSelf] use 0.019204 s Optype [PermuteSelf] use 0.000095 s Optype [CatDirect] use 0.000038 s Optype [PermuteSelf] use 0.000075 s Optype [MatMulTransB] use 0.000056 s Optype [MatMul] use 0.000032 s Optype [PermuteSelf] use 0.000054 s Optype [Linear] use 0.000081 s Optype [Linear] use 0.000090 s Optype [Linear] use 0.000075 s Optype [AddTo] use 0.000030 s Optype [Linear] use 0.000079 s Optype [Split] use 0.000035 s Optype [PermuteSelf] use 0.017039 s Optype [PermuteSelf] use 0.000076 s Optype [PermuteSelf] use 0.000027 s Optype [MatMulTransB] use 0.000033 s Optype [MatMul] use 0.000032 s Optype [PermuteSelf] use 0.000058 s Optype [Linear] use 0.000080 s Optype [RMSNorm] use 0.000031 s Optype [Linear] use 0.000078 s Optype [Linear] use 0.000080 s Optype [AddTo] use 0.000029 s Optype [Linear] use 0.000080 s Optype [Split] use 0.000036 s Optype [NearlyRotatePosition2D] use 0.000031 s Optype [PermuteSelf] use 0.018969 s Optype [PermuteSelf] use 0.000064 s Optype [PermuteSelf] use 0.000035 s Optype 
[MatMulTransB] use 0.000035 s Optype [PermuteSelf] use 0.000026 s Optype [Linear] use 0.000051 s Optype [Linear] use 0.000059 s Optype [Linear] use 0.000061 s Optype [Linear] use 0.000063 s Optype [PermuteSelf] use 0.018808 s Optype [PermuteSelf] use 0.000057 s Optype [PermuteSelf] use 0.000040 s Optype [MatMulTransB] use 0.000034 s Optype [AttentionMask] use 0.000026 s Optype [PermuteSelf] use 0.000110 s Optype [Linear] use 0.000050 s Optype [Linear] use 0.000080 s Optype [Linear] use 0.000083 s Optype [Linear] use 0.000081 s Optype [Split] use 0.000039 s Optype [Split] use 0.000037 s Optype [PermuteSelf] use 0.017157 s Optype [PermuteSelf] use 0.000047 s Optype [PermuteSelf] use 0.000035 s Optype [MatMulTransB] use 0.000033 s Optype [PermuteSelf] use 0.000039 s Optype [Linear] use 0.000044 s Optype [Linear] use 0.000067 s Optype [Linear] use 0.000050 s Optype [Linear] use 0.000057 s Optype [PermuteSelf] use 0.019182 s Optype [PermuteSelf] use 0.000059 s Optype [PermuteSelf] use 0.000038 s Optype [MatMulTransB] use 0.000032 s Optype [PermuteSelf] use 0.000052 s Optype [Linear] use 0.000049 s Optype [Linear] use 0.000062 s Optype [Linear] use 0.000068 s Optype [Linear] use 0.000052 s Optype [PermuteSelf] use 0.017368 s Optype [PermuteSelf] use 0.000464 s Optype [PermuteSelf] use 0.000071 s Optype [MatMulTransB] use 0.000027 s Optype [PermuteSelf] use 0.000036 s Optype [Linear] use 0.000055 s Optype [Linear] use 0.000056 s Optype [Linear] use 0.000053 s Optype [Linear] use 0.000055 s Optype [PermuteSelf] use 0.018501 s Optype [PermuteSelf] use 0.000047 s Optype [PermuteSelf] use 0.000038 s Optype [MatMulTransB] use 0.000035 s Optype [PermuteSelf] use 0.000049 s Optype [Linear] use 0.000053 s Optype [Linear] use 0.000051 s Optype [Linear] use 0.000055 s Optype [Linear] use 0.000053 s Optype [PermuteSelf] use 0.019000 s Optype [PermuteSelf] use 0.000052 s Optype [CatDirect] use 0.000051 s Optype [PermuteSelf] use 0.000051 s Optype [MatMulTransB] use 0.000078 s Optype 
[PermuteSelf] use 0.000038 s Optype [Linear] use 0.000073 s Optype [Linear] use 0.000064 s Optype [Linear] use 0.000055 s Optype [Linear] use 0.000059 s Optype [PermuteSelf] use 0.017269 s Optype [PermuteSelf] use 0.000047 s Optype [PermuteSelf] use 0.000033 s Optype [MatMulTransB] use 0.000037 s Optype [PermuteSelf] use 0.000091 s Optype [Linear] use 0.000114 s Optype [Linear] use 0.000105 s Optype [Linear] use 0.000095 s Optype [Linear] use 0.000113 s Optype [PermuteSelf] use 0.018669 s Optype [PermuteSelf] use 0.000049 s Optype [PermuteSelf] use 0.000029 s Optype [MatMulTransB] use 0.000032 s Optype [PermuteSelf] use 0.000056 s Optype [Linear] use 0.000056 s Optype [Linear] use 0.000053 s Optype [Linear] use 0.000053 s Optype [Linear] use 0.000043 s Optype [PermuteSelf] use 0.019055 s Optype [PermuteSelf] use 0.000089 s Optype [PermuteSelf] use 0.000028 s Optype [MatMulTransB] use 0.000044 s Optype [PermuteSelf] use 0.000041 s Optype [Linear] use 0.000049 s Optype [Linear] use 0.000055 s Optype [Linear] use 0.000056 s Optype [Linear] use 0.000060 s Optype [PermuteSelf] use 0.017319 s Optype [PermuteSelf] use 0.000101 s Optype [PermuteSelf] use 0.000066 s Optype [MatMulTransB] use 0.000024 s Optype [PermuteSelf] use 0.000045 s Optype [Linear] use 0.000058 s Optype [Linear] use 0.000130 s Optype [Linear] use 0.000098 s Optype [Linear] use 0.000083 s Optype [Split] use 0.000036 s Optype [PermuteSelf] use 0.019455 s Optype [PermuteSelf] use 0.000092 s Optype [CatDirect] use 0.000036 s Optype [PermuteSelf] use 0.000093 s Optype [MatMulTransB] use 0.000057 s Optype [MatMul] use 0.000032 s Optype [PermuteSelf] use 0.000057 s Optype [Linear] use 0.000080 s Optype [RMSNorm] use 0.000032 s Optype [Linear] use 0.000080 s Optype [Linear] use 0.000090 s Optype [AddTo] use 0.000033 s Optype [Linear] use 0.000115 s Optype [Split] use 0.000027 s Optype [Split] use 0.000034 s Optype [PermuteSelf] use 0.016747 s Optype [PermuteSelf] use 0.000069 s Optype [CatDirect] use 0.000031 
s Optype [PermuteSelf] use 0.000104 s Optype [MatMulTransB] use 0.000047 s Optype [SoftMax] use 0.000029 s Optype [PermuteSelf] use 0.000030 s Optype [Linear] use 0.000065 s Optype [Linear] use 0.000082 s Optype [Linear] use 0.000078 s Optype [Linear] use 0.000623 s Optype [Embedding] use 0.000433 s Optype [RMSNorm] use 0.000022 s Optype [MatMulTransB] use 0.000550 s Optype [RMSNorm] use 0.000023 s Optype [MatMulTransB] use 0.000040 s Optype [MatMul] use 0.000031 s Optype [CatDirect] use 0.000036 s Optype [MatMulTransB] use 0.000025 s Optype [SoftMax] use 0.000028 s Optype [MatMul] use 0.000041 s Optype [RMSNorm] use 0.000029 s Optype [Linear] use 0.000022 s Optype [AddTo] use 0.000029 s Optype [AddTo] use 0.000033 s Optype [Mul] use 0.000106 s Optype [Split] use 0.000026 s Optype [Split] use 0.000023 s Optype [CatDirect] use 0.000026 s Optype [RMSNorm] use 0.000021 s Optype [Embedding] use 0.000431 s Optype [RMSNorm] use 0.000025 s Optype [Linear] use 0.000043 s Optype [NearlyRotatePosition2D] use 0.000034 s Optype [PermuteSelf] use 0.000023 s Optype [MatMulTransB] use 0.000061 s Optype [Mul] use 0.000032 s Optype [MatMul] use 0.000041 s Optype [Split] use 0.000033 s Optype [MatMulTransB] use 0.000038 s Optype [RMSNorm] use 0.000034 s Optype [CatDirect] use 0.000032 s Optype [NearlyRotatePosition2D] use 0.000032 s Optype [Linear] use 0.000036 s Optype [AddTo] use 0.000063 s Optype [NearlyRotatePosition2D] use 0.000033 s Optype [MatMulTransB] use 0.000033 s Optype [CatDirect] use 0.000034 s Optype [MatMul] use 0.000064 s Optype [Linear] use 0.000034 s Optype [AttentionMask] use 0.000033 s Optype [NearlyRotatePosition2D] use 0.000033 s Optype [Mul] use 0.000061 s Optype [MatMulTransB] use 0.000037 s Optype [NearlyRotatePosition2D] use 0.000034 s Optype [AttentionMask] use 0.000031 s Optype [NearlyRotatePosition2D] use 0.000031 s Optype [Mul] use 0.000062 s Optype [AttentionMask] use 0.000033 s Optype [NearlyRotatePosition2D] use 0.000032 s Optype [Mul] use 0.000062 
s Optype [RMSNorm] use 0.000032 s Optype [Embedding] use 0.000572 s Optype [RMSNorm] use 0.000024 s Optype [MatMulTransB] use 0.000043 s Optype [Split] use 0.000029 s Optype [NearlyRotatePosition2D] use 0.000031 s Optype [Linear] use 0.000022 s Optype [Mul] use 0.000027 s Optype [RMSNorm] use 0.000022 s Optype [Embedding] use 0.000440 s Optype [NearlyRotatePosition2D] use 0.000033 s Optype [CatDirect] use 0.000036 s Optype [MatMulTransB] use 0.000037 s Optype [AttentionMask] use 0.000035 s Optype [AddTo] use 0.000033 s Optype [MatMulTransB] use 0.000038 s Optype [Split] use 0.000039 s Optype [MatMul] use 0.000037 s Optype [CatDirect] use 0.000062 s Optype [CatDirect] use 0.000035 s Optype [Split] use 0.000038 s Optype [Linear] use 0.000034 s Optype [Split] use 0.000036 s Optype [Linear] use 0.000034 s Optype [Mul] use 0.000063 s Optype [RMSNorm] use 0.000033 s Optype [PermuteSelf] use 0.000031 s Optype [CatDirect] use 0.000065 s Optype [CatDirect] use 0.000036 s Optype [Embedding] use 0.000450 s Optype [RMSNorm] use 0.000030 s Optype [Split] use 0.000021 s Optype [MatMulTransB] use 0.000045 s Optype [RMSNorm] use 0.000058 s Optype [MatMulTransB] use 0.000063 s Optype [NearlyRotatePosition2D] use 0.000063 s Optype [Mul] use 0.000034 s Optype [Split] use 0.000039 s Optype [Linear] use 0.000036 s Optype [Split] use 0.000067 s Optype [CatDirect] use 0.000035 s Optype [Linear] use 0.000053 s Optype [Embedding] use 0.000438 s Optype [RMSNorm] use 0.000023 s Optype [MatMulTransB] use 0.000039 s Optype [AddTo] use 0.000029 s Optype [RMSNorm] use 0.000132 s Optype [MatMulTransB] use 0.000024 s Optype [MatMul] use 0.000049 s Optype [Linear] use 0.000022 s Optype [Linear] use 0.000029 s Optype [CatDirect] use 0.000026 s Optype [MatMulTransB] use 0.000024 s Optype [Swiglu] use 0.000034 s Optype [Embedding] use 0.000398 s Optype [RMSNorm] use 0.000022 s Optype [MatMulTransB] use 0.000037 s Optype [RMSNorm] use 0.000023 s Optype [NearlyRotatePosition2D] use 0.000044 s Optype 
[NearlyRotatePosition2D] use 0.000030 s Optype [CatDirect] use 0.000041 s Optype [MatMulTransB] use 0.000046 s Optype [Linear] use 0.000045 s Optype [AddTo] use 0.000043 s Optype [RMSNorm] use 0.000026 s Optype [CatDirect] use 0.000029 s Optype [MatMulTransB] use 0.000021 s Optype [AttentionMask] use 0.000021 s Optype [Mul] use 0.000035 s Optype [RMSNorm] use 0.000030 s Optype [CatDirect] use 0.000026 s Optype [CatDirect] use 0.000048 s Optype [MatMulTransB] use 0.000026 s Optype [Split] use 0.000032 s Optype [Mul] use 0.000033 s Optype [MatMul] use 0.000076 s Optype [PermuteSelf] use 0.000022 s

ztxz16 commented 1 year ago

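These printed results are not very accurate: there is no sync after the CUDA kernel launch, so the operation may not have finished when the time is recorded.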

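Fused operators are on the roadmap, but the framework's current dynamic-batch mode is fairly awkward to implement together with fused operators, so this is probably a longer-term plan.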
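The timing caveat can be illustrated without a GPU. The sketch below is an analogy, not fastllm code: `launch_async` and `time_launch` are hypothetical names, and `std::async` stands in for an asynchronous kernel launch that returns before the work is done. In real CUDA code, the corresponding step would be a call such as `cudaDeviceSynchronize()` before reading the stop time.

```cpp
#include <cassert>
#include <chrono>
#include <future>
#include <thread>

using Clock = std::chrono::steady_clock;

// Hypothetical stand-in for an asynchronous kernel launch: the work runs on
// another thread and the call returns a handle right away.
std::future<void> launch_async(std::chrono::milliseconds work) {
    assert(work.count() >= 0);
    return std::async(std::launch::async,
                      [work] { std::this_thread::sleep_for(work); });
}

// Returns the seconds measured around one launch. With sync == false the
// clock is stopped as soon as the launch call returns; with sync == true it
// is stopped only after the work has completed.
double time_launch(std::chrono::milliseconds work, bool sync) {
    const auto start = Clock::now();
    std::future<void> done = launch_async(work);
    if (sync) done.wait();  // analogous to synchronizing with the device
    const auto stop = Clock::now();
    done.wait();            // always let the work finish before returning
    return std::chrono::duration<double>(stop - start).count();
}
```

Calling `time_launch(std::chrono::milliseconds(50), false)` should report far less than 50 ms, while the same call with `true` reports at least the full duration. The unsynchronized number mostly reflects launch overhead, and the real cost tends to surface in whichever later call happens to wait for the pending work.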

iamfaith commented 1 year ago

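Thanks for the reply. I print the time right before and after runop; in theory, the op should have finished executing by the time runop returns, right?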

You could consider fusing attention with static batching first.

eigen2017 commented 1 year ago

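Supporting plain batching would be enough. Is there actually a conflict between batching and a fused attention operator? Triton supports dynamic batching directly, and I have already integrated it successfully: https://github.com/eigen2017/fastllmtritonbackend/ I have also integrated batching plus the streaming interface into Triton, and I will release that when I find the time.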