ztxz16 / fastllm

A pure C++, cross-platform LLM acceleration library with Python bindings; ChatGLM-6B-class models can reach 10000+ tokens/s on a single GPU; supports GLM, LLaMA, and MOSS base models, and runs smoothly on mobile devices.
Apache License 2.0

Enabling USE_CUDA and USE_MMAP together causes "CUBLAS initialization failed" #303

Closed TylunasLi closed 10 months ago

TylunasLi commented 1 year ago

Test environment:

Problem:

The following options were specified at build time:

cmake -DUSE_CUDA=ON -DUSE_MMAP=ON
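For reference, the complete build would typically look like this (a sketch; the out-of-source layout is assumed from a standard CMake project, not taken from fastllm's docs):

```shell
# run from the fastllm repository root (assumed layout)
mkdir -p build && cd build
cmake .. -DUSE_CUDA=ON -DUSE_MMAP=ON
make -j"$(nproc)"
```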

Running the resulting binary produced the following error:

AVX: ON
AVX2: OFF
AARCH64: OFF
Neon FP16: OFF
Neon DOT: OFF
Load (200 / 200)
Warmup...
CUBLAS initialization failed:1

Error code 1 corresponds to CUBLAS_STATUS_NOT_INITIALIZED.

Debugged the failure with GDB:

(gdb) backtrace

#0  getFastllmCublasHandle () at /home/nlp/inference/fastllm/src/devices/cuda/fastllm-cuda.cu:24
#1  0x0000000000498ed6 in FastllmCudaBatchMatMulTransB (input0=..., input1=..., output=..., input0Spatial=2048, input1Spatial=16384, outputSpatial=16, input0Stride=128,
    input1Stride=128, batch=2, n=16, m=128, k=1, alpha=0.0883883461) at /home/nlp/inference/fastllm/src/devices/cuda/fastllm-cuda.cu:1720
#2  0x00000000004884d2 in fastllm::CudaMatMulTransBOp::Run(std::string const&, std::map<std::string, fastllm::Data*, std::less<std::string>, std::allocator<std::pair<std::string const, fastllm::Data*> > > const&, std::map<std::string, float, std::less<std::string>, std::allocator<std::pair<std::string const, float> > > const&, std::map<std::string, int, std::less<std::string>, std::allocator<std::pair<std::string const, int> > > const&) () at /home/nlp/inference/fastllm/src/devices/cuda/cudadevice.cpp:354
#3  0x0000000000431298 in fastllm::BaseDevice::Run (this=0xa5c720, opType=..., datas=..., floatParams=..., intParams=...)
    at /opt/rh/devtoolset-8/root/usr/include/c++/8/bits/char_traits.h:312
#4  0x00000000004342eb in fastllm::Executor::Run(std::string const&, std::map<std::string, fastllm::Data*, std::less<std::string>, std::allocator<std::pair<std::string const, fastllm::Data*> > > const&, std::map<std::string, float, std::less<std::string>, std::allocator<std::pair<std::string const, float> > > const&, std::map<std::string, int, std::less<std::string>, std::allocator<std::pair<std::string const, int> > > const&) () at /home/nlp/inference/fastllm/src/executor.cpp:99
#5  0x0000000000422b68 in fastllm::MatMulTransB(fastllm::Data const&, fastllm::Data const&, fastllm::Data&, float) ()
    at /opt/rh/devtoolset-8/root/usr/include/c++/8/bits/stl_tree.h:211
#6  0x00000000004546a4 in fastllm::ChatGLMModel::ForwardBatch(int, fastllm::Data const&, fastllm::Data const&, fastllm::Data const&, std::vector<std::pair<fastllm::Data, fastllm::Data>, std::allocator<std::pair<fastllm::Data, fastllm::Data> > >&, fastllm::GenerationConfig const&, fastllm::LastTokensManager const&, std::vector<std::vector<float, std::allocator<float> >*, std::allocator<std::vector<float, std::allocator<float> >*> >*) () at /home/nlp/inference/fastllm/src/models/chatglm.cpp:230
#7  0x000000000044b3d3 in fastllm::ChatGLMModel::Forward(fastllm::Data const&, fastllm::Data const&, fastllm::Data const&, std::vector<std::pair<fastllm::Data, fastllm::Data>, std::allocator<std::pair<fastllm::Data, fastllm::Data> > >&, fastllm::GenerationConfig const&, fastllm::LastTokensManager const&, std::vector<float, std::allocator<float> >*) () at /home/nlp/inference/fastllm/src/models/chatglm.cpp:77
#8  0x000000000044e018 in fastllm::ChatGLMModel::WarmUp() () at /home/nlp/inference/fastllm/src/models/chatglm.cpp:876
#9  0x0000000000431dbd in fastllm::CreateLLMModelFromFile(std::string const&) () at /home/nlp/inference/fastllm/src/model.cpp:91
#10 0x0000000000416251 in main () at /home/nlp/inference/fastllm/main.cpp:64
#11 0x00007fffed771555 in __libc_start_main () from /lib64/libc.so.6
#12 0x0000000000416abe in _start () at /home/nlp/inference/fastllm/main.cpp:98  

(gdb) frame 1

#1  0x0000000000498ed6 in FastllmCudaBatchMatMulTransB (input0=..., input1=..., output=..., input0Spatial=2048, input1Spatial=16384, outputSpatial=16, input0Stride=128,
    input1Stride=128, batch=2, n=16, m=128, k=1, alpha=0.0883883461) at /home/nlp/inference/fastllm/src/devices/cuda/fastllm-cuda.cu:1720
1830        auto fastllmCublasHandle = getFastllmCublasHandle();
(gdb) print cudaInput0
$5 = (float *) 0x7ffca0611000
(gdb) print cudaInput1
$6 = (float *) 0x0
(gdb) print cudaOutput
$7 = (float *) 0x0

The problem appears to be that CudaMemcpy() does not support copying directly from an mmap'd memory address.

TylunasLi commented 10 months ago

That said, with MMAP enabled the weights must first be copied into ordinary memory and then to the GPU, so loading is slower; it is essentially trading time for space.