Closed: bqhuyy closed this issue 2 months ago
There seems to be an issue with Qwen models when using CUDA.
I've seen some suggestions to try other quantizations of a Qwen model since some may still work with CUDA, but I couldn't get any of the quantizations of the model you linked to work with CUDA.
I've done some tests and can confirm this is an issue with llama.cpp and is not something specific to node-llama-cpp.
I'll make it easier to disable the use of CUDA (or any other compute layer), so you can force node-llama-cpp not to use it when you know it has an issue with a model you want to use.
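In the meantime, something along these lines may already let you avoid CUDA (a minimal sketch; it assumes the gpu option of getLlama accepts false or a specific backend name, so check the docs of the version you're using):

```typescript
import {getLlama} from "node-llama-cpp";

// Assumption: `gpu: false` disables GPU acceleration entirely (CPU only)
const cpuOnlyLlama = await getLlama({gpu: false});

// Assumption: a specific backend can be requested explicitly, e.g. Vulkan instead of CUDA
const vulkanLlama = await getLlama({gpu: "vulkan"});
```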
Try using IQ quants with CUDA.
@giladgd Thank you for your feedback. I found a suggestion to use FlashAttention to solve this problem. How can I enable it in node-llama-cpp?
@bqhuyy I've released a new beta version that allows you to enable flash attention like this:
```typescript
import {fileURLToPath} from "url";
import path from "path";
import {getLlama, LlamaChatSession} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "Qwen2-1.5B-Instruct.Q4_K_M.gguf"),
    defaultContextFlashAttention: true // it's best to enable it via this setting
});

// you can also pass {flashAttention: true} here to enable it for only this context
const context = await model.createContext();
```
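To actually generate text with it, the LlamaChatSession import from above can be wired to the context like this (a minimal sketch of how the other beta examples use it):

```typescript
// continuing from the `context` created above
const session = new LlamaChatSession({
    contextSequence: context.getSequence()
});

const answer = await session.prompt("Hello! Who are you?");
console.log(answer);
```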
Let me know whether you were able to use any Qwen models with CUDA without flash attention, because if it's impossible to use Qwen models with CUDA without flash attention, then I'll enable flash attention by default for Qwen models when CUDA is used. I haven't enabled flash attention by default for all models since it's still considered experimental, so it may not work well with all models, but if Qwen is unusable without it then it's better to have it enabled by default in this case.
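If it does turn out that Qwen only needs this on CUDA, a workaround on your side could look roughly like this in the meantime (a sketch; it assumes llama.gpu reports the active compute layer, so verify the property name against your version):

```typescript
import {getLlama} from "node-llama-cpp";

const llama = await getLlama();

// Assumption: `llama.gpu` is "cuda", "vulkan", "metal" or false depending on the active backend
const usingCuda = llama.gpu === "cuda";

const model = await llama.loadModel({
    modelPath: "path/to/Qwen2-1.5B-Instruct.Q4_K_M.gguf",
    // only enable flash attention when running on CUDA
    defaultContextFlashAttention: usingCuda
});
```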
@giladgd hi, Qwen2 (CUDA) works with defaultContextFlashAttention: true
@giladgd You can use flash attention with Qwen2 as a workaround for this bug, but it will only work with CUDA 12. With CUDA 11, FlashAttention will not change anything.
(source: https://github.com/ggerganov/llama.cpp/issues/8025)
I found some Qwen2 models that worked on CUDA 12 without flash attention, so since enabling flash attention for Qwen2 models is not always necessary, I won't make it the default because it is still considered experimental.
I'm closing this issue for now since defaultContextFlashAttention: true seems to solve it, and I'll make sure to mention using flash attention as a fix for this issue in the documentation of version 3.
Issue description
Models keep generating dummy results when running with CUDA.
Expected Behavior
Models stop generating dummy output, as when running with cpu or vulkan.
Actual Behavior
Models keep generating dummy results.
Steps to reproduce
I use this Qwen2 1.5B model, downloaded from here, running with gpu set to auto or cuda.
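In code, selecting the backend would look roughly like this (a simplified sketch; the gpu option passed to getLlama is an assumption about how the backend is selected, and in my case I used the prebuilt Electron example linked under Additional Context below):

```typescript
import {getLlama} from "node-llama-cpp";

// reproduces with gpu set to "cuda", or to "auto" when CUDA ends up being picked (assumed option)
const llama = await getLlama({gpu: "cuda"});

const model = await llama.loadModel({
    modelPath: "path/to/Qwen2-1.5B-Instruct.Q4_K_M.gguf"
});
const context = await model.createContext();
// generating with this context only produces dummy output
```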
My Environment
node-llama-cpp version:
Additional Context
Here is an example I ran using https://github.com/withcatai/node-llama-cpp/releases/download/v3.0.0-beta.36/node-llama-cpp-electron-example.Windows.3.0.0-beta.36.x64.exe
These models run normally with 'vulkan', 'cpu', and 'metal'.
Relevant Features Used
Are you willing to resolve this issue by submitting a Pull Request?
Yes, I have the time, but I don't know how to start. I would need guidance.