withcatai / node-llama-cpp

Run AI models locally on your machine with node.js bindings for llama.cpp. Enforce a JSON schema on the model output at the generation level
https://node-llama-cpp.withcat.ai
MIT License

Problem when running some models with CUDA #261

Closed: bqhuyy closed this issue 2 months ago

bqhuyy commented 3 months ago

Issue description

Models keep generating dummy results when running with CUDA.

Expected Behavior

Models should stop generating dummy output, as they do when running with CPU or Vulkan.

Actual Behavior

Models keep generating dummy results.

Steps to reproduce

I use this Qwen2 1.5B model, downloaded from here, with gpu set to auto or cuda:

const llama = await getLlama({gpu: 'cuda'})
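For context, a fuller reproduction sketch in the style of the project's examples (the model path and prompt are placeholders; the chat-session usage mirrors the example further down this thread):

import {getLlama, LlamaChatSession} from "node-llama-cpp";

// force the CUDA backend; the problem does not reproduce with "vulkan" or "cpu"
const llama = await getLlama({gpu: "cuda"});

// placeholder path to the Qwen2 1.5B Instruct GGUF linked above
const model = await llama.loadModel({modelPath: "Qwen2-1.5B-Instruct.Q4_K_M.gguf"});
const context = await model.createContext();
const session = new LlamaChatSession({contextSequence: context.getSequence()});

// with CUDA, the output here is dummy/garbage text instead of a normal reply
console.log(await session.prompt("Hello"));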

My Environment

Dependency              Version
Operating System        Windows 10
CPU                     AMD Ryzen 7 3700X
GPU                     RTX 4090, RTX 3080
Node.js version         v20.11.1
TypeScript version      5.5.2
node-llama-cpp version  3.0.0-beta.36

Additional Context

Here is an example I ran using https://github.com/withcatai/node-llama-cpp/releases/download/v3.0.0-beta.36/node-llama-cpp-electron-example.Windows.3.0.0-beta.36.x64.exe (see the attached screenshot, "Screenshot 2024-07-01 165036").

These models run normally with 'vulkan', 'cpu', and 'metal'.

Relevant Features Used

Are you willing to resolve this issue by submitting a Pull Request?

Yes, I have the time, but I don't know how to start. I would need guidance.

giladgd commented 3 months ago

There seems to be an issue with Qwen models when using CUDA. I've seen some suggestions to try other quantizations of a Qwen model since some may still work with CUDA, but I couldn't get any of the quantizations of the model you linked to work with CUDA. I've done some tests and can confirm this is an issue with llama.cpp and is not something specific to node-llama-cpp.

I'll make it easier to disable the use of CUDA (or any other compute layer), so you can force node-llama-cpp not to use it when you know it has an issue with a model you want to use.
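In the meantime, a minimal sketch of picking the compute layer with the existing gpu option (passing "vulkan" matches what reportedly works above; treating false as "disable GPU acceleration entirely" is an assumption about the v3 API):

import {getLlama} from "node-llama-cpp";

// explicitly pick a backend that is known to work with this model
const llamaVulkan = await getLlama({gpu: "vulkan"});

// assumed option: disable GPU acceleration and fall back to CPU
const llamaCpu = await getLlama({gpu: false});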

anunknowperson commented 3 months ago

Try using IQ quants with CUDA.

bqhuyy commented 3 months ago

@giladgd Thank you for your feedback. I found a suggestion to use FlashAttention to solve this problem. How can I enable it in node-llama-cpp?

giladgd commented 3 months ago

@bqhuyy I've released a new beta version that allows you to enable flash attention like this:

import {fileURLToPath} from "url";
import path from "path";
import {getLlama, LlamaChatSession} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "Qwen2-1.5B-Instruct.Q4_K_M.gguf"),
    defaultContextFlashAttention: true // it's best to enable it via this setting
});

// you can also pass {flashAttention: true} here to enable it for only this context
const context = await model.createContext();
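
// illustrative sketch: using the imported LlamaChatSession with this context
// (assumes the v3 beta getSequence()/prompt() API)
const session = new LlamaChatSession({contextSequence: context.getSequence()});
console.log(await session.prompt("Hello"));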

Let me know if you were able to use any Qwen models with CUDA without flash attention. If it's impossible to use Qwen models with CUDA without it, then I'll enable flash attention by default for Qwen models when CUDA is used. I haven't enabled flash attention by default for all models since it's still considered experimental and may not work well with every model, but if Qwen is unusable without it, then it's better to have it enabled by default in this case.

bqhuyy commented 2 months ago

@giladgd Hi, Qwen2 (CUDA) works with defaultContextFlashAttention: true.

anunknowperson commented 2 months ago

@giladgd You can use flash attention with Qwen2 as a workaround for this bug, but it will only work with CUDA 12. With CUDA 11, flash attention will not change anything.

(source here: https://github.com/ggerganov/llama.cpp/issues/8025)

giladgd commented 2 months ago

I found some Qwen2 models that worked on CUDA 12 without flash attention, so enabling flash attention is not always necessary for Qwen2 models. Since it is still considered experimental, I won't make it the default.

I'm closing this issue for now since defaultContextFlashAttention: true seems to solve it, and I'll make sure to mention flash attention as a fix for this problem in the version 3 documentation.