withcatai / node-llama-cpp

Run AI models locally on your machine with node.js bindings for llama.cpp. Enforce a JSON schema on the model output on the generation level
https://node-llama-cpp.withcat.ai
MIT License

Error: vk::Queue::submit: ErrorDeviceLost #304

Closed: billyma128 closed this issue 1 month ago

billyma128 commented 2 months ago

Issue description

There is an error when generating a response, which looks like a Vulkan-related issue. However, ollama runs the same model on this machine without problems. Thanks for your time! Best regards!

Expected Behavior

I use node-llama-cpp in my real-world project, and it worked very well until I got one particular laptop. I can reproduce the error with the CLI command below:

npx --no node-llama-cpp chat --model models/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf

Expected: the model loads successfully and the response is generated successfully.

Actual Behavior

With the same command, the model loads successfully, but generating the response throws an error and the process exits:

C:\Users\A\Documents\GitHub\foo-chat-desktop>npx --no node-llama-cpp chat --model models/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf
Loading model                  0.000%
ggml_vulkan: Found 1 Vulkan devices:
√ Model loaded

Using this model ("C:\Users\A\Documents\GitHub\foo-chat-desktop\models\tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf") to tokenize text with special tokens and then detokenize it resulted in a different text. There might be an issue with the model or the tokenizer implementation. Using this model may not work as intended
√ Context created
GPU       Type: Vulkan   VRAM: 7.91GB   Name: Intel(R) Iris(R) Plus Graphics
Model     Type: llama 1B Q4_K - Medium   Size: 636.18MB   GPU layers: 23/23 offloaded (100%)   BOS: <s>   EOS: </s>
          Train context size: 2048
Context   Size: 2048
Chat      Wrapper: JinjaTemplate   Repeat penalty: 1.1 (apply to last 64 tokens)
> hi
AI: Sure, here's a revised version of the text with additional instructions:

You are a helpful, respectful and honest assistant. Always answer as helpfully as possible. If a question does not make any[Error: vk::Queue::submit: ErrorDeviceLost]
C:\Users\A\Documents\GitHub\foo-chat-desktop>npx --no node-llama-cpp chat --wrapper gemma --model models/gemma-2-2b-it-q2_k_0.gguf
Loading model                  0.000%
ggml_vulkan: Found 1 Vulkan devices:
√ Model loaded
√ Context created
GPU       Type: Vulkan   VRAM: 7.91GB   Name: Intel(R) Iris(R) Plus Graphics
Model     Type: gemma2 2B Q2_K - Medium   Size: 1.59GB   GPU layers: 27/27 offloaded (100%)   BOS: <bos>   EOS: <eos>
          Train context size: 8192
Context   Size: 8192
Chat      Wrapper: Gemma   Repeat penalty: 1.1 (apply to last 64 tokens)
> hi
AI: [Error: vk::Queue::submit: ErrorDeviceLost]

Steps to reproduce

npx --no node-llama-cpp chat --model models/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf
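
For reference, here is a minimal programmatic equivalent of that CLI reproduction, as a sketch. It assumes the node-llama-cpp 3.0.0-beta API (getLlama, loadModel, LlamaChatSession) and the model path from this report; the file name repro.mjs and everything else is illustrative.

// repro.mjs - load the model and prompt it once, mirroring the CLI chat command
import path from "path";
import {getLlama, LlamaChatSession} from "node-llama-cpp";

const llama = await getLlama(); // picks the default backend (Vulkan on this machine)
const model = await llama.loadModel({
    modelPath: path.resolve("models", "tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf")
});
const context = await model.createContext();
const session = new LlamaChatSession({
    contextSequence: context.getSequence()
});

// Loading the model succeeds; the vk::Queue::submit ErrorDeviceLost is thrown
// here, while the response is being generated.
const answer = await session.prompt("hi");
console.log("AI:", answer);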

My Environment

Dependency              Version
Operating System        Windows 11 Professional 23H2 (22631.4037)
CPU                     Intel Core i5-1030NG7
Node.js version         20.17.0 (LTS)
TypeScript version      5.5.4
node-llama-cpp version  3.0.0-beta.44

Additional Context

I have upgraded the Intel Iris graphics driver to the most stable version available, 31.0.101.2128, and restarted the computer.

Relevant Features Used

Are you willing to resolve this issue by submitting a Pull Request?

Yes, I have the time, but I don't know how to start. I would need guidance.

giladgd commented 2 months ago

Ollama doesn't support Vulkan, so it doesn't use it. Can you please try running the command with both --gpu false and --gpu vulkan and report whether there is any difference in inference speed? I'm trying to figure out whether the Vulkan device used on your machine is the CPU or an actual GPU.
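
For completeness, the same comparison can be made from code by selecting the backend explicitly when creating the llama instance. This is a sketch assuming the 3.0.0-beta getLlama options; the gpu option mirrors the CLI's --gpu flag.

import {getLlama} from "node-llama-cpp";

// CPU-only inference, equivalent to running the CLI with --gpu false
const cpuLlama = await getLlama({gpu: false});

// Force the Vulkan backend, equivalent to running the CLI with --gpu vulkan
const vulkanLlama = await getLlama({gpu: "vulkan"});

Timing the same prompt on both instances should show whether Vulkan actually accelerates anything on this machine.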

Also, can you please run this command and share its results?

npx --yes node-llama-cpp@beta inspect gpu
giladgd commented 1 month ago

Closing due to inactivity. If you still encounter issues with node-llama-cpp, let me know and I'll try to help.