withcatai / node-llama-cpp

Run AI models locally on your machine with node.js bindings for llama.cpp. Force a JSON schema on the model output on the generation level
https://node-llama-cpp.withcat.ai
MIT License

ggml_allocr_alloc: not enough space in the buffer #59

Closed. Pixelycia closed this issue 10 months ago

Pixelycia commented 11 months ago

Issue description

When I pass a prompt of more than ~1000 tokens, it fails with a ggml_allocr_alloc error.

Expected Behavior

Prompts up to the model's context size should be accepted.

Actual Behavior

AI: ggml_allocr_alloc: not enough space in the buffer (needed 308691360, largest block available 281657952)
GGML_ASSERT: /Users/pixelycia/Projects/node-llama-cpp/node_modules/node-llama-cpp/llama/llama.cpp/ggml-alloc.c:173: !"not enough space in the buffer"
zsh: abort  node-llama-cpp chat -m ./models/mistral-7b-openorca.Q5_K_M.gguf -c 8192

Steps to reproduce

I was able to reproduce the error both through code and the CLI:

  1. node-llama-cpp chat -m ./models/mistral-7b-openorca.Q5_K_M.gguf -c 8192
  2. Pass long prompt.

My Environment

Operating System: macOS Sonoma 14.0
Node.js version: v18.18.0
TypeScript version: 5.1.3
node-llama-cpp version: ^2.5.1

Additional Context

I compiled the latest llama.cpp from the original repository, and it works perfectly fine.

I tried to clear, re-download, and re-build with the built-in "node-llama-cpp" CLI tool, using different releases and with/without Metal support; it fails with this error every time.

Relevant Features Used

Are you willing to resolve this issue by submitting a Pull Request?

Yes, I have the time, but I don't know how to start. I would need guidance.

giladgd commented 11 months ago

@Pixelycia As far as I've seen, the model you mentioned here doesn't support a context length of 8192. Try decreasing it to 2048 and it shouldn't fail then.
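
For example, the same command from your reproduction steps with a smaller -c value (a sketch; adjust the model path to wherever your model is):

node-llama-cpp chat -m ./models/mistral-7b-openorca.Q5_K_M.gguf -c 2048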

Pixelycia commented 11 months ago

@Pixelycia As far as I've seen, the model you mentioned here doesn't support a context length of 8192.

Try decreasing it to 2048 and it shouldn't fail then.

https://huggingface.co/Open-Orca/Mistral-7B-OpenOrca - it's an 8k-context model; in any case, different models with smaller contexts also fail on prompts of 1000+ tokens.

giladgd commented 11 months ago

@Pixelycia I think it has something to do with how you convert the models to GGUF. I'm using pre-converted models by TheBloke on Hugging Face with a long context without any problem. Does your Mac have an Intel chip or an Apple silicon chip? Also, how much RAM does your machine have?

Pixelycia commented 11 months ago

@Pixelycia I think it has something to do with how you convert the models to GGUF. I'm using pre-converted models by TheBloke on Hugging Face with a long context without any problem. Does your Mac have an Intel chip or an Apple silicon chip? Also, how much RAM does your machine have?

Apple Silicon, M1, 32 GB RAM. I tried different GGUF models from TheBloke, no luck.

There is a difference I noticed in how llama.cpp reacts to a request compared to this library. It looks like llama.cpp allocates space in RAM in batches (I can see my request appear in the console sliced up, with a delay), but this library seems to try to allocate the entire space at once.

giladgd commented 11 months ago

@Pixelycia I have an M1 Max with 32GB RAM and it works for me. Can you please give me specific instructions on how to reproduce this issue? For example, which specific model by TheBloke to get, how to load it, and what text to prompt it with. Please do this on the latest version of node-llama-cpp.

Also, does it still happen to you with version 2.7.0? I've updated the code to use a new llama.cpp interface that works differently than how it used to.

The delay you noticed may be because node-llama-cpp starts evaluating the system prompt, and everything else that precedes your prompt, only after you send the initial prompt through a LlamaChatSession object.
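
To illustrate (a rough sketch reusing the same LlamaChatSession API; the model path and prompt text are placeholders):

import { LlamaChatSession, LlamaContext, LlamaModel } from "node-llama-cpp"

const model = new LlamaModel({ modelPath: "./models/mistral-7b-openorca.Q5_K_M.gguf" })
const context = new LlamaContext({ model })
// Creating the session does not evaluate any tokens yet.
const session = new LlamaChatSession({ context })

// The system prompt, plus everything else that precedes your text, is only
// evaluated here, when the first prompt is sent; that is the delay you see.
const a1 = await session.prompt("Your long prompt...")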

Pixelycia commented 10 months ago

@giladgd sorry for the delayed response, I'll do more tests and get back to you

Pixelycia commented 10 months ago

@giladgd the error mentioned in this ticket is gone, so we can close it, but I probably still don't understand something: if I put a long prompt (bigger than the batch size) into a fresh session context, it tells me:

GGML_ASSERT: /Users/runner/work/node-llama-cpp/node-llama-cpp/llama/llama.cpp/llama.cpp:5746: n_tokens <= n_batch

and indeed that's true. To make my prompt work I need to increase the batch size so that it's bigger than the prompt, or just equal to the context size. But why should I have to do that? What am I doing wrong?

I'm trying to use the library not only in chat mode but in an instruction mode as well, i.e. I want to send an instruction with a big context and get a response in a single shot (for example, I can recreate the session on each prompt or implement a different type of session class).

P.S. I used a random model from TheBloke: https://huggingface.co/TheBloke/MistralLite-7B-GGUF

P.S.2. I used code from the example:

import { fileURLToPath } from "url"
import path from "path"
import { LlamaChatSession, LlamaContext, LlamaModel } from "node-llama-cpp"

const __dirname = path.dirname(fileURLToPath(import.meta.url))

const model = new LlamaModel({
  modelPath: path.join(__dirname, "server/models", "mistrallite.Q5_K_M.gguf"),
})
const context = new LlamaContext({ model, contextSize: 4096, batchSize: 512 }) // Increase batchSize to 4096 to make it work
const session = new LlamaChatSession({ context })

const q1 = `...`

const a1 = await session.prompt(q1)
console.log("AI: " + a1)
giladgd commented 10 months ago

@Pixelycia The context size is the number of tokens the model can be aware of at once for generating a new token, and the batch size is the number of input tokens that can be processed on a single "GPU process" (this is an oversimplification to make it easier to understand).

I'm currently working on making the experience more seamless so you won't have to configure any of those, and everything will just work as you expect it to, but it's not ready yet. For now, the implementation is tightly coupled with llama.cpp's implementation, so in practice you can only evaluate <batch size> tokens at once.

Setting the batch size to be the same as the context size should suffice for your use case with the current version of node-llama-cpp.
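
For example, something like this should work (a rough sketch based on your snippet above; the model path is a placeholder):

import { LlamaChatSession, LlamaContext, LlamaModel } from "node-llama-cpp"

const model = new LlamaModel({ modelPath: "./models/mistrallite.Q5_K_M.gguf" })
// Matching batchSize to contextSize lets a prompt as long as the whole context be evaluated at once.
const context = new LlamaContext({ model, contextSize: 4096, batchSize: 4096 })
const session = new LlamaChatSession({ context })

const a1 = await session.prompt("...")
console.log("AI: " + a1)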