withcatai / node-llama-cpp

Run AI models locally on your machine with node.js bindings for llama.cpp. Enforce a JSON schema on the model output on the generation level
https://node-llama-cpp.withcat.ai
MIT License

Support file based prompt caching #180

Open StrangeBytesDev opened 6 months ago

StrangeBytesDev commented 6 months ago

Feature Description

LlamaCPP is able to cache prompts to a specific file via the "--prompt-cache" flag. I think that exposing this through node-llama-cpp would enable techniques for substantial performance improvements that are otherwise impossible. For example, you could create a separate cache file for each conversation, and when you switch from one conversation to another, you could load its existing cache file instead of re-processing the conversation history. You'd also be able to keep the cache available indefinitely, which is currently not possible with the other caching mechanisms.

The Solution

Implement a config option to specify a prompt cache file.
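
Just to illustrate what I have in mind, something along these lines (the promptCacheFile option name is made up here for illustration and is not part of the current API):

import {fileURLToPath} from "url";
import path from "path";
import {getLlama} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "dolphin-2.1-mistral-7b.Q4_K_M.gguf")
});

// "promptCacheFile" is a hypothetical option name used only to illustrate this request:
// the prompt cache would be loaded from this file if it exists
// and written back to it after evaluation
const context = await model.createContext({
    contextSize: 2048,
    promptCacheFile: path.join(__dirname, "caches", "conversation-1234.bin")
});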

Considered Alternatives

The LlamaCPP server implements something similar with slots. With each request, you're able to specify a slot ID, and the server will then use that slot's existing prompt cache for the request. This works pretty well, but since each slot is kept in memory, it limits the number of slots you can use at once, and it doesn't preserve the cache between server restarts.
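
For reference, a rough sketch of how that looks against the llama.cpp server (the exact field names depend on the server version; older builds use slot_id while newer ones use id_slot, so treat these as assumptions):

const response = await fetch("http://localhost:8080/completion", {
    method: "POST",
    headers: {"Content-Type": "application/json"},
    body: JSON.stringify({
        prompt: "A chat between a user and an assistant.\nUser: Hello\nAssistant:",
        n_predict: 128,
        cache_prompt: true, // reuse the prompt cache held in memory for this slot
        id_slot: 0          // pin the request to a specific slot ("slot_id" on older builds)
    })
});
const {content} = await response.json();
console.log(content);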

Additional Context

I'm able to work on this feature with a little guidance.

Related Features to This Feature Request

Are you willing to resolve this issue by submitting a Pull Request?

Yes, I have the time, but I don't know how to start. I would need guidance.

StrangeBytesDev commented 6 months ago

I think this function in llama.cpp might be the right one to call to try to implement this. https://github.com/ggerganov/llama.cpp/blob/b2440/llama.cpp#L14010

But I've never done any C++-to-Node.js bindings before, so I'm doing my best to work through how that works and how to implement it here by inferring from addon.cpp.

giladgd commented 6 months ago

I really like the idea :)

I've experimented with llama_load_session_file in the past, and have a few conclusions:

  • The main problem is that it holds the entire context state and not just the evaluation cache of the tokens used in a specific context sequence, so it cannot be used together with multiple sequences, thus eliminating the ability to do efficient batching this way.
  • It saves the entire context state, including all unused buffers, so the generated files are huge and can quickly fill up the storage of your machine if used frequently.
  • It depends on the specific implementation of the current binary, so if you update to the latest version of llama.cpp or node-llama-cpp (a new version of llama.cpp is released every few hours), every slight difference in the implementation that affects how things are saved in memory will make it impossible to load such a memory dump safely in another version without crashing or corrupting memory.
  • IIRC (since I experimented with it months ago), it depends on the context size you created the context with and some other parameters, so the new context must match the parameters of the one whose state you saved, which can pretty easily lead to memory corruption and crashes.

If you'd like, you can try to add the ability to save and load only the evaluation cache of a context sequence to llama.cpp; that would solve most of the problems I encountered and make it viable to support in node-llama-cpp.

Madd0g commented 6 months ago

> The main problem is that it holds the entire context state and not just the evaluation cache of the tokens used in a specific context sequence, so it cannot be used together with multiple sequences, thus eliminating the ability to do efficient batching this way.

I've used the oobabooga API for some batch tasks, and it is noticeably faster for sequential large prompts when the start of the text is the same and only the ending differs. It seems to be a feature of llama-cpp-python? Is that a different implementation of prefix caching?

I was hoping to benefit from this feature too; I forgot that llama.cpp and the Python version are two different things.

giladgd commented 6 months ago

@Madd0g The way it works is that the existing context state is reused for the new evaluation: since the start of the new prompt matches the current context state, evaluation of the new prompt can start at the first token that differs from the existing state.

This feature already exists in node-llama-cpp; you just have to reuse the same context sequence across multiple chats. For example, using the version 3 beta, you can do this:

import {fileURLToPath} from "url";
import path from "path";
import {getLlama, LlamaChatSession} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "dolphin-2.1-mistral-7b.Q4_K_M.gguf")
});
const context = await model.createContext({
    contextSize: Math.min(4096, model.trainContextSize)
});
const contextSequence = context.getSequence();
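// autoDisposeSequence: false keeps the sequence (and its evaluated state) alive
// after the session is disposed, so the next session can reuse it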
const session = new LlamaChatSession({
    contextSequence,
    autoDisposeSequence: false
});

const q1 = "Hi there, how are you?";
console.log("User: " + q1);

const a1 = await session.prompt(q1);
console.log("AI: " + a1);

session.dispose();
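
// the new session reuses the same context sequence, so evaluation of the next
// prompt can start at the first token that differs from the state already in the sequence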

const session2 = new LlamaChatSession({
    contextSequence
});

const q1a = "Hi there";
console.log("User: " + q1a);

const a1a = await session2.prompt(q1a);
console.log("AI: " + a1a);
Madd0g commented 6 months ago

@giladgd - thanks, I played around with the beta today.

I tried running on CPU, looping over an array of strings; for me, the evaluation only takes longer if I dispose of and fully recreate the session.

  1. reusing the contextSequence but recreating the session - no difference
  2. autoDisposeSequence - true/false - no difference

I'm resetting the history in the loop to only keep the system message:

const context = await model.createContext({
  contextSize: Math.min(4096, model.trainContextSize),
});
const contextSequence = context.getSequence();
const session = new LlamaChatSession({
  systemPrompt,
  contextSequence,
  autoDisposeSequence: false,
});

// looping over the array of strings, and in the loop
// keeping only the system message:
for (const text of prompts) {
  await session.prompt(text);
  session.setChatHistory(session.getChatHistory().slice(0, 1));
}

am I doing something wrong?

StrangeBytesDev commented 6 months ago

> • The main problem is that it holds the entire context state and not just the evaluation cache of the tokens used in a specific context sequence, so it cannot be used together with multiple sequences, thus eliminating the ability to do efficient batching this way.
> • It saves the entire context state, including all unused buffers, so the generated files are huge and can quickly fill up the storage of your machine if used frequently.
> • It depends on the specific implementation of the current binary, so if you update to the latest version of llama.cpp or node-llama-cpp (a new version of llama.cpp is released every few hours), every slight difference in the implementation that affects how things are saved in memory will make it impossible to load such a memory dump safely in another version without crashing or corrupting memory.
> • IIRC (since I experimented with it months ago), it depends on the context size you created the context with and some other parameters, so the new context must match the parameters of the one whose state you saved, which can pretty easily lead to memory corruption and crashes.

I'm a little fuzzy on the difference between the "entire context state" and the "evaluation cache" because I don't have a solid conceptual picture of how batching works under the hood in LlamaCPP. It sounds to me like the existing prompt-based caching would only really be useful for single-user setups and for short-term caching. Is there a way to cache a context to disk on the Node side with the V3 beta? I'm assuming a naive attempt to do something like this won't actually work:

const context = await model.createContext({
    contextSize: 2048,
})
fs.writeFileSync("context.bin", context)
giladgd commented 6 months ago

@Madd0g It's not a good idea to manually truncate the chat history like that just to reset it; it's better to create a new LlamaChatSession as in my example. A LlamaChatSession is just a wrapper around a LlamaContextSequence that facilitates chatting with a model, so there's no significant performance value in reusing that object.

The next beta version should be released next week and will include a tokenMeter on every LlamaContextSequence that will let you see exactly how many tokens were evaluated and generated.
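
As a rough sketch of how that could be used, continuing from the contextSequence and session in my example above (the usedInputTokens / usedOutputTokens property names below are just placeholders, not a finalized API):

// tokenMeter usage sketch; the property names are assumptions, not a confirmed API
const inputBefore = contextSequence.tokenMeter.usedInputTokens;

const answer = await session.prompt("And how is the weather today?");

console.log("newly evaluated input tokens:",
    contextSequence.tokenMeter.usedInputTokens - inputBefore);
console.log("generated output tokens:",
    contextSequence.tokenMeter.usedOutputTokens);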

giladgd commented 6 months ago

@StrangeBytesDev A context can have multiple sequences, and each sequence has its own state and history. When you evaluate things on a specific sequence, other sequences are not affected and are not aware of the evaluation. Using multiple sequences on a single context has a performance advantage over creating multiple contexts with a single sequence on each, which is why I opted to expose that concept as-is in node-llama-cpp.

Since every sequence is supposed to be independent and have its own state, there shouldn't be any functions with side effects that affect other sequences when you only intend to affect a specific sequence.

The problem with llama_load_session_file is that it restores the state of a context with all of its sequences, which makes it incompatible with the concept of multiple independent sequences. While it can be beneficial when you only use a single sequence on a context, I opted not to add support for it due to the rest of the issues I mentioned earlier. Also, for the rest of the optimizations that node-llama-cpp employs to keep working properly after loading a context state from a file, a different file format with additional data that node-llama-cpp needs would have to be created, so it wouldn't be as simple as just exposing the native function on the JS side.
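
To make the multi-sequence concept concrete, here is a minimal sketch that reuses the model from my example above and the sequences option of createContext in the version 3 beta; each sequence keeps its own independent state while their evaluations are batched on the shared context:

const context = await model.createContext({
    contextSize: 2048,
    sequences: 2 // one context, two independent sequences
});

const sessionA = new LlamaChatSession({contextSequence: context.getSequence()});
const sessionB = new LlamaChatSession({contextSequence: context.getSequence()});

// prompting both sessions concurrently lets the context batch their evaluations together
const [answerA, answerB] = await Promise.all([
    sessionA.prompt("Write a haiku about the sea"),
    sessionB.prompt("Write a haiku about the desert")
]);
console.log({answerA, answerB});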

Madd0g commented 6 months ago

> @Madd0g It's not a good idea to manually truncate the chat history like that just to reset it; it's better to create a new LlamaChatSession as in my example.

Thanks, I initially couldn't get it to work without it retaining history; I was doing something wrong. Today I did manage to do it correctly, with something like this in a loop:

if (session) {
  session.dispose();
}
session = new LlamaChatSession({ contextSequence, systemPrompt, chatWrapper: "auto" })
console.log(session.getChatHistory());

I pulled the chatHistory out of the session and correctly see only the system message in there.

dabs9 commented 3 months ago

Hey @giladgd, thanks for all of your work on the library. I have a couple of questions (some of them related to this issue), and I didn't know a better way to get in touch than to comment here. My questions:

Please let me know if there is a Discord or better way of getting in touch. You can reach me at info@getpoyro.com. Excited to chat and potentially collaborate on this issue!

giladgd commented 3 months ago

@dabs9 the file-based caching will be released as a non-breaking feature after the version 3 stable release. I just answered most of your questions in response to another comment here.

Contributions are welcome, but the file-based caching feature will have to wait a bit until the feature for using the GPUs of other machines lands first, so that it can be implemented in a stable manner without breaking changes. I'll let you know when that happens so you can help if you want.

I prefer to use GitHub Discussions for communications since it makes it easier for people new to this library to search for information in existing discussions, and relevant information shows up on Google, which is helpful when looking for stuff. I contemplated opening a Discord server, but I think GitHub Discussions is good enough for now.