withcatai / node-llama-cpp

Run AI models locally on your machine with node.js bindings for llama.cpp. Force a JSON schema on the model output on the generation level
https://node-llama-cpp.withcat.ai

feat: automatic batching #85

Open giladgd opened 10 months ago

giladgd commented 10 months ago

Also, automatically set the right contextSize and provide other good defaults to make the usage smoother.
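
For illustration, here is a rough sketch of the kind of usage automatic batching is meant to enable: several sequences from one context prompted concurrently, with their token evaluation batched behind the scenes. This assumes the beta API that appears later in this thread (LlamaModel, LlamaContext, context.getSequence(), LlamaChatSession), and it assumes a context can hand out more than one sequence; treat it as a sketch, not the final API.

import {fileURLToPath} from "url";
import path from "path";
import {LlamaModel, LlamaContext, LlamaChatSession} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const model = new LlamaModel({
    modelPath: path.join(__dirname, "models", "dolphin-2.1-mistral-7b.Q4_K_M.gguf")
});

// One context, two sequences (assumes the context is configured to expose more than one sequence)
const context = new LlamaContext({model});
const sessionA = new LlamaChatSession({contextSequence: context.getSequence()});
const sessionB = new LlamaChatSession({contextSequence: context.getSequence()});

// Both prompts run concurrently; automatic batching would group their token
// evaluation together instead of processing the sequences one after another.
const [a1, a2] = await Promise.all([
    sessionA.prompt("Write a haiku about the sea"),
    sessionB.prompt("Write a haiku about the mountains")
]);
console.log("A: " + a1);
console.log("B: " + a2);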

github-actions[bot] commented 9 months ago

🎉 This issue has been resolved in version 3.0.0-beta.1 🎉

The release is available on:

Your semantic-release bot 📦🚀

carlosgalveias commented 9 months ago

Just tried it and:

llm_load_print_meta: format           = GGUF V2
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32002
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = mostly Q4_K - Medium
llm_load_print_meta: model params     = 7.24 B
llm_load_print_meta: model size       = 4.07 GiB (4.83 BPW)
llm_load_print_meta: general.name   = ehartford_dolphin-2.1-mistral-7b
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 32000 '<|im_end|>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token  = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.11 MiB
llm_load_tensors: mem required  = 4165.48 MiB
...............................................................................................
llama_new_context_with_model: n_ctx      = 32768
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size  = 4096.00 MiB
llama_build_graph: non-view tensors processed: 740/740
llama_new_context_with_model: compute buffer total size = 136707.31 MiB
GGML_ASSERT: D:\a\node-llama-cpp\node-llama-cpp\llama\llama.cpp\llama.cpp:745: data

Tried to allocate 135 GB? :) Here's the test I used:

import path from "path";
import {
    LlamaModel,
    LlamaContext,
    LlamaChatSession,
    ChatMLChatPromptWrapper
} from "node-llama-cpp";

const model = new LlamaModel({
    modelPath: path.join("model", "dolphin-2.1-mistral-7b.Q4_K_M.gguf"),
});

const defaultSystemPrompt = 'You are an exceptional professional senior coder specialized in javascript and python talking with a human; you have exceptional attention to detail, and when writing code you write well-commented code while also describing every step.';
const context = new LlamaContext({ model });
const session = new LlamaChatSession({
    context,
    promptWrapper: new ChatMLChatPromptWrapper(),
    systemPrompt: defaultSystemPrompt,
    printLLamaSystemInfo: true
});
const interact = async function(prompt, id, cb) {
    id = id || 'test';
    try {
        await session.prompt(prompt, {
            onToken(chunk) {
                // Decode the streamed tokens and pass the text to the callback
                cb(context.decode(chunk));
            }
        });
    } catch (e) {
        console.error(e);
    }
}

let prompt = `
Hi, please write a multi-dimensional sort algorithm and explain all the code
`
const cb = function(text) {
  process.stdout.write(text);
}
const test = async function() {
  await interact(prompt, null, cb)
}

await test()

giladgd commented 9 months ago

@carlosgalveias I don't think you've installed the beta version, since I've just tried the model you mentioned here and it worked perfectly for me.

Make sure you install it this way:

npm install node-llama-cpp@beta

To use the 3.0.0-beta.1 version, do something like this:

import {fileURLToPath} from "url";
import path from "path";
import {LlamaModel, LlamaContext, LlamaChatSession} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const model = new LlamaModel({
    modelPath: path.join(__dirname, "models", "dolphin-2.1-mistral-7b.Q4_K_M.gguf")
});
const context = new LlamaContext({model});
const session = new LlamaChatSession({
    contextSequence: context.getSequence()
});

const q1 = "Hi there, how are you?";
console.log("User: " + q1);

const a1 = await session.prompt(q1);
console.log("AI: " + a1);

const q2 = "Summarize what you said";
console.log("User: " + q2);

const a2 = await session.prompt(q2);
console.log("AI: " + a2);
giladgd commented 9 months ago

@carlosgalveias I think I know what issue you have encountered. The 3.0.0-beta.1 version defaults the context size to the context size the model was trained on, and when that number is very large, a correspondingly large amount of memory is allocated for the context. I plan to make this library provide better defaults in one of the future betas, so for now, manually limit the context size of the model.

So for the 3.0.0-beta.1 version, do something like this:

import {fileURLToPath} from "url";
import path from "path";
import {LlamaModel, LlamaContext, LlamaChatSession} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const model = new LlamaModel({
    modelPath: path.join(__dirname, "models", "dolphin-2.1-mistral-7b.Q4_K_M.gguf")
});
const context = new LlamaContext({
    model,
    contextSize: Math.min(4096, model.trainContextSize)
});
const session = new LlamaChatSession({
    contextSequence: context.getSequence()
});

const q1 = "Hi there, how are you?";
console.log("User: " + q1);

const a1 = await session.prompt(q1);
console.log("AI: " + a1);

const q2 = "Summarize what you said";
console.log("User: " + q2);

const a2 = await session.prompt(q2);
console.log("AI: " + a2);
carlosgalveias commented 9 months ago

@giladgd gotcha, thanks!

Zambonilli commented 8 months ago

Will this change help with trying to do a lot of one-shot prompts in a loop? I'm seeing the error "could not find a KV slot for the batch (try reducing the size of the batch or increase the context)" regardless of what I set the batch size or context size to. I am creating the model and context outside of the loop and creating a new session inside the loop on every iteration. Since these are one-shot prompts, I don't really care about llama2 retaining any session information.
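
For reference, a minimal sketch of the loop described above (model and context created once, a new session per one-shot prompt), written against the beta API shown earlier in this thread; the prompts and the sequence handling here are assumptions for illustration, not the exact code that produced the error:

import {fileURLToPath} from "url";
import path from "path";
import {LlamaModel, LlamaContext, LlamaChatSession} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

// Model and context are created once, outside the loop
const model = new LlamaModel({
    modelPath: path.join(__dirname, "models", "dolphin-2.1-mistral-7b.Q4_K_M.gguf")
});
const context = new LlamaContext({
    model,
    contextSize: Math.min(4096, model.trainContextSize)
});
const sequence = context.getSequence();

const prompts = ["prompt 1", "prompt 2", "prompt 3"]; // placeholder one-shot prompts

for (const prompt of prompts) {
    // A new session on every iteration; no chat history is carried over on purpose
    const session = new LlamaChatSession({contextSequence: sequence});
    const answer = await session.prompt(prompt);
    console.log(answer);
}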

giladgd commented 7 months ago

@Zambonilli Can you please open a bug issue for this? It'll help me to investigate and fix the issue. Please also include code I can run to reproduce the issue and a link to the specific model file you used.

Zambonilli commented 7 months ago

@giladgd no problem, here is the ticket.