withcatai / node-llama-cpp

Run AI models locally on your machine with Node.js bindings for llama.cpp. Enforce a JSON schema on the model output at the generation level.
https://node-llama-cpp.withcat.ai
MIT License

feat: automatic batching #104

Closed giladgd closed 9 months ago

giladgd commented 9 months ago

Description of change

BREAKING CHANGE: completely new API (docs will be updated before a stable version is released)

Closes #85 Fixes #102 Fixes #94 Fixes #93 Fixes #76

Things left to do (in other PRs)

Pull-Request Checklist

github-actions[bot] commented 9 months ago

:tada: This PR is included in version 3.0.0-beta.1 :tada:

The release is available on:

Your semantic-release bot :package::rocket:

Madd0g commented 4 months ago

Is there a code snippet that shows how to correctly use batching? I'm doing repetitive things in a loop and am wondering how I might take advantage of this.

giladgd commented 4 months ago

@Madd0g There will be a better example in the documentation when version 3 leaves the beta status soon, but for now, here's a simple example:

```typescript
import {fileURLToPath} from "url";
import path from "path";
import {getLlama, LlamaChatSession} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "dolphin-2.1-mistral-7b.Q4_K_M.gguf")
});

// A single context with 2 sequences; evaluations on sequences of the
// same context are batched together automatically
const context = await model.createContext({
    sequences: 2
});

const sequence1 = context.getSequence();
const sequence2 = context.getSequence();

const session1 = new LlamaChatSession({
    contextSequence: sequence1
});
const session2 = new LlamaChatSession({
    contextSequence: sequence2
});

const q1 = "Hi there, how are you?";
const q2 = "How much is 6+6?";

// Prompting both sessions concurrently lets their evaluations be batched
const [
    a1,
    a2
] = await Promise.all([
    session1.prompt(q1),
    session2.prompt(q2)
]);

console.log("User: " + q1);
console.log("AI: " + a1);

console.log("User: " + q2);
console.log("AI: " + a2);
```

The batching is done automatically across sequences of the same context.
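For the "repetitive things in a loop" use case, one way to apply the pattern above is to create one sequence per desired degree of parallelism and feed chunks of prompts through them with `Promise.all`. This is only a sketch assuming the v3 beta API shown above; the model path, prompt list, and `parallelism` value are placeholders, and note that each session keeps its chat history across iterations, which may or may not be what you want:

```typescript
import {fileURLToPath} from "url";
import path from "path";
import {getLlama, LlamaChatSession} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();
const model = await llama.loadModel({
    // placeholder path - point this at your own model file
    modelPath: path.join(__dirname, "models", "dolphin-2.1-mistral-7b.Q4_K_M.gguf")
});

const prompts = ["Question 1", "Question 2", "Question 3", "Question 4"];
const parallelism = 2;

// One sequence per concurrent prompt; they all share one context,
// so their evaluations get batched together
const context = await model.createContext({sequences: parallelism});
const sessions = Array.from({length: parallelism}, () =>
    new LlamaChatSession({contextSequence: context.getSequence()})
);

const answers: string[] = [];
// Process the prompts in chunks of `parallelism`; within each chunk
// the prompts run concurrently and are batched automatically
for (let i = 0; i < prompts.length; i += parallelism) {
    const chunk = prompts.slice(i, i + parallelism);
    const chunkAnswers = await Promise.all(
        chunk.map((prompt, j) => sessions[j].prompt(prompt))
    );
    answers.push(...chunkAnswers);
}

console.log(answers);
```

If each prompt should be independent, you'd want a fresh chat session (or cleared history) per chunk instead of reusing the same sessions.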