withcatai / node-llama-cpp

Run AI models locally on your machine with Node.js bindings for llama.cpp. Enforce a JSON schema on the model output at the generation level
https://node-llama-cpp.withcat.ai
MIT License

feat: version 3.0 #105

giladgd opened this pull request 9 months ago (status: Open)

giladgd commented 9 months ago

How to use this beta

To install the beta version of node-llama-cpp, run this command inside of your project:

npm install node-llama-cpp@beta

To get started quickly, generate a new project from a template by running this command:

npm create --yes node-llama-cpp@beta

The interface of node-llama-cpp will change multiple times before a new stable version is released, so the documentation for the new version will only be updated shortly before the stable release. If you'd like to use this beta, check this PR for updated examples of how to use the latest beta version.

How you can help

Included in this beta

Detailed changelog for every beta version can be found here

Planned changes before release

CLI usage

Chat with popular recommended models in your terminal with a single command:

npx --yes node-llama-cpp@beta chat

Check what GPU devices are automatically detected by node-llama-cpp in your project with this command:

npx --no node-llama-cpp inspect gpu

Run this command inside of your project directory

Usage example

Relevant for the 3.0.0-beta.39 version

import {fileURLToPath} from "url";
import path from "path";
import {getLlama, LlamaChatSession} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "dolphin-2.1-mistral-7b.Q4_K_M.gguf")
});
const context = await model.createContext();
const session = new LlamaChatSession({
    contextSequence: context.getSequence()
});

const q1 = "Hi there, how are you?";
console.log("User: " + q1);

const a1 = await session.prompt(q1);
console.log("AI: " + a1);

const q2 = "Summarize what you said";
console.log("User: " + q2);

const a2 = await session.prompt(q2);
console.log("AI: " + a2);

How to stream a response

Relevant for the 3.0.0-beta.39 version

import {fileURLToPath} from "url";
import path from "path";
import {getLlama, LlamaChatSession} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "dolphin-2.1-mistral-7b.Q4_K_M.gguf")
});
const context = await model.createContext();
const session = new LlamaChatSession({
    contextSequence: context.getSequence()
});

const q1 = "Hi there, how are you?";
console.log("User: " + q1);

const a1 = await session.prompt(q1, {
    onTextChunk(chunk) {
        process.stdout.write(chunk);
    }
});
console.log("AI: " + a1);

How to use function calling

Some models have official support for function calling in node-llama-cpp (such as Llama 3.1 Instruct and Llama 3 Instruct), while other models fall back to a generic function calling mechanism that works with many models, but not all of them.

Relevant for the 3.0.0-beta.39 version

import {fileURLToPath} from "url";
import path from "path";
import {getLlama, defineChatSessionFunction, LlamaChatSession} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "Meta-Llama-3.1-8B-Instruct.Q4_K_M.gguf")
});
const context = await model.createContext();
const functions = {
    getDate: defineChatSessionFunction({
        description: "Retrieve the current date",
        handler() {
            return new Date().toLocaleDateString();
        }
    }),
    getNthWord: defineChatSessionFunction({
        description: "Get an n-th word",
        params: {
            type: "object",
            properties: {
                n: {
                    enum: [1, 2, 3, 4]
                }
            }
        },
        handler(params) {
            return ["very", "secret", "this", "hello"][params.n - 1];
        }
    })
};
const session = new LlamaChatSession({
    contextSequence: context.getSequence()
});

const q1 = "What is the second word?";
console.log("User: " + q1);

const a1 = await session.prompt(q1, {functions});
console.log("AI: " + a1);

const q2 = "What is the date? Also tell me the word I previously asked for";
console.log("User: " + q2);

const a2 = await session.prompt(q2, {functions});
console.log("AI: " + a2);

In this example I used this model

How to get embedding for text

Relevant for the 3.0.0-beta.39 version

import {fileURLToPath} from "url";
import path from "path";
import {getLlama} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "Meta-Llama-3.1-8B-Instruct.Q4_K_M.gguf")
});
const embeddingContext = await model.createEmbeddingContext();

const text = "Hello world";
const embedding = await embeddingContext.getEmbeddingFor(text);

console.log(text, embedding.vector);
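Since embedding.vector is a plain numeric array, you can compare texts by the cosine similarity of their vectors. Here's a minimal sketch (not part of the original example), assuming the embeddingContext from the snippet above:

const helloEmbedding = await embeddingContext.getEmbeddingFor("Hello world");
const goodbyeEmbedding = await embeddingContext.getEmbeddingFor("Goodbye world");

// plain cosine similarity between two embedding vectors
function cosineSimilarity(a, b) {
    let dot = 0, normA = 0, normB = 0;
    for (let i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

console.log("similarity:", cosineSimilarity(helloEmbedding.vector, goodbyeEmbedding.vector));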

How to customize binding settings

Relevant for the 3.0.0-beta.39 version

import {fileURLToPath} from "url";
import path from "path";
import {getLlama, LlamaChatSession, LlamaLogLevel} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama({
    logLevel: LlamaLogLevel.debug // enable debug logs from llama.cpp
});
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "dolphin-2.1-mistral-7b.Q4_K_M.gguf"),
    onLoadProgress(loadProgress: number) {
        console.log(`Load progress: ${loadProgress * 100}%`);
    }
});
const context = await model.createContext();
const session = new LlamaChatSession({
    contextSequence: context.getSequence()
});

const q1 = "Hi there, how are you?";
console.log("User: " + q1);

const a1 = await session.prompt(q1);
console.log("AI: " + a1);

How to generate a completion

Relevant for the 3.0.0-beta.39 version

import {fileURLToPath} from "url";
import path from "path";
import {getLlama, LlamaCompletion} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "stable-code-3b.Q5_K_M.gguf")
});
const context = await model.createContext();
const completion = new LlamaCompletion({
    contextSequence: context.getSequence()
});

const input = "const arrayFromOneToTwenty = [1, 2, 3,";
console.log("Input: " + input);

const res = await completion.generateCompletion(input);
console.log("Completion: " + res);

In this example I used this model

How to generate an infill

Infill, also known as fill-in-middle, is used to generate a completion for an input that should connect to a given continuation. For example, for a prefix input 123 and suffix input 789, the model is expected to generate 456 to make the final text be 123456789.

Not every model supports infill, so only those that do can be used for generating an infill.

Relevant for the 3.0.0-beta.39 version

import {fileURLToPath} from "url";
import path from "path";
import {getLlama, LlamaCompletion, UnsupportedError} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "stable-code-3b.Q5_K_M.gguf")
});
const context = await model.createContext();
const completion = new LlamaCompletion({
    contextSequence: context.getSequence()
});

if (!completion.infillSupported)
    throw new UnsupportedError("Infill completions are not supported by this model");

const prefix = "const arrayFromOneToFourteen = [1, 2, 3, ";
const suffix = "10, 11, 12, 13, 14];";
console.log("prefix: " + prefix);
console.log("suffix: " + suffix);

const res = await completion.generateInfillCompletion(prefix, suffix);
console.log("Infill: " + res);

In this example I used this model

Using a specific compute layer

Relevant for the 3.0.0-beta.39 version

node-llama-cpp detects the available compute layers on the system and uses the best one by default. If the best one fails to load, it'll try the next best option and so on until it manages to load the bindings.

To use this logic, just use getLlama without specifying the compute layer:

import {getLlama} from "node-llama-cpp";

const llama = await getLlama();

To force it to load a specific compute layer, you can use the gpu parameter on getLlama:

import {getLlama} from "node-llama-cpp";

const llama = await getLlama({
    gpu: "vulkan" // defaults to `"auto"`. can also be `"cuda"` or `false` (to not use the GPU at all)
});

To inspect what compute layers are detected in your system, you can run this command:

npx --no node-llama-cpp inspect gpu

If this command fails to find CUDA or Vulkan even though getLlama with gpu set to one of them works, please open an issue so we can investigate it
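For a programmatic check, here's a minimal sketch that logs which compute layer was actually loaded; it assumes the llama.gpu property reflects the loaded compute layer (an assumption, not taken from the examples above):

import {getLlama} from "node-llama-cpp";

const llama = await getLlama();

// assumed to be "cuda", "vulkan", "metal", or false when running on the CPU only
console.log("Loaded compute layer:", llama.gpu);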

Using TemplateChatWrapper

Relevant for the 3.0.0-beta.39 version

To create a simple chat wrapper to use in a LlamaChatSession, you can use TemplateChatWrapper.

For more advanced cases, implement a custom wrapper by inheriting ChatWrapper.

Example usage:

import {fileURLToPath} from "url";
import path from "path";
import {getLlama, LlamaChatSession, TemplateChatWrapper} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "dolphin-2.1-mistral-7b.Q4_K_M.gguf")
});
const context = await model.createContext();
const chatWrapper = new TemplateChatWrapper({
    template: "{{systemPrompt}}\n{{history}}model:{{completion}}\nuser:",
    historyTemplate: "{{roleName}}: {{message}}\n",
    modelRoleName: "model",
    userRoleName: "user",
    systemRoleName: "system", // optional
    // functionCallMessageTemplate: { // optional
    //     call: "[[call: {{functionName}}({{functionParams}})]]",
    //     result: " [[result: {{functionCallResult}}]]"
    // }
});
const session = new LlamaChatSession({
    contextSequence: context.getSequence(),
    chatWrapper
});

const q1 = "Hi there, how are you?";
console.log("User: " + q1);

const a1 = await session.prompt(q1);
console.log("AI: " + a1);

const q2 = "Summarize what you said";
console.log("User: " + q2);

const a2 = await session.prompt(q2);
console.log("AI: " + a2);

{{systemPrompt}} is optional and is replaced with the first system message (when it is, that system message is not included in the history).

{{history}} is replaced with the chat history. Each message in the chat history is converted using the template passed to historyTemplate, and all messages are joined together.

{{completion}} is where the model's response is generated. The text that comes after {{completion}} is used to determine when the model has finished generating the response, and thus is mandatory.

functionCallMessageTemplate is used to specify the format in which functions can be called by the model and how their results are fed to the model after the function call.
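For example, here's a sketch of the same wrapper with the function call template filled in, using the exact format from the commented-out lines above:

const chatWrapperWithFunctions = new TemplateChatWrapper({
    template: "{{systemPrompt}}\n{{history}}model:{{completion}}\nuser:",
    historyTemplate: "{{roleName}}: {{message}}\n",
    modelRoleName: "model",
    userRoleName: "user",
    systemRoleName: "system",
    functionCallMessageTemplate: {
        call: "[[call: {{functionName}}({{functionParams}})]]",
        result: " [[result: {{functionCallResult}}]]"
    }
});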

Using JinjaTemplateChatWrapper

Relevant for the 3.0.0-beta.39 version

You can use an existing Jinja template with JinjaTemplateChatWrapper, but note that not all of Jinja's functionality is supported yet. If you want to create a new chat wrapper from scratch, using this chat wrapper is not recommended; instead, it's better to inherit from the ChatWrapper class and implement a custom chat wrapper of your own in TypeScript.

Example usage:

import {fileURLToPath} from "url";
import path from "path";
import {getLlama, LlamaChatSession, JinjaTemplateChatWrapper} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "dolphin-2.1-mistral-7b.Q4_K_M.gguf")
});
const context = await model.createContext();
const chatWrapper = new JinjaTemplateChatWrapper({
    template: "<Jinja template here>"
});
const session = new LlamaChatSession({
    contextSequence: context.getSequence(),
    chatWrapper
});

const q1 = "Hi there, how are you?";
console.log("User: " + q1);

const a1 = await session.prompt(q1);
console.log("AI: " + a1);

Custom memory management options

Relevant for the 3.0.0-beta.39 version

node-llama-cpp adapts to the current free VRAM state to choose the best default gpuLayers and contextSize values, maximizing them within the available VRAM. It's best not to customize gpuLayers and contextSize in order to utilize this feature, but you can also set a gpuLayers value with your own constraints, and node-llama-cpp will try to adapt to it.

node-llama-cpp also predicts how much VRAM is needed to load a model or create a context when you pass a specific gpuLayers or contextSize value, and throws an error if there isn't enough VRAM, to make sure the process won't crash. Those estimations are not always accurate, so if you find that it throws an error when it shouldn't, you can pass ignoreMemorySafetyChecks to force node-llama-cpp to ignore those checks. Also, in case those calculations are way too inaccurate, please let us know here and attach the output of npx --no node-llama-cpp inspect measure <model path> along with a link to the model file you used.

import {fileURLToPath} from "url";
import path from "path";
import {getLlama} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "dolphin-2.1-mistral-7b.Q4_K_M.gguf"),
    gpuLayers: {
        min: 20,
        fitContext: {
            contextSize: 8192 // to make sure there will be enough VRAM left to create a context with this size
        }
    }
});
const context = await model.createContext({
    contextSize: {
        min: 8192 // will throw an error if a context with this context size cannot be created
    }
});
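If the VRAM estimation is way off for your setup, here's a minimal sketch of bypassing it with ignoreMemorySafetyChecks, assuming the option is accepted by both loadModel and createContext; use it with care, since a real out-of-memory condition can still crash the process:

const unsafeModel = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "dolphin-2.1-mistral-7b.Q4_K_M.gguf"),
    ignoreMemorySafetyChecks: true // skip the VRAM estimation when loading the model
});
const unsafeContext = await unsafeModel.createContext({
    contextSize: 8192,
    ignoreMemorySafetyChecks: true // skip the VRAM estimation when creating the context
});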

Token bias

Relevant for the 3.0.0-beta.39 version

Here is an example of how to increase the probability of the word "hello" being generated and prevent the word "day" from being generated:

import {fileURLToPath} from "url";
import path from "path";
import {getLlama, LlamaChatSession, TokenBias} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "dolphin-2.1-mistral-7b.Q4_K_M.gguf")
});
const context = await model.createContext();
const session = new LlamaChatSession({
    contextSequence: context.getSequence()
});

const q1 = "Hi there, how are you?";
console.log("User: " + q1);

const a1 = await session.prompt(q1, {
    tokenBias: (new TokenBias(model))
        .set("Hello", 1)
        .set("hello", 1)
        .set("Day", "never")
        .set("day", "never")
        .set(model.tokenize("day"), "never") // you can also do this to set bias for specific tokens
});
console.log("AI: " + a1);

Prompt preloading

Preloading a prompt while the user is still typing can make the model start generating a response to the final prompt much earlier, as it builds most of the context state needed to generate the response.

Relevant for the 3.0.0-beta.39 version

import {fileURLToPath} from "url";
import path from "path";
import {getLlama, LlamaChatSession} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "dolphin-2.1-mistral-7b.Q4_K_M.gguf")
});
const context = await model.createContext();
const session = new LlamaChatSession({
    contextSequence: context.getSequence()
});

const q1 = "Hi there, how are you?";

await session.preloadPrompt(q1);

console.log("User: " + q1);

// now prompting the model will start generating a response much earlier
const a1 = await session.prompt(q1);
console.log("AI: " + a1);

Prompt completion

Prompt completion is a feature that allows you to generate a completion for a prompt without actually prompting the model.

The completion is context-aware and is generated based on the prompt and the current context state.

When generating a completion for a prompt, there's no need to preload the prompt first, as the completion method preloads the prompt automatically.

Relevant for the 3.0.0-beta.39 version

import {fileURLToPath} from "url";
import path from "path";
import {getLlama, LlamaChatSession} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "dolphin-2.1-mistral-7b.Q4_K_M.gguf")
});
const context = await model.createContext();
const session = new LlamaChatSession({
    contextSequence: context.getSequence()
});

const partialPrompt = "What is the best ";
console.log("Partial prompt: " + partialPrompt);

const completion = await session.completePrompt(partialPrompt);
console.log("Completion: " + completion);

Pull-Request Checklist

nathanlesage commented 7 months ago

Hey, I have switched to the beta due to the infamous n_tokens <= n_batch error, and I saw that it is now possible to automatically detect the correct context size. However, there is a problem with that: I have been trying this out with Mistral's OpenOrca 7B in the Q4_K_M quantization, and the issue is that the training context is 2^15 (32,768), but the quantized version reduces this context to 2,048. With your code example, this will immediately crash the entire server, since contextSize: Math.min(4096, model.trainContextSize) will in this case resolve to contextSize: Math.min(4096, 32768) and then to contextSize: 4096, which is > 2048.

I know that it's not always possible to detect the correct context length, but it would be great if this would not crash the entire app, and instead, e.g., throw an error.

Is it possible to add a mechanism to not crash the module if the provided context size is different from the training context size?

giladgd commented 7 months ago

@nathanlesage I'm pretty sure that the reason your app crashes is that a larger context size requires more VRAM, and your machine doesn't have enough VRAM for a context length of 4096 but has enough for 2048. If you try to create a context with a larger size than the model supports, it won't crash your app, but it may cause the model to generate gibberish as it crosses the supported context length.

Unfortunately, it's not possible to safeguard against this at the moment on node-llama-cpp's side since llama.cpp is the one that crashes the process, and node-llama-cpp is not aware of the available VRAM and memory requirements for creating a context with a specific size.

To mitigate this issue I've created this feature request on llama.cpp: https://github.com/ggerganov/llama.cpp/issues/4315 After this feature is added on llama.cpp I'll be able to improve this situation on node-llama-cpp's side.

If this issue is something you expect to happen frequently in your application lifecycle, you can wrap your code with a worker thread until this is fixed properly.
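For reference, a rough sketch of that worker-thread wrapper (the llama-worker.js file name is hypothetical). Depending on how llama.cpp fails, a separate child process may be needed instead, since a hard native abort can still take down the whole process:

import {Worker} from "worker_threads";

// spawns a worker that loads the model, prompts it, and posts back the response
function promptInWorker(prompt) {
    return new Promise((resolve, reject) => {
        const worker = new Worker(new URL("./llama-worker.js", import.meta.url), {
            workerData: {prompt}
        });
        worker.once("message", resolve);
        worker.once("error", reject);
        worker.once("exit", (code) => {
            if (code !== 0)
                reject(new Error(`Worker stopped with exit code ${code}`));
        });
    });
}

// llama-worker.js would call getLlama/loadModel/createContext, prompt the session
// with workerData.prompt, and then call parentPort.postMessage(response).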

nathanlesage commented 7 months ago

I thought that at first, but then I tried the same code on a Windows computer, also with 16 GB of RAM, and it didn't crash. Then I tried out the most recent llama.cpp "manually" (i.e., pulled and ran main) and it worked even with the larger context sizes. I'm beginning to think that this was a bug in the Metal code of llama.cpp -- I'll try out beta.2 that you just released; hopefully that fixes the issue.

And thanks for the tip with the worker, I begin to feel a bit stupid for not realizing this earlier, but I've never worked so closely with native code in node before 🙈

hiepxanh commented 7 months ago

@giladgd Hi, regarding the embedding function: could you follow the EmbeddingsInterface from LangChain? See https://github.com/langchain-ai/langchainjs/blob/5df71ccbc734f41b79b486ae89281c86fbb70768/langchain-core/src/embeddings.ts#L9

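There's no official adapter yet, but here's a minimal sketch of wrapping an embedding context in the shape of the linked EmbeddingsInterface (embedQuery/embedDocuments); the class name is hypothetical:

class LlamaCppEmbeddings {
    constructor(embeddingContext) {
        this.embeddingContext = embeddingContext;
    }

    async embedQuery(text) {
        const embedding = await this.embeddingContext.getEmbeddingFor(text);
        return [...embedding.vector];
    }

    async embedDocuments(documents) {
        return await Promise.all(documents.map((document) => this.embedQuery(document)));
    }
}

An instance could then be constructed with the embeddingContext from the embedding example above and passed wherever LangChain expects an Embeddings implementation.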

mstankala commented 6 months ago

I'm missing the LlamaContext.decode() function for the tokens when streaming the prompt using the chat session's onToken "event":

chatSession.prompt(message, {
    onToken(chunk) {
        console.debug(context.decode(chunk)); // ?
        // ...
    }
});

Is there a substitute for that?

eskan commented 6 months ago

@mstankala :

model.detokenize(chunk)
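To illustrate, a small sketch of streaming text with the beta's onToken option together with model.detokenize, assuming a session and model set up as in the usage example above:

const answer = await session.prompt(question, {
    onToken(tokens) {
        // tokens is an array of token values; detokenize turns them back into text
        process.stdout.write(model.detokenize(tokens));
    }
});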

ruochenjia commented 6 months ago

#169: Please fix this before the release

scenaristeur commented 6 months ago

Transferred to a new discussion: https://github.com/withcatai/node-llama-cpp/discussions/176

giladgd commented 6 months ago

@scenaristeur As you can see from the logs, node-llama-cpp detected that you have Vulkan and used it by default. It's still not smart enough to only offload to the GPU as much as can fit in its VRAM, but I plan to implement that in one of the next few beta versions.

For now, you can either disable the GPU support by passing gpu: false to getLlama:

import {getLlama} from "node-llama-cpp";

const llama = await getLlama({
    gpu: false
});

Or you can lower the context size to make the context consume much less VRAM, to a level that fits in your GPU's VRAM. To inspect how much VRAM you have, you can run this command:

npx --no node-llama-cpp inspect gpu
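For the second option, a minimal sketch of explicitly requesting a smaller context (2048 here is just an example value):

const context = await model.createContext({
    contextSize: 2048 // a smaller context consumes less VRAM
});
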
scenaristeur commented 6 months ago

It's a 16-core CPU only, no GPU; I'll try getLlama with gpu: false, thanks. Perhaps I installed some Vulkan tools while trying out some LLMs, but it's CPU only.

scenaristeur commented 6 months ago

Thanks, it works with gpu: false, but I've lost conversationHistory. How do I deal with conversation history in the beta version? I'm working on a server where there can be multiple sessions, each with its own history. In what format should the history be injected into a session, and into which class?

giladgd commented 6 months ago

@scenaristeur Please open a discussion for your questions; I don't want to spam with notifications everyone who watches this PR

nathanlesage commented 6 months ago

Regarding conversation history: if your application is GPL 3.0 compliant, feel free to take inspiration from how that's done here: https://github.com/nathanlesage/local-chat

scenaristeur commented 6 months ago

Transferred to https://github.com/withcatai/node-llama-cpp/discussions/176

scenaristeur commented 5 months ago

Switching from beta.13 to beta.14 gives me:

file:///home/smag/dev/igora/node_modules/node-llama-cpp/dist/evaluator/LlamaModel.js:22
    constructor({ modelPath, gpuLayers, vocabOnly, useMmap, useMlock, onLoadProgress, loadSignal }, { _llama }) {
                                                                                                      ^

TypeError: Cannot destructure property '_llama' of 'undefined' as it is undefined.
    at new LlamaModel (file:///home/smag/dev/igora/node_modules/node-llama-cpp/dist/evaluator/LlamaModel.js:22:103)
    at new McConnector (file:///home/smag/dev/igora/src/mcConnector/index.js:30:13)
    at new Worker (file:///home/smag/dev/igora/src/worker/index.js:14:24)
    at file:///home/smag/dev/igora/index.js:28:16

Node.js v20.10.0
juned-adenwalla commented 4 months ago

I am using the beta version, and with the code below, after asking a question (giving a prompt) I don't get any output; it's totally blank. Can anyone please help, as I am badly stuck here?

import {fileURLToPath} from "url";
import path from "path";
import {getLlama, LlamaChatSession} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama({
    gpu: false
});
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "phi-2-orange.Q2_K.gguf")
});
const context = await model.createContext({
    contextSize: Math.min(4096, model.trainContextSize)
});
const session = new LlamaChatSession({
    contextSequence: context.getSequence()
});

const q1 = "Hi there, how are you?";
console.log("User: " + q1);

const a1 = await session.prompt(q1);
console.log("AI: " + a1);

const q2 = "Summarize what you said";
console.log("User: " + q2);

const a2 = await session.prompt(q2);
console.log("AI: " + a2);
linonetwo commented 4 months ago

@juned-adenwalla You can ask about it in the Discussions section.

I'm using it now and it works fine with a GPU. You might want to look at #199 about the __dirname issue, or try a different model like https://huggingface.co/Qwen/Qwen1.5-32B-Chat-GGUF

nathanlesage commented 4 months ago

I currently have two issues preventing me from updating from beta.13 to beta.17.

  1. Beginning with beta.14, loading models always fails with defaults that worked in beta.13, with an error message claiming that the settings require more VRAM than I have (given that the same settings work in beta.13, I believe this to be incorrect). Unfortunately, I don't know whether the bug originates in this library or in llama.cpp.
  2. During development, the library itself loads fine, but after being packaged it tells me that it requires the (old) ggml-meta.metal library which, as far as I can see, has been replaced with default.metallib – do you have an idea what might cause this? I am a bit confused that the library would complain about a missing library that appears to have been consciously replaced with a different target…?

It would be great if you could give me some pointers so that I can debug it!

(Also: you've mentioned in the changelog for beta.17 that the lib now supports Llama 3, but I can confirm that beta.13 already works fine with quantized Llama 3 models!)

giladgd commented 4 months ago

@nathanlesage Have you seen my response to your message in the feedback discussion? I'd like to resolve your issues before I release version 3 as stable.


For anyone who sees this, please share your feedback on the version 3 beta feedback discussion and not on this PR.