withcatai / node-llama-cpp

Run AI models locally on your machine with node.js bindings for llama.cpp. Force a JSON schema on the model output on the generation level
https://withcatai.github.io/node-llama-cpp/
MIT License

Error: Conversation roles must alternate user/assistant/user/assistant/.. #263

Closed: AliAzlanAziz closed this issue 1 day ago

AliAzlanAziz commented 5 days ago

Issue description

I am running the same base prompt, with dynamic text inserted into it, to analyze each text, and I get the error at around the 100th prompt.

Expected Behavior

I should receive output in the format "Answer: <<0 or 1>>. Text: <>", as I am receiving for the other texts, for example:

...
Answer: 0. Text: Stars swim in the ocean of infinite poss...
Answer: 0. Text: Moonlit owls hoot the symphony of the en...
Answer: 0. Text: Cotton clouds float on the breeze of whi...
...

Actual Behavior

Error: Conversation roles must alternate user/assistant/user/assistant/

node-llama-cpp-project/node_modules/@huggingface/jinja/dist/index.js:1555
        throw new Error(args);
        ^

Error: Conversation roles must alternate user/assistant/user/assistant/...
    at ~/node-llama-cpp-project/node_modules/@huggingface/jinja/dist/index.js:1555:13
    at FunctionValue.value (~/node-llama-cpp-project/node_modules/@huggingface/jinja/dist/index.js:1529:24)
    at Interpreter.evaluateCallExpression (~/node-llama-cpp-project/node_modules/@huggingface/jinja/dist/index.js:1328:15)
    at Interpreter.evaluate (~/node-llama-cpp-project/node_modules/@huggingface/jinja/dist/index.js:1493:21)
    at Interpreter.evaluateBlock (~/node-llama-cpp-project/node_modules/@huggingface/jinja/dist/index.js:1300:34)
    at Interpreter.evaluateIf (~/node-llama-cpp-project/node_modules/@huggingface/jinja/dist/index.js:1411:17)
    at Interpreter.evaluate (~/node-llama-cpp-project/node_modules/@huggingface/jinja/dist/index.js:1466:21)
    at Interpreter.evaluateBlock (~/node-llama-cpp-project/node_modules/@huggingface/jinja/dist/index.js:1300:34)
    at Interpreter.evaluateFor (~/node-llama-cpp-project/node_modules/@huggingface/jinja/dist/index.js:1452:30)
    at Interpreter.evaluate (~/node-llama-cpp-project/node_modules/@huggingface/jinja/dist/index.js:1468:21)

Steps to reproduce

initializeLlama.ts

import {fileURLToPath} from "url";
import path from "path";
import {getLlama, LlamaChatSession, LlamaJsonSchemaGrammar, LlamaLogLevel} from "node-llama-cpp";

let session: LlamaChatSession;
let llamaGrammar: LlamaJsonSchemaGrammar<any>;

const initLlama = async () => {
  const __dirname = path.dirname(fileURLToPath(import.meta.url));
  const modelsFolderDirectory = path.join(__dirname, "../../", "models");

  const llama = await getLlama({
    // gpu: false,
    gpu: 'cuda',
    // logLevel: LlamaLogLevel.debug
  });

  const model = await llama.loadModel({
    // modelPath: path.join(modelsFolderDirectory, "llama-2-7b-chat.Q4_K_M.gguf"),
    modelPath: path.join(modelsFolderDirectory, "Mistral-7B-Instruct-v0.3.Q5_K_M.gguf"),
    gpuLayers: 'max'
  });
  console.log("Model Loaded");

  const context = await model.createContext();
  console.log("Context Created");

  session = new LlamaChatSession({
    contextSequence: context.getSequence()
  });
  console.log("Session Initialized");

}

export {
  session,
  llamaGrammar
}

export default initLlama;

analyzeAndRate.ts

import { texts } from "./dummy/dummyText"; // link is just below the code
import { llamaGrammar, session } from "./llama/initializeLlama";

export const runAnaylyzerAndRate = async (text: string) => {
    const query = `Answer 1 if the given text is semantically correct and sounds like a job experience else answer 0, do not write any text in the answer except for 0 and 1. Text: ${text}`

    const startTime = Date.now();

    const response = await session.prompt(query);

    const endTime = Date.now();
    const duration = (endTime - startTime)/1000;

    console.log(`Answer: ${response}. Text: ${text.substring(0, 40)}...`);
}

export const run = async () => {
    console.log('Analyzing meaningful text!')
    for(let i=0; i<texts.length; i++){
        console.log(`Index: ${i+1}`)
        await runAnaylyzerAndRate(texts[i].text)
    }
}

Import the texts from here -> dummyText.ts (pastebin file, only valid for 1 year from today)
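For completeness, a hypothetical entry point (not included in the original report) that wires the two files above together could look like this; the file name and import paths are assumptions based on the imports shown above:

// index.ts (hypothetical)
import initLlama from "./llama/initializeLlama";
import { run } from "./analyzeAndRate";

// Load the model and create the chat session first, then analyze every dummy text
await initLlama();
await run();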

My Environment

Dependency                Version
Operating System          Windows 11
CPU                       AMD Ryzen 7 7735HS with Radeon Graphics, 3.20 GHz
Node.js                   20.11.0
TypeScript                5.4.5
node-llama-cpp            3.0.0-beta.36

Relevant Features Used

Are you willing to resolve this issue by submitting a Pull Request?

No, I don’t have the time and I’m okay to wait for the community / maintainers to resolve this issue.

giladgd commented 3 days ago

This issue appears to be related to the chat template provided by the model, where it expects to receive a message array with non-standard role names. To resolve this issue, you can print the chat template provided by the model and investigate it to find the role names that the model expects:

import {fileURLToPath} from "url";
import path from "path";
import {getLlama, LlamaChatSession, JinjaTemplateChatWrapper} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "dolphin-2.1-mistral-7b.Q4_K_M.gguf")
});

console.log(model.fileInfo.metadata.tokenizer.chat_template);

You can also run this command to print all the GGUF file metadata, which includes the chat template:

npx --no node-llama-cpp inspect gguf <path to model>

After you find the roles that the model expects, you can customize the JinjaTemplateChatWrapper used by the chat session like this:

import {fileURLToPath} from "url";
import path from "path";
import {getLlama, LlamaChatSession, JinjaTemplateChatWrapper} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "dolphin-2.1-mistral-7b.Q4_K_M.gguf")
});
const context = await model.createContext();
const session = new LlamaChatSession({
    contextSequence: context.getSequence(),
    chatWrapper: new JinjaTemplateChatWrapper({
        template: model.fileInfo.metadata.tokenizer.chat_template!,
        modelRoleName: "assistant", // change this to the role name you find in the chat template
        userRoleName: "user" // change this to the role name you find in the chat template
    })
});

If you can provide me with a link to the model file you used it'll help me investigate the issue further.

AliAzlanAziz commented 2 days ago

I'm still receiving the error after applying your suggested changes; details below:

I updated my LlamaChatSession according to your guide as follows:

const chatWrapper = new JinjaTemplateChatWrapper({
  template: model.fileInfo.metadata.tokenizer.chat_template!,
  modelRoleName: 'assistant',
  userRoleName: 'user',
})

session = new LlamaChatSession({
  contextSequence: context.getSequence(),
  chatWrapper
});

This is what console.log(model.fileInfo.metadata.tokenizer.chat_template!) logged to the console:

{{ bos_token }}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if message['role'] == 'user' %}{{ '[INST] ' + message['content'] + ' [/INST]' }}{% elif message['role'] == 'assistant' %}{{ message['content'] + eos_token}}{% else %}{{ raise_exception('Only user and assistant roles are supported!') }}{% endif %}{% endfor %}

After running the command npx --no node-llama-cpp inspect gguf .\models\Mistral-7B-Instruct-v0.3.Q5_K_M.gguf >> Mistral-7B-Instruct-v0.3_Q5_K_M.txt, I got the following output:

Mistral-7B-Instruct-v0.3_Q5_K_M.txt

File: ~\node-llama-cpp-project\models\Mistral-7B-Instruct-v0.3.Q5_K_M.gguf
GGUF version: 3
Tensor count: 291
Metadata size: 723.27KB
Tensor info size: 16.88KB
File type: MOSTLY_Q5_K_M (17)
Metadata: {
    general: {
        architecture: "llama",
        name: "models--mistralai--Mistral-7B-Instruct-v0.3",
        file_type: 17,
        quantization_version: 2
    },
    llama: {
        block_count: 32,
        context_length: 32_768,
        embedding_length: 4_096,
        feed_forward_length: 14_336,
        attention: {
            head_count: 32,
            head_count_kv: 8,
            layer_norm_rms_epsilon: 0
        },
        rope: {
            freq_base: 1_000_000,
            dimension_count: 128
        },
        vocab_size: 32_768
    },
    tokenizer: {
        ggml: {
            model: "llama",
            pre: "default",
            tokens: ["<unk>", "<s>", "</s>", "[INST]", "[/INST]", "[TOOL_CALLS]", "[AVAILABLE_TOOLS]", "[/AVAILABLE_TOOLS]", "[TOOL_RESULTS]", "[/TOOL_RESULTS]", ...32758 more items],
            scores: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...32758 more items],
            token_type: [2, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...32758 more items],
            bos_token_id: 1,
            eos_token_id: 2,
            unknown_token_id: 0,
            add_bos_token: true,
            add_eos_token: false
        },
        chat_template: "{{ bos_token }}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if message['role'] == 'user' %}{{ '[INST] ' + message['content'] + ' [/INST]' }}{% elif message['role'] == 'assistant' %}{{ message['content'] + eos_token}}{% else %}{{ raise_exception('Only user and assistant roles are supported!') }}{% endif %}{% endfor %}"
    },
    quantize: {
        imatrix: {
            file: "./imatrix.dat",
            dataset: "group_40.txt",
            entries_count: 224,
            chunks_count: 74
        }
    }
}
Tensor info: [{name: "token_embd.weight", dimensions: [4_096, 32_768], ggmlType: 13, offset: 0}, {name: "blk.0.attn_norm.weight", dimensions: [4_096], ggmlType: 0, offset: 92_274_688}, {name: "blk.0.ffn_down.weight", dimensions: [14_336, 4_096], ggmlType: 14, offset: 92_291_072}, {name: "blk.0.ffn_gate.weight", dimensions: [4_096, 14_336], ggmlType: 13, offset: 140_460_032}, ...287 more items]

NOTE: You asked for the model link. I searched for it again, but the link is no longer valid, as the owner seems to have removed the model file. The model was "Mistral-7B-Instruct-v0.3" (file Mistral-7B-Instruct-v0.3.Q5_K_M.gguf) by TheBloke on Hugging Face. I am currently downloading Mistral-7B-Instruct-v0.2.Q5_K_M.gguf to try; I will comment again if the error persists with that model and will share the link as well.

AliAzlanAziz commented 2 days ago

@giladgd I'm getting the same error on v0.2 of the same model; just click the link to start downloading Mistral-7B-Instruct-v0.2.Q5_K_M.gguf.

AliAzlanAziz commented 1 day ago

@giladgd can you please tell me what I should do to resolve the issue? I have been assigned some work that is due in 3-4 days, and I have already written a lot of Node.js code for other tasks, including the above one; otherwise I would have switched to Python to run this model. Please help me resolve it if you can. I understand you might be really busy as the sole contributor to this whole project (node-llama-cpp).

giladgd commented 1 day ago

I've just run your code with the model you linked and it ran correctly. I used a Windows 10 machine with an Nvidia A6000 together with your code. I've put the output here.

giladgd commented 1 day ago

Try to scaffold a new project from a template and run it without any modification to see whether it works for you, so we can figure out whether the issue is related to your code, machine, OS or configuration:

npm create --yes node-llama-cpp@beta

AliAzlanAziz commented 1 day ago

If you are asking me to run the index.ts with the default code it comes with when creating a new project with npm create --yes node-llama-cpp@beta, then yes, it works fine, but there are only 5 prompts in that index.ts. What's surprising to me is that my code worked totally fine on your system! lol, great anyways. It could be a problem with my GPU (less powerful or incompatible) or my OS. Thank you for the immediate response.

AliAzlanAziz commented 1 day ago

By the way, can you guess the cause or give me some tips on debugging the problem? @giladgd

giladgd commented 1 day ago

The issue you encountered is related to the Jinja template of the model. node-llama-cpp runs a sanity check on a given Jinja template before using it, to ensure that all the messages of the chat history actually appear in the resulting text, so I think the error you're getting may be related to failing that sanity check.
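As an illustration (not part of the original reply), one way to poke at the template directly is to render it with @huggingface/jinja, the library named in the stack trace above. The direct import of Template, the bos/eos token literals, and the sample message array below are assumptions made for the sake of the sketch:

import {fileURLToPath} from "url";
import path from "path";
import {getLlama} from "node-llama-cpp";
import {Template} from "@huggingface/jinja";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "Mistral-7B-Instruct-v0.3.Q5_K_M.gguf")
});

// The same template string that the inspect command printed above
const chatTemplate = model.fileInfo.metadata.tokenizer.chat_template!;

// A history shaped like the one a chat session accumulates over repeated prompts
const messages = [
    {role: "user", content: "First prompt"},
    {role: "assistant", content: "First answer"},
    {role: "user", content: "Second prompt"}
];

try {
    const rendered = new Template(chatTemplate).render({
        messages,
        bos_token: "<s>",
        eos_token: "</s>"
    });
    console.log("Template rendered fine:\n" + rendered);
} catch (err) {
    // A history the template rejects ends up here with
    // "Conversation roles must alternate user/assistant/user/assistant/..."
    console.error("Template rejected the chat history:", err);
}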

It's possible that the Jinja template uses some feature that's not available on Node.js 20. Try updating to Node.js 22 and let me know whether that fixes the issue for you.

As a last resort, you can use GeneralChatWrapper instead of the model's Jinja template by passing chatWrapper: new GeneralChatWrapper() when creating a chat session. It may not perform as well as using the model's template, but it may still be good enough.
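For reference, a minimal sketch of that last-resort setup, reusing the model path from earlier in this thread (this snippet is not part of the original reply):

import {fileURLToPath} from "url";
import path from "path";
import {getLlama, LlamaChatSession, GeneralChatWrapper} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "Mistral-7B-Instruct-v0.3.Q5_K_M.gguf")
});
const context = await model.createContext();

// Use the generic chat wrapper instead of the model's own Jinja chat template
const session = new LlamaChatSession({
    contextSequence: context.getSequence(),
    chatWrapper: new GeneralChatWrapper()
});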

Also, try running this command to inspect your GPU:

npx --no node-llama-cpp inspect GPU

Your machine may not have enough RAM or VRAM to run everything correctly, and this is the easiest way to check that.

Unless you specifically know you need it, it's best not to set gpuLayers explicitly (gpuLayers: 'max' in your case). node-llama-cpp measures your system and attempts to optimize all parameters for the best performance, which may mean not offloading all the layers to the GPU so that enough VRAM is left for a large enough context, for example.
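As an illustration (not from the original reply), leaving the decision to node-llama-cpp simply means dropping the gpuLayers option from the loadModel call in initializeLlama.ts:

const model = await llama.loadModel({
    modelPath: path.join(modelsFolderDirectory, "Mistral-7B-Instruct-v0.3.Q5_K_M.gguf")
    // no gpuLayers option: node-llama-cpp decides how many layers to offload
    // based on the measured system resources
});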

AliAzlanAziz commented 1 day ago

Although I pasted the code with gpuLayers set to 'max', I did test it with the default settings too, and I have been testing without gpuLayers ever since. Also, I ran the GPU inspect command before starting the program and 3-4 times while it was running, before it exited with the error, and every time it printed the exact same percentages (so perhaps it isn't giving me real-time stats of the GPU):

[Updated Node to 22, as per suggestion]

OS: Windows 10.0.22631 (x64)
Node: 22.4.0 (x64)
TypeScript: 5.5.3
node-llama-cpp: 3.0.0-beta.36

CUDA: available
Vulkan: available

CUDA device: NVIDIA GeForce RTX 4060 Laptop GPU
CUDA used VRAM: 13.29% (1.06GB/8GB)
CUDA free VRAM: 86.7% (6.93GB/8GB)

Vulkan devices: NVIDIA GeForce RTX 4060 Laptop GPU, AMD Radeon(TM) Graphics
Vulkan used VRAM: 1.09% (90MB/8.02GB)
Vulkan free VRAM: 98.9% (7.93GB/8.02GB)

CPU model: AMD Ryzen 7 7735HS with Radeon Graphics
Used RAM: 61.1% (9.28GB/15.19GB)
Free RAM: 38.89% (5.91GB/15.19GB)

However, the last resort, chatWrapper: new GeneralChatWrapper(), saved my life, thank you so much. Love you bro :) Also, never mind, I am quite new; you could say this is my first time working with LLMs or with any ML/AI model, so it could be my shortcoming as well. Question: by the way, does the answer to a prompt depend on, or get adjusted based on, the previous prompts, if you know? @giladgd

giladgd commented 1 day ago

Glad you got it working :) A chat session holds all the previous messages, and next prompts will be answered in the context of the chat history in that chat session. If you'd like to prompt with a clean history, you can reset the chat session like this:

const session = new LlamaChatSession({
    contextSequence: context.getSequence()
});
const initialChatHistory = session.getChatHistory();

const a1 = await session.prompt(q1);
session.setChatHistory(initialChatHistory); // reset the chat session

const a2 = await session.prompt(q2); // the response here will not be aware of the previous prompt

I couldn't reproduce the issue you encountered, and since you have a solution by now, I'm closing this issue. If you find the cause or figure out exactly what fixed it (so I can investigate it), let me know and I'll reopen the issue.

AliAzlanAziz commented 1 day ago

I tried resetting the history to initialChatHistory (which is basically empty) after every prompt, and with that it worked with the default chat template too. That is, I commented out chatWrapper: new GeneralChatWrapper(), reset the history after every prompt, and I no longer receive the error "Error: Conversation roles must alternate user/assistant/user/assistant/...".

Though my use case happens to require resetting the chat history anyway, this isn't a real solution to the underlying error. Just updating you because I thought it might help you debug it to the root cause. @giladgd
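For reference, a rough sketch of the workaround described above, written as a modified run() in analyzeAndRate.ts; the exact placement of the reset is an assumption:

export const run = async () => {
    console.log('Analyzing meaningful text!')

    // Capture the empty chat history once, right after the session was created
    const initialChatHistory = session.getChatHistory();

    for(let i=0; i<texts.length; i++){
        console.log(`Index: ${i+1}`)
        await runAnaylyzerAndRate(texts[i].text)

        // Reset the history after every prompt so the chat template only ever
        // sees a single user message from this session
        session.setChatHistory(initialChatHistory);
    }
}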