withcatai / node-llama-cpp

Run AI models locally on your machine with node.js bindings for llama.cpp. Force a JSON schema on the model output on the generation level
https://withcatai.github.io/node-llama-cpp/
MIT License

Inconsistent tokenization/encoding #186

Closed: StrangeBytesDev closed this issue 3 months ago

StrangeBytesDev commented 3 months ago

Issue description

Tokenization via LlamaContext.encode (or model.tokenize in v3) differs significantly from LlamaCPP's tokenization

Expected Behavior

Tokenizing should be consistent between the LlamaCPP server, the tokenization example, and node-llama-cpp, especially pertaining to special tokens.

Actual Behavior

Tokenizing the following string with either LlamaCPP's tokenize example or the server endpoint produces the tokens below.
Model: Dolphin 2.6 Phi-2
Input: <|im_start|>user\nHello<|im_end|>
Tokens: [ 50296, 7220, 198, 15496, 50295 ]

Tokenizing with LlamaContext.encode produces the following: [27, 91, 320, 62, 9688, 91, 29, 7220, 198, 15496, 27, 91, 320, 62, 437, 91, 29]

Using functionary-small-v2.2.q4_0.gguf:
Input: "<|from|>user\n<|content|>Hello"
LlamaCPP tokenizer or server endpoint: [ 32002, 1838, 13, 32000, 16230 ]
LlamaContext.encode: [523, 28766, 3211, 28766, 28767, 1838, 13, 28789, 28766, 3789, 28766, 28767, 16230]

I also tested with Hermes-2-Pro-Mistral-7b and observed the same behavior.

Importantly, special tokens like "<|im_start|>" are being split up into individual tokens, "<", "|", etc. This has a huge impact on how a model interprets inputs.

Steps to reproduce

Tokenize with node-llama-cpp

import path from "path"
import {LlamaModel, LlamaContext, LlamaChatSession} from "node-llama-cpp"

const model = new LlamaModel({
    modelPath: path.resolve("/path/to/functionary-small-v2.2.q4_0.gguf")
})
const context = new LlamaContext({model})
const prompt = `<|from|>user\n<|content|>Hello`
const tokens = context.encode(prompt)
console.log(tokens)

Tokenize with the LlamaCPP tokenize example

./bin/tokenize /path/to/functionary-small-v2.2.q4_0.gguf "<|from|>user\n<|content|>Hello"

Tokenize with the LlamaCPP server
Start the LlamaCPP server with functionary loaded, then:

const res = await fetch('http://localhost:8080/tokenize', {
    method: 'POST',
    body: JSON.stringify({
        content: '<|from|>user\n<|content|>Hello'
    }),
})
console.log(await res.json())

My Environment

Operating System:
CPU: AMD Ryzen 5 PRO 5650U
Node.js version: v10.11.0
node-llama-cpp version: 2.8.9

Additional Context

The results above are all from 2.8.9, although I observed the same behavior with 3.0.0-beta.14.

Are you willing to resolve this issue by submitting a Pull Request?

Yes, I have the time, but I don't know how to start. I would need guidance.

giladgd commented 3 months ago

@StrangeBytesDev This issue was already fixed in version 3 beta.

In the version 3 beta, to tokenize an input that contains special tokens, enable the specialTokens parameter:

import {fileURLToPath} from "url";
import path from "path";
import {getLlama} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "functionary-small-v2.2.q4_0.gguf")
});

const text = "<|from|>user\n<|content|>Hello";

console.log("With special tokens:", model.tokenize(text, true));
console.log("Without special tokens:", model.tokenize(text));

StrangeBytesDev commented 3 months ago

Oh awesome, I totally missed that. I like that it's available optionally. I don't think I've seen any other library or API that offers it as an option, and I can see some use cases where it would be useful to have both. I'm having a bit of a hard time getting my head around how tokenization in the generateCompletion function is handled. I'm under the impression that there currently isn't a way to enable the specialTokens param from a completion. Is that the case?

giladgd commented 3 months ago

@StrangeBytesDev You can pass an array of tokens to the generateCompletion function instead of a string; that way you can tokenize the input however you want.
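
For example, a minimal sketch of that approach against the version 3 beta might look like the following. The LlamaCompletion class, the contextSequence/getSequence wiring, and the maxTokens option are assumptions based on the beta API and may differ between beta releases; only tokenize with specialTokens and passing tokens to generateCompletion are confirmed above.

import {fileURLToPath} from "url";
import path from "path";
import {getLlama, LlamaCompletion} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "functionary-small-v2.2.q4_0.gguf")
});
const context = await model.createContext();
const completion = new LlamaCompletion({
    contextSequence: context.getSequence() // assumed beta API for binding a context sequence
});

// Tokenize manually with the specialTokens parameter enabled so that
// markers like <|from|> are encoded as single special tokens
const tokens = model.tokenize("<|from|>user\n<|content|>Hello", true);

// Pass the token array (instead of a string) to generateCompletion
const response = await completion.generateCompletion(tokens, {maxTokens: 64});
console.log(response);

This keeps the prompt's special tokens intact while still going through the regular completion flow.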