Closed: StrangeBytesDev closed this issue 3 months ago
@StrangeBytesDev This issue was already fixed in the version 3 beta.
In the version 3 beta, to tokenize an input with special tokens, enable the specialTokens
parameter:
import {fileURLToPath} from "url";
import path from "path";
import {getLlama} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "functionary-small-v2.2.q4_0.gguf")
});

const text = "<|from|>user\n<|content|>Hello";

// passing true as the second argument enables special token parsing
console.log("With special tokens:", model.tokenize(text, true));
console.log("Without special tokens:", model.tokenize(text));
Oh awesome, I totally missed that. I like that it's available as an option; I don't think I've seen any other library or API that offers it optionally, and I can see use cases where it would be useful to have both. I'm having a bit of a hard time getting my head around how tokenization in the generateCompletion function is handled. I'm under the impression that there currently isn't a way to enable the specialTokens parameter from a completion. Is that the case?
@StrangeBytesDev You can pass an array of tokens to the generateCompletion
function instead of a string; this way you can tokenize the input however you want.
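For example, a minimal sketch of that approach against the version 3 beta (the LlamaCompletion class and context.getSequence() come from the beta's API; the model path and the maxTokens value here are assumptions for illustration):

import {fileURLToPath} from "url";
import path from "path";
import {getLlama, LlamaCompletion} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "functionary-small-v2.2.q4_0.gguf")
});
const context = await model.createContext();
const completion = new LlamaCompletion({
    contextSequence: context.getSequence()
});

// tokenize the prompt yourself with special tokens enabled,
// then pass the token array instead of a string
const tokens = model.tokenize("<|from|>user\n<|content|>Hello", true);
const response = await completion.generateCompletion(tokens, {maxTokens: 128});
console.log(response);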
Issue description
Tokenization via LlamaContext.encode (or model.tokenize in V3) is significantly different from LlamaCPP's
Expected Behavior
Tokenizing should be consistent between the LlamaCPP server, the tokenization example, and node-llama-cpp, especially pertaining to special tokens.
Actual Behavior
Tokenizing the following string with either LlamaCPP's tokenize example or the server endpoint produces the tokens below. Model: Dolphin 2.6 Phi-2.
Input: <|im_start|>user\nHello<|im_end|>
Tokens: [ 50296, 7220, 198, 15496, 50295 ]
Tokenizing with LlamaContext.encode produces the following:
[27, 91, 320, 62, 9688, 91, 29, 7220, 198, 15496, 27, 91, 320, 62, 437, 91, 29]
Using functionary-small-v2.2.q4_0.gguf
Input: "<|from|>user\n<|content|>Hello"
LlamaCPP tokenizer or server endpoint: [ 32002, 1838, 13, 32000, 16230 ]
LlamaContext.encode: [ 523, 28766, 3211, 28766, 28767, 1838, 13, 28789, 28766, 3789, 28766, 28767, 16230 ]
I also tested with Hermes-2-Pro-Mistral-7b and observed the same behavior.
Importantly, special tokens like "<|im_start|>" are being split up into individual tokens, "<", "|", etc. This has a huge impact on how a model interprets inputs.
Steps to reproduce
1. Tokenize with node-llama-cpp (see the sketch after these steps)
2. Tokenize with LlamaCPP's tokenize example
3. Tokenize with the LlamaCPP server (start the server with functionary loaded)
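As referenced in step 1, a minimal sketch of reproducing the mismatch with the 2.x API (LlamaModel, LlamaContext, and context.encode are the 2.x entry points; the model path is an assumption):

import {fileURLToPath} from "url";
import path from "path";
import {LlamaModel, LlamaContext} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const model = new LlamaModel({
    modelPath: path.join(__dirname, "models", "functionary-small-v2.2.q4_0.gguf")
});
const context = new LlamaContext({model});

// special tokens get split into individual pieces here ("<", "|", "from", ...)
console.log(context.encode("<|from|>user\n<|content|>Hello"));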
My Environment
node-llama-cpp version: 2.8.9
Additional Context
The results above are all from 2.8.9, although I observed the same behavior with 3.0.0-beta.14.
Are you willing to resolve this issue by submitting a Pull Request?
Yes, I have the time, but I don't know how to start. I would need guidance.