withcatai / node-llama-cpp

Run AI models locally on your machine with node.js bindings for llama.cpp. Force a JSON schema on the model output on the generation level
https://withcatai.github.io/node-llama-cpp/

EOS token is not detected properly for some models after upgrading to v3.0 #169

Closed: ruochenjia closed this issue 4 months ago

ruochenjia commented 5 months ago

### Issue description

The EOS token is not detected for some models.

### Expected Behavior

model.tokens.eos should be a non-null value after loading the model, and the sequence.evaluate call should stop (exit the for await loop) without any additional break statements when generation is completed.
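
For illustration, a minimal sketch of what I expect to work, using the same model, sequence, and message variables that are set up in the reproduction code further down:

```ts
// The EOS token should be exposed on the loaded model.
console.log(model.tokens.eos); // expected: a token, not null

// The loop should end on its own once the model emits the EOS token;
// no manual string matching or break should be needed.
let response = "";
for await (const token of sequence.evaluate(model.tokenize(message, true))) {
    response += model.detokenize([token]);
}
console.log(response); // should not contain "<dummy32000>"
```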

### Actual Behavior

The generation process continues with repeated or random unrelated content, and the EOS token is printed in the generated text as <dummy32000>.

Currently, you have to manually check for the EOS token's text in the loop in order to stop generating, and model.tokens.eos is always null.

```ts
let response = "";

for await (const token of sequence.evaluate(model.tokenize(message, true), {
    topK: 40,
    topP: 0.4,
    temperature: 0.8,
    evaluationPriority: 5,
})) {
    const text = model.detokenize([token]);
    response += text;

    // Workaround: stop once the EOS token's text shows up in the output.
    if (response.includes("<dummy32000>"))
        break;
}
```

### Steps to reproduce

  1. Use the mistral-7b-openorca.gguf2.Q4_0.gguf model downloaded from the GPT4All website.
  2. Load and evaluate the model inside a worker thread with the following options (a sketch of a possible main-thread driver follows the snippet):

```ts
import * as worker from "node:worker_threads";
import {getLlama, LlamaModel, LlamaContext} from "node-llama-cpp";

const port = worker.parentPort!;
if (worker.isMainThread)
    throw new Error("Invalid worker context");

const model = new LlamaModel({
    llama: await getLlama({ cuda: true, build: "auto" }),
    useMmap: false,
    useMlock: false,
    modelPath: "./local/mistral-7b-openorca.Q4_0.gguf",
    gpuLayers: 32,
});

const context = new LlamaContext({
    model: model,
    seed: 0,
    threads: 4,
    sequences: 1,
    batchSize: 128,
    contextSize: 2048,
});

const sequence = context.getSequence();
await sequence.clearHistory();

let response = "";

// `message` is the prompt text received from the main thread (not shown in the original snippet).
for await (const token of sequence.evaluate(model.tokenize(message, true), {
    topK: 40,
    topP: 0.4,
    temperature: 0.8,
    evaluationPriority: 5,
})) {
    const text = model.detokenize([token]);
    // if ((response += text).indexOf("<dummy32000>") > 0)
    //     break;

    port.postMessage(text);
}
```
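
For completeness, a hedged sketch of a main-thread driver for the worker code above; the worker file name and the prompt text are made up for illustration and are not part of the original report, and it assumes the worker reads the incoming message into the `message` variable used above:

```ts
import {Worker} from "node:worker_threads";

// Spawn the compiled worker file containing the reproduction code above
// and print the streamed tokens it posts back.
const llamaWorker = new Worker("./llama-worker.js");
llamaWorker.on("message", (text: string) => process.stdout.write(text));

// Send the prompt; the worker is assumed to use it as `message`.
llamaWorker.postMessage("Hello, how are you?");
```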



### My Environment

| Dependency               | Version             |
| ---                      | ---                 |
| Operating System         | Linux               |
| CPU                      | AMD Ryzen 5 3600    |
| Node.js version          | 20.11.0             |
| Typescript version       | unknown             |
| `node-llama-cpp` version | 3.0.0-beta.11       |

### Additional Context

It worked correctly before, when using v2.8.7.
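
For comparison, a minimal sketch of the kind of v2.x usage that stopped at the EOS token automatically; the class names follow the v2 documentation as I recall it, so treat the exact API as an assumption rather than a verified snippet:

```ts
import {LlamaModel, LlamaContext, LlamaChatSession} from "node-llama-cpp";

// v2.8.x-style high-level usage: generation ends on EOS without manual checks.
const model = new LlamaModel({modelPath: "./local/mistral-7b-openorca.Q4_0.gguf"});
const context = new LlamaContext({model});
const session = new LlamaChatSession({context});

const answer = await session.prompt("Hello, how are you?");
console.log(answer); // no "<dummy32000>" leaking into the output
```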

### Relevant Features Used

- [ ] Metal support
- [X] CUDA support
- [ ] Grammar

### Are you willing to resolve this issue by submitting a Pull Request?

No, I don’t have the time and I’m okay to wait for the community / maintainers to resolve this issue.

giladgd commented 4 months ago

@ruochenjia I found the issue and included the fix in #175

github-actions[bot] commented 4 months ago

:tada: This issue has been resolved in version 3.0.0-beta.13 :tada:

The release is available on:

Your semantic-release bot :package::rocket: