withcatai / node-llama-cpp

Run AI models locally on your machine with node.js bindings for llama.cpp. Enforce a JSON schema on the model output on the generation level
https://node-llama-cpp.withcat.ai
MIT License

Free VRAM programmatically instead of relying on the GC #303

Closed · IfnotFr closed this issue 1 month ago

IfnotFr commented 1 month ago

Feature Description

Currently, VRAM deallocation for LlamaContext and LlamaChatSession is done by the GC when we unset the variables containing them. But relying on the GC to free the VRAM can be complicated when we want to programmatically handle multiple contexts/sessions.

For example, in my application I need to run inference against multiple chat contexts (different prompts, histories, ...). Currently I fork a worker.js every time and rely on killing the child process to free the VRAM, but that is slow and cumbersome.
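For illustration, a rough sketch of this current workaround, assuming a hypothetical worker.js that loads the model, answers a single prompt and sends the result back to the parent:

import {fork} from "child_process"

// hypothetical helper: run one prompt in a short-lived child process,
// then kill the process so the OS reclaims the VRAM it was using
function promptInWorker(prompt) {
    return new Promise((resolve, reject) => {
        const worker = fork("./worker.js") // worker.js loads the model and creates its own context
        worker.once("message", (answer) => {
            worker.kill()
            resolve(answer)
        })
        worker.once("error", reject)
        worker.send({prompt})
    })
}

const answer = await promptInWorker("Hi there, how are you?")
console.log(`AI: ${answer}`)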

Code example filling the VRAM

... because the garbage collector does not have time to free the VRAM when the context/session variables are replaced. A workaround is to expose Node's GC and run it manually (a rough sketch of this is shown after the example below), but that depends on the environment and is not possible in my case.

import {fileURLToPath} from "url"
import path from "path"
import {getLlama, LlamaChatSession} from "node-llama-cpp"

const __dirname = path.dirname(fileURLToPath(import.meta.url))

let llama = await getLlama()
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "dolphin-2.1-mistral-7b.Q4_K_M.gguf")
})

let context
let session
let i = 0
while (true) {
  i++

  // the previous context/session are replaced here, but their VRAM is only
  // freed whenever the GC happens to collect them
  context = await model.createContext()
  session = new LlamaChatSession({
    contextSequence: context.getSequence()
  })

  const q1 = 'Hi there, how are you?'
  console.log(`${i} User: ${q1}`)

  const a1 = await session.prompt(q1)
  console.log(`${i} AI: ${a1}`)

  const q2 = 'Summarize what you said'
  console.log(`${i} User: ${q2}`)

  const a2 = await session.prompt(q2)
  console.log(`${i} AI: ${a2}`)
}

Additional note: if we sleep for about 10 seconds between iterations, the GC has time to free the VRAM. But that is not a nice solution either.
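For reference, a minimal sketch of the manual-GC workaround mentioned above; it only works when Node is started with the --expose-gc flag, which makes global.gc() available:

// run with: node --expose-gc app.js
// drop the references so the old context/session become collectable
context = undefined
session = undefined

if (global.gc) {
    global.gc() // force a collection so the native VRAM is released sooner
} else {
    console.warn("global.gc is not available; start Node with --expose-gc")
}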

The Solution

Maybe something like a LlamaContext.unload() or LlamaChatSession.unload(), letting us free the VRAM for another context/session?
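To illustrate the request, this is roughly what I would expect to call inside the loop above (unload() is hypothetical and does not exist today):

// hypothetical API, for illustration only
await session.unload() // release the session
await context.unload() // free the context's VRAM before creating the next one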

Considered Alternatives

I don't have another solution besides a method for freeing the VRAM directly from the objects instead of relying on the Node GC.

Additional Context

I have read about some related problems on the Python wrapper side.

Maybe it can be helpful?

https://github.com/abetlen/llama-cpp-python/issues/223

Related Features to This Feature Request

Are you willing to resolve this issue by submitting a Pull Request?

No, I don't have the time, but I can support development (via donations).

giladgd commented 1 month ago

There's already a .dispose() function available on all the objects that you can use:

import {fileURLToPath} from "url";
import path from "path";
import {getLlama, LlamaChatSession} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "dolphin-2.1-mistral-7b.Q4_K_M.gguf")
});
const context = await model.createContext();
console.log("VRAM usage", (await llama.getVramState()).used);

await context.dispose(); // dispose the context
console.log("VRAM usage", (await llama.getVramState()).used);

await model.dispose(); // dispose the model and all of its contexts
console.log("VRAM usage", (await llama.getVramState()).used);

You can also use await using to automatically dispose things when they go out of scope:

import {fileURLToPath} from "url";
import path from "path";
import {getLlama, LlamaChatSession} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();
{
    await using model = await llama.loadModel({
        modelPath: path.join(__dirname, "models", "dolphin-2.1-mistral-7b.Q4_K_M.gguf")
    });
    console.log("VRAM usage", (await llama.getVramState()).used);
}

// the model will be automatically disposed when this line is reached
console.log("VRAM usage", (await llama.getVramState()).used);

IfnotFr commented 1 month ago

Damn, I tried dispose with no luck. I must have done something wrong.

Thank you for the quick answer, and sorry for the dumb question.

Hope it will at least help people with the same problem as me.