Closed IfnotFr closed 1 month ago
There's already a `.dispose()` function available on all the objects that you can use:
```javascript
import {fileURLToPath} from "url";
import path from "path";
import {getLlama, LlamaChatSession} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "dolphin-2.1-mistral-7b.Q4_K_M.gguf")
});
const context = await model.createContext();

console.log("VRAM usage", (await llama.getVramState()).used);

await context.dispose(); // dispose the context
console.log("VRAM usage", (await llama.getVramState()).used);

await model.dispose(); // dispose the model and all of its contexts
console.log("VRAM usage", (await llama.getVramState()).used);
```
You can also use `await using` to automatically dispose objects when they go out of scope:
```javascript
import {fileURLToPath} from "url";
import path from "path";
import {getLlama, LlamaChatSession} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();

{
    await using model = await llama.loadModel({
        modelPath: path.join(__dirname, "models", "dolphin-2.1-mistral-7b.Q4_K_M.gguf")
    });
    console.log("VRAM usage", (await llama.getVramState()).used);
}

// the model will be automatically disposed when this line is reached
console.log("VRAM usage", (await llama.getVramState()).used);
```
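For context, `await using` comes from the JavaScript explicit resource management proposal: the object must expose a `[Symbol.asyncDispose]()` method, which the runtime awaits when the binding leaves its scope. Here is a minimal, GPU-free sketch of that contract (the `FakeModel` class is invented for illustration and doesn't touch any VRAM):

```javascript
// Toy resource mimicking the dispose contract of node-llama-cpp objects.
// (FakeModel is invented for illustration; it does not touch any VRAM.)
class FakeModel {
    constructor() {
        this.disposed = false;
    }

    async dispose() {
        this.disposed = true; // a real model would release VRAM here
    }

    // `await using model = ...` awaits this method when the block exits.
    // Symbol.asyncDispose is available in Node.js >= 20.4.
    async [Symbol.asyncDispose]() {
        await this.dispose();
    }
}

async function demo() {
    const model = new FakeModel();
    try {
        // ... use the model ...
    } finally {
        // roughly what `await using` inserts for you at the end of the scope
        await model[Symbol.asyncDispose]();
    }
    return model.disposed;
}

demo().then((disposed) => console.log("disposed:", disposed)); // logs "disposed: true"
```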
Damn, I tried `.dispose()` with no luck; I must have done something wrong.
Thank you for this rapid answer, and sorry for the dumb question.
Hope it will at least help people with the same problem as me.
Feature Description
Currently, `LlamaContext` and `LlamaChatSession` VRAM deallocation is done by the GC when we unset the variables containing them. But relying on the GC to free the VRAM can be complicated if we want to programmatically handle multiple contexts/sessions. For example, in my application I need to run inferences against multiple chat contexts (different prompts, histories, ...). Currently I fork a `worker.js` every time and rely on killing the child process to free the VRAM, but that is slow and cumbersome.
Code example filling the VRAM
... because the garbage collector does not have time to free the VRAM when the `context`/`session` variables are replaced. We can work around this by exposing the Node GC and running it manually, but that depends on the environment (and in my case is not possible).
Additional note: if we sleep for about 10 seconds between each loop iteration, the GC has time to free the VRAM. But that is not a really nice solution either.
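The VRAM-filling loop itself isn't reproduced here, but the failure mode can be shown with a toy simulation (the allocation counter and `createContext` helper below are invented for illustration; no real node-llama-cpp calls are made):

```javascript
// Simulated VRAM pool: counts how many "contexts" are currently allocated.
// (createContext and the counter are invented; no real node-llama-cpp calls.)
let allocated = 0;

function createContext() {
    allocated += 1; // stands in for VRAM being reserved
    return {
        dispose() {
            allocated -= 1; // stands in for VRAM being freed
        }
    };
}

// Pattern from the issue: the variable is replaced on each iteration,
// so old contexts linger until the GC happens to collect them.
let context = null;
for (let i = 0; i < 5; i++) {
    context = createContext();
}
console.log("leaky loop, still allocated:", allocated); // 5

// With an explicit dispose()/unload(), usage stays bounded
// regardless of when the GC runs.
allocated = 0;
context = null;
for (let i = 0; i < 5; i++) {
    if (context !== null) context.dispose();
    context = createContext();
}
context.dispose();
console.log("explicit dispose, still allocated:", allocated); // 0
```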
The Solution
Maybe having something like a `LlamaContext.unload()` or `LlamaChatSession.unload()`, letting us free the VRAM for another context/session?
Considered Alternatives
I don't see any alternative to having a method for unloading the VRAM directly from the objects instead of relying on the Node GC.
Additional Context
I have read about some related problems on the Python wrapper side.
Maybe it can be helpful?
https://github.com/abetlen/llama-cpp-python/issues/223
Related Features to This Feature Request
Are you willing to resolve this issue by submitting a Pull Request?
No, I don't have the time, but I can support development (via donations).