giladgd closed this pull request 6 months ago
:tada: This PR is included in version 3.0.0-beta.15 :tada:
The release is available on: v3.0.0-beta.15
Your semantic-release bot :package::rocket:
:tada: This PR is included in version 3.0.0 :tada:
The release is available on:
Your semantic-release bot :package::rocket:
Description of change
* `inspect gguf` command to inspect `gguf` files
* `inspect measure` command
* `readGgufFileInfo` function
* `LlamaModel`: automatic default values for `gpuLayers` and `contextSize`. No manual configuration of those options is needed anymore to maximize performance
* `JinjaTemplateChatWrapper`
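The new `readGgufFileInfo` function and the `inspect gguf` command read metadata out of a `gguf` file. As a rough, self-contained illustration of what such parsing involves, the sketch below decodes just the fixed GGUF header fields (magic, version, tensor count, metadata key-value count) from an in-memory buffer. All names here are illustrative, not this PR's API, and the real function also parses the full metadata table:

```typescript
// Hedged sketch - not the PR's implementation.
// Decodes the fixed-size GGUF header fields from a buffer.
function readGgufHeader(buf: Buffer) {
    const magic = buf.toString("ascii", 0, 4); // should be "GGUF"
    const version = buf.readUInt32LE(4);       // gguf format version
    // tensor count and metadata KV count are uint64 in the format;
    // for simplicity this sketch reads only the low 32 bits of each
    const tensorCount = buf.readUInt32LE(8);
    const metadataKvCount = buf.readUInt32LE(16);
    return {magic, version, tensorCount, metadataKvCount};
}

// Build a fake header in memory for demonstration
const header = Buffer.alloc(24);
header.write("GGUF", 0, "ascii");
header.writeUInt32LE(3, 4);   // version 3
header.writeUInt32LE(291, 8); // tensor count (low 32 bits)
header.writeUInt32LE(24, 16); // metadata key-value count (low 32 bits)

console.log(readGgufHeader(header));
```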
* read the `tokenizer.chat_template` header from the `gguf` file when available - use it to find a better specialized chat wrapper or use `JinjaTemplateChatWrapper` with it as a fallback
* `resolveChatWrapper` function
* `chat`, `complete` and `infill` commands
* fix: `llama.cpp` CUDA flag

Fixes #133
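The chat-wrapper fallback described above can be sketched as follows. This is a hedged illustration of the selection logic only, assuming a hypothetical `pickChatWrapper` helper and a made-up known-template table - it is not the actual `resolveChatWrapper` implementation:

```typescript
// Illustrative names only - KNOWN_TEMPLATES and pickChatWrapper are not the PR's API.
const KNOWN_TEMPLATES = new Map<string, string>([
    // a recognized chat template text would map to a specialized wrapper
    ["<llama2 template text>", "Llama2ChatWrapper"]
]);

function pickChatWrapper(chatTemplate: string | undefined): string {
    if (chatTemplate == null)
        return "GeneralChatWrapper"; // no tokenizer.chat_template header in the gguf file

    const specialized = KNOWN_TEMPLATES.get(chatTemplate);
    if (specialized != null)
        return specialized; // recognized template -> better specialized chat wrapper

    return "JinjaTemplateChatWrapper"; // unknown template -> render it as a Jinja template
}

console.log(pickChatWrapper(undefined));                // no template available
console.log(pickChatWrapper("<llama2 template text>")); // recognized template
console.log(pickChatWrapper("{% custom template %}"));  // fallback case
```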
Pull-Request Checklist
* Code is up to date with the `master` branch
* `npm run format` to apply eslint formatting
* `npm run test` passes with this change
* This pull request links relevant issues as `Fixes #0000`