withcatai / node-llama-cpp

Run AI models locally on your machine with node.js bindings for llama.cpp. Enforce a JSON schema on the model output on the generation level
https://node-llama-cpp.withcat.ai
MIT License

Compiling llama.cpp for CUDA is single-threaded #311

Closed · B3none closed this 2 months ago

B3none commented 2 months ago

I ran `npx --no node-llama-cpp download --cuda` and it takes a seriously long time to compile, seemingly because it's only running on one thread.

Is there anything I can do to speed it up?

giladgd commented 2 months ago

It takes a long time because llama.cpp compiles many templates as part of its inference performance optimizations. Nothing can be done to shorten it other than removing support for some GGUF file formats (which is undesirable outside of the development of llama.cpp itself), and the compilation time mainly depends on your hardware. It will only get slower over time as llama.cpp adds support for new features and model architectures.

I recommend switching to the version 3 beta, which ships with prebuilt binaries that you can use without compiling anything.
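
For reference, here is a minimal sketch of what using the version 3 beta with the prebuilt CUDA binaries might look like. It assumes the package is installed with `npm install node-llama-cpp@beta`; the `gpu: "cuda"` option, the API names, and the model filename are taken as assumptions from the v3 beta docs rather than being a definitive example:

```typescript
// Minimal sketch, assuming the v3 beta API (getLlama, LlamaChatSession)
// and the `gpu: "cuda"` option; verify against the current docs.
import {fileURLToPath} from "url";
import path from "path";
import {getLlama, LlamaChatSession} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

// Uses the prebuilt CUDA binaries shipped with the v3 beta,
// so no local compilation step is needed.
const llama = await getLlama({gpu: "cuda"});

// Path to a local GGUF model file (hypothetical filename).
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "model.gguf")
});

const context = await model.createContext();
const session = new LlamaChatSession({
    contextSequence: context.getSequence()
});

const answer = await session.prompt("Hi there, how are you?");
console.log(answer);
```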