slobentanzer opened this issue 5 months ago
If you launch a model in GGUF format, it will use Metal automatically without specifying `n_gpu`.
Hi @aresnow1, thanks for the quick reply! It does use Metal, but does that automatically mean it uses the GPUs? I see heavy CPU usage during inference and not much GPU. The running Xinference process also responds with this when I try to set `n_gpu` higher than 0:
The parameter `n_gpu` must be greater than 0 and not greater than the number of GPUs: 0 on the machine.
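For context, a minimal sketch of the kind of validation that would produce the error above: if the detected GPU count is 0, every positive `n_gpu` value is rejected. The function name and structure here are illustrative assumptions, not Xinference's actual internals.

```python
def validate_n_gpu(n_gpu: int, gpu_count: int) -> int:
    """Reject n_gpu values outside the range 1..gpu_count.

    Hypothetical sketch of the check behind the error message quoted
    above; with gpu_count == 0, any positive n_gpu fails.
    """
    if n_gpu <= 0 or n_gpu > gpu_count:
        raise ValueError(
            f"The parameter `n_gpu` must be greater than 0 and not "
            f"greater than the number of GPUs: {gpu_count} on the machine."
        )
    return n_gpu
```

On a machine where the registered GPU count is 0, as reported here for Apple Silicon, this check can never pass, which matches the behavior described.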
So it seems there are no GPUs registered. Happy to help troubleshoot or file a PR if it is within my power.
The `n_gpu` parameter is provided for NVIDIA GPU users and does require some additional explanation. Regarding GPU utilization, what have you observed when using llama.cpp directly? In the code snippet at line 99 of the file here, we set `n_gpu_layers` to 1 for Apple users.
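The platform-dependent default described above can be sketched as follows. This is a hedged illustration, assuming the behavior stated in this thread (Metal offload enabled with `n_gpu_layers=1` on Apple machines, NVIDIA users steering offload via `n_gpu`); `default_n_gpu_layers` is a hypothetical helper, not Xinference code.

```python
import platform
from typing import Optional


def default_n_gpu_layers(system: Optional[str] = None) -> int:
    """Choose a default llama.cpp n_gpu_layers value per platform.

    Illustrative sketch: on macOS ("Darwin"), offload one layer so the
    Metal backend is active; elsewhere default to CPU-only and let
    NVIDIA users set offloading explicitly.
    """
    system = system or platform.system()
    if system == "Darwin":  # Apple Silicon / Metal
        return 1
    return 0  # CPU-only default on other platforms
```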
@aresnow1 thanks for the explanation, that makes sense! Still wondering about the monitored activity, but I will do some A/B testing and get back to you.
It is not super well documented in llama.cpp; the version mentioned in the docs I quoted above is quite old, and back then Metal was only implemented for 4-bit quantised models. It is hard to find out what has happened since in terms of model support. If you have a pointer, that would be great as well.
@slobentanzer Oh, that reminds me, only 4-bit quantization can be accelerated with Metal, like Q4_K_M. What kind of quantization are you using?
Ah, that clears it up. I am using a range of quantisations for benchmarking purposes. I can offer to do a PR for documenting this better, if you'd like. It would be nice to have this information in the docs, and maybe even programmatically. I have not been involved with this for long, but I intend to invest more time now, and could give feedback on usability on Apple Silicon (I have the biggest M3 machine).
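If, as stated above, only 4-bit quantisations (e.g. Q4_K_M) are Metal-accelerated, that could be checked programmatically from the llama.cpp quantization name, whose leading digits encode the bit width. A minimal sketch; `is_metal_accelerated` simply encodes the claim made in this thread and is not an authoritative support matrix.

```python
import re
from typing import Optional


def quant_bits(quantization: str) -> Optional[int]:
    """Extract the bit width from names like 'Q4_K_M' or 'Q8_0'."""
    m = re.match(r"Q(\d+)", quantization.upper())
    return int(m.group(1)) if m else None


def is_metal_accelerated(quantization: str) -> bool:
    """Per the claim above: only 4-bit quantisations use Metal."""
    return quant_bits(quantization) == 4
```

Something like this could back a documented (or runtime) hint when a user launches a non-4-bit GGUF model on Apple Silicon.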
Any PR or feedback is welcome!
Hi all, great work on the software, it works beautifully on my new M3. I don't know if I overlooked it, but is there support yet for using the Apple Silicon GPU via the Metal driver, as in https://llama-cpp-python.readthedocs.io/en/latest/install/macos/?
I couldn't find docs about that, and did not manage to enable GPU usage via the client (always getting 'n_gpu must be > 0 and at most the number of GPUs on the system (0)'). It would be great if I could use the GPUs via Xinference!
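One plausible reason the client reports 0 GPUs on an M-series Mac is GPU counting that only looks for NVIDIA/CUDA devices (for instance via `nvidia-smi`), which finds nothing even though Metal is available. The sketch below illustrates that failure mode under that assumption; it is not how Xinference actually enumerates devices.

```python
import shutil
import subprocess


def cuda_gpu_count() -> int:
    """Count NVIDIA GPUs via nvidia-smi; 0 when the tool is absent.

    On Apple Silicon there is no nvidia-smi, so a CUDA-only count like
    this returns 0 even though a Metal-capable GPU exists.
    """
    if shutil.which("nvidia-smi") is None:
        return 0  # no NVIDIA tooling, e.g. on Apple Silicon
    result = subprocess.run(
        ["nvidia-smi", "--list-gpus"], capture_output=True, text=True
    )
    if result.returncode != 0:
        return 0
    return len([line for line in result.stdout.splitlines() if line.strip()])
```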