xorbitsai / inference

Replace OpenAI GPT with another LLM in your app by changing a single line of code. Xinference gives you the freedom to use any LLM you need. With Xinference, you're empowered to run inference with any open-source language models, speech recognition models, and multimodal models, whether in the cloud, on-premises, or even on your laptop.
https://inference.readthedocs.io
Apache License 2.0

Metal GPU support via llama-cpp? #931

Open slobentanzer opened 5 months ago

slobentanzer commented 5 months ago

Hi all, great work on the software; it works beautifully on my new M3. I don't know if I overlooked it, but is there support yet for using the Apple Silicon GPU via the Metal backend, as described in https://llama-cpp-python.readthedocs.io/en/latest/install/macos/?

I couldn't find any docs about that, and I did not manage to enable GPU usage via the client (always getting 'n_gpu must be > 0 and max the number of GPUs on the system (0)'); a rough sketch of what I tried is below. It would be great if I could use the GPU via Xinference!
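
For reference, the kind of launch call I was trying looks roughly like this (the endpoint, model name, and size are placeholders, not an exact reproduction):

```python
# Rough sketch of the client call that produces the n_gpu error on my machine;
# the endpoint and model details below are placeholders.
from xinference.client import Client

client = Client("http://localhost:9997")
model_uid = client.launch_model(
    model_name="llama-2-chat",
    model_format="ggufv2",
    model_size_in_billions=7,
    quantization="Q4_K_M",
    n_gpu=1,  # this is the parameter that gets rejected: no GPUs are registered
)
```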

aresnow1 commented 5 months ago

If you launch a model in GGUF format, it will use Metal automatically without specifying n_gpu.
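
For example, something like the following should pick up Metal out of the box (the model name, size, and quantization are just examples):

```python
# Launching a GGUF-format model without n_gpu; on Apple Silicon the bundled
# llama-cpp-python backend should enable Metal automatically.
from xinference.client import Client

client = Client("http://localhost:9997")
model_uid = client.launch_model(
    model_name="llama-2-chat",
    model_format="ggufv2",
    model_size_in_billions=7,
    quantization="Q4_K_M",
    # note: no n_gpu here
)
model = client.get_model(model_uid)
print(model.generate("Hello", generate_config={"max_tokens": 16}))
```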

slobentanzer commented 5 months ago

Hi @aresnow1, thanks for the quick reply! It does use Metal, but does that automatically mean the GPU is being used? I see heavy CPU usage during inference and not much GPU usage. The running Xinference process also responds with this when I try to set n_gpu higher than 0:

The parameter `n_gpu` must be greater than 0 and not greater than the number of GPUs: 0 on the machine.

So it seems no GPUs are registered. Happy to help troubleshoot or file a PR if it is within my power.

aresnow1 commented 5 months ago

The n_gpu parameter is intended for NVIDIA GPU users and does need some additional explanation. Regarding GPU utilization, what do you observe when using llama.cpp directly? In the code snippet at line 99 of the file here, we set n_gpu_layers to 1 for Apple users.
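
In effect, that is roughly equivalent to the following llama-cpp-python call (the model path is a placeholder; n_gpu_layers >= 1 is what switches on the Metal backend):

```python
# Sketch of how llama-cpp-python offloads work to Metal on Apple Silicon;
# the model path below is a placeholder.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-2-7b-chat.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=1,  # > 0 enables the Metal backend; -1 would offload all layers
)
out = llm("Q: What is Metal on Apple Silicon? A:", max_tokens=64)
print(out["choices"][0]["text"])
```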

slobentanzer commented 5 months ago

@aresnow1 thanks for the explanation, that makes sense! Still wondering about the monitored activity, but I will do some A/B testing and get back to you.

This is not very well documented in llama-cpp; the version mentioned in the docs I quoted above is quite old, and back then Metal was only implemented for 4-bit quantised models. It is hard to find out what has changed in the meantime in terms of model support. If you have a pointer, that would be great as well.

aresnow1 commented 5 months ago

@slobentanzer Oh, that reminds me: only 4-bit quantizations, such as Q4_K_M, can be accelerated with Metal. What quantization are you using?

slobentanzer commented 5 months ago

Ah, that clears it up. I am using a range of quantisations for benchmarking purposes. I can offer to open a PR to document this better, if you'd like. It would be nice to have this information in the docs, and perhaps exposed programmatically as well. I have not been involved with the project for long, but I intend to invest more time now and could give feedback on usability on Apple Silicon (I have the top-spec M3 machine).

aresnow1 commented 5 months ago

Any PR or feedback is welcome!