xorbitsai / inference

Replace OpenAI GPT with another LLM in your app by changing a single line of code. Xinference gives you the freedom to use any LLM you need. With Xinference, you're empowered to run inference with any open-source language models, speech recognition models, and multimodal models, whether in the cloud, on-premises, or even on your laptop.
https://inference.readthedocs.io
Apache License 2.0
4.7k stars 368 forks

QUESTION: multi-threaded generation #536

Closed: qxpf666 closed this issue 3 weeks ago

qxpf666 commented 10 months ago

Is there an option like `xinference --threads 100`? Something like: `--threads N, -t N: Set the number of threads to use during generation.`

UranusSeven commented 10 months ago

Hi!

You can set it using the parameter n_threads. Please note that multithreading only applies to models running with the GGML backend.

By default, the number of threads is set to half of your CPU count: max(multiprocessing.cpu_count() // 2, 1).

Here's an example:

from xinference.client import RESTfulClient

# Connect to a running Xinference endpoint
client = RESTfulClient("http://127.0.0.1:9997")

# Launch a GGML-format model, overriding the default thread count
model_uid = client.launch_model(
    model_name="baichuan",
    model_format="ggmlv3",
    size_in_billions=7,
    n_threads=4,
)

# Retrieve the launched model and run generation
model = client.get_model(model_uid)
print(model.generate("What is the largest animal in the world?", generate_config={"max_tokens": 128}))
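For reference, here is a minimal sketch of how the default thread count described above is derived; it only uses the standard library, so you can run it to see what Xinference would pick on your machine when n_threads is not specified:

```python
import multiprocessing

# Default used when n_threads is not passed:
# half of the logical CPU count, but never less than 1
default_n_threads = max(multiprocessing.cpu_count() // 2, 1)
print(default_n_threads)
```

On a single-core machine this still yields 1, so generation always has at least one worker thread.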
github-actions[bot] commented 4 weeks ago

This issue is stale because it has been open for 7 days with no activity.

github-actions[bot] commented 3 weeks ago

This issue was closed because it has been inactive for 5 days since being marked as stale.