xorbitsai / inference

Replace OpenAI GPT with another LLM in your app by changing a single line of code. Xinference gives you the freedom to use any LLM you need. With Xinference, you're empowered to run inference with any open-source language models, speech recognition models, and multimodal models, whether in the cloud, on-premises, or even on your laptop.
https://inference.readthedocs.io
Apache License 2.0
4.7k stars 368 forks

QUESTION: multi-threaded generation #536

Closed: qxpf666 closed this issue 3 weeks ago

qxpf666 commented 10 months ago

Is there an option like `xinference --threads 100`? Something like: `--threads N, -t N: Set the number of threads to use during generation.`

UranusSeven commented 10 months ago

Hi!

You can set it using the parameter n_threads. Please note that multithreading only applies to models running with the GGML backend.

By default, the number of threads is set to half of your CPU count: max(multiprocessing.cpu_count() // 2, 1).

Here's an example:

from xinference.client import RESTfulClient

# Connect to a running Xinference endpoint
client = RESTfulClient("http://127.0.0.1:9997")

# Launch a GGML-format model, overriding the default thread count
model_uid = client.launch_model(
    model_name="baichuan",
    model_format="ggmlv3",
    size_in_billions=7,
    n_threads=4,
)

# Retrieve the launched model and run generation
model = client.get_model(model_uid)
print(model.generate("What is the largest animal in the world?", generate_config={"max_tokens": 128}))
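For reference, here is a minimal sketch of how the default thread count described above is derived; it only uses the standard library, so you can run it to see what Xinference would pick on your machine when n_threads is not specified:

```python
import multiprocessing

# Default used when n_threads is not passed:
# half of the logical CPU count, but never less than 1
default_n_threads = max(multiprocessing.cpu_count() // 2, 1)
print(default_n_threads)
```

On a single-core machine this still yields 1, so generation always has at least one worker thread.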
github-actions[bot] commented 4 weeks ago

This issue is stale because it has been open for 7 days with no activity.

github-actions[bot] commented 3 weeks ago

This issue was closed because it has been inactive for 5 days since being marked as stale.