xorbitsai / inference

Replace OpenAI GPT with another LLM in your app by changing a single line of code. Xinference gives you the freedom to use any LLM you need. With Xinference, you're empowered to run inference with any open-source language models, speech recognition models, and multimodal models, whether in the cloud, on-premises, or even on your laptop.
https://inference.readthedocs.io
Apache License 2.0

BUG: Multi-GPU not supported when loading GGUF format #1570

Closed: rickywu closed this issue 3 weeks ago

rickywu commented 2 months ago

Describe the bug

qwen1_5-14b-chat-q8_0.gguf needs about 17 GB of GPU memory. I have two T4s, each with 16 GB of GPU memory. When launching, it fails with `failed to create llama context`. If I switch to q4_0, which needs about 8 GB of GPU memory, it works well.
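
For rough scale, here is a back-of-envelope estimate of the GGUF weight footprints; a sketch only, with approximate bits-per-weight figures, ignoring the KV cache and llama.cpp context buffers:

```python
# Back-of-envelope weight-size estimate (approximate bits-per-weight for
# GGUF quantizations; excludes KV cache and llama.cpp context buffers).
params = 14e9  # roughly the Qwen1.5-14B parameter count
for name, bits_per_weight in [("q8_0", 8.5), ("q4_0", 4.5)]:
    gib = params * bits_per_weight / 8 / 1024**3
    print(f"{name}: ~{gib:.1f} GiB of weights")
```

This gives roughly 14 GiB for q8_0 and 7 GiB for q4_0; with context buffers on top, q8_0 overflows a single 16 GB T4 while q4_0 fits, which matches the numbers above.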

To Reproduce

Register the model and launch qwen1_5-14b-chat-q8_0.gguf.

Docker image: xinference:v0.11.2

Expected behavior

The model launches without error.


qinxuye commented 2 months ago

For now, you can try the command line: `xinference launch qwen1.5-chat xxx --n-gpu 2`.
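
If you prefer the Python client to the CLI, a minimal sketch of the same launch, assuming the v0.11.x RESTful client API; the model name and size here are placeholders for the registered custom model, and parameter names may differ slightly across versions:

```python
# Rough Python-client equivalent of the CLI workaround above.
from xinference.client import Client

client = Client("http://localhost:9997")  # default Xinference endpoint
model_uid = client.launch_model(
    model_name="qwen1.5-chat",     # name the model was registered under
    model_format="ggufv2",
    model_size_in_billions=14,
    quantization="q8_0",
    n_gpu=2,                       # spread the model across both T4s
)
print(model_uid)
```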

github-actions[bot] commented 3 weeks ago

This issue is stale because it has been open for 7 days with no activity.

skytodmoon commented 4 days ago

Mark: I have 8× 3090 (24 GB each), but deploying through the Xinference UI fails with:

```
Server error: 503 - [address=0.0.0.0:38545, pid=1080339] CUDA out of memory. Tried to allocate 28.00 MiB. GPU 0 has a total capacity of 23.68 GiB of which 14.81 MiB is free. Process 1077237 has 9.44 GiB memory in use. Process 1077362 has 9.81 GiB memory in use. Including non-PyTorch memory, this process has 4.40 GiB memory in use. Of the allocated memory 3.73 GiB is allocated by PyTorch, and 269.81 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
```
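
The traceback shows GPU 0 already hosting two other processes (about 19 GiB combined), so the new worker has almost no memory to allocate on that card. A quick diagnostic sketch, assuming PyTorch with CUDA support is installed, to find cards with free memory before picking devices (for example via `CUDA_VISIBLE_DEVICES`):

```python
# Diagnostic sketch (assumes PyTorch with CUDA): print free/total memory per
# GPU so an unoccupied card can be chosen, e.g. via CUDA_VISIBLE_DEVICES.
import torch

for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)  # returns (free, total) in bytes
    print(f"GPU {i}: {free / 1024**3:.2f} GiB free / {total / 1024**3:.2f} GiB total")
```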