System Info
CUDA 12.1, PyTorch 2.4.0, vLLM 0.5.4, SGLang 0.2.13, FlashInfer 0.1.5+cu121torch2.4
Running Xinference with Docker?
No; the traceback paths point to a local conda environment.
Version info
Xinference 0.14.2
The command used to start Xinference

export XINFERENCE_MODEL_SRC=modelscope
xinference-worker -e http://172.22.xxx.xx:9997 -H 172.22.xxx.xx --worker-port 9996 --log-level DEBUG
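As a quick sanity check that the worker registered with the supervisor, the RESTful client can be queried from Python. A minimal sketch, assuming the xinference.client API shipped with 0.14.x (the endpoint mirrors the -e flag above):

from xinference.client import Client

# Supervisor endpoint the worker connects to (same as the -e flag above).
client = Client("http://172.22.xxx.xx:9997")

# If the worker joined the cluster, this call succeeds and returns the
# currently running models (empty before any model is launched).
print(client.list_models())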
Reproduction
xinference launch --model-name qwen1.5-chat --size-in-billions 7 --model-format pytorch --quantization none -e http://172.22.xxx.xx:9997 --model-uid qwen-chat-7b --model-engine sglang --worker-ip 172.22.xxx.xx --gpu-idx 1
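The CLI call above maps onto the RESTful client's launch_model. A rough Python equivalent, sketched under the assumption that the 0.14.x client accepts the same options as the CLI flags (the same kwargs appear verbatim in the worker log below); exact parameter names may differ between versions:

from xinference.client import Client

client = Client("http://172.22.xxx.xx:9997")

# Hypothetical mirror of the `xinference launch` flags above.
model_uid = client.launch_model(
    model_name="qwen1.5-chat",
    model_engine="sglang",
    model_size_in_billions=7,
    model_format="pytorch",
    quantization="none",
    model_uid="qwen-chat-7b",
    gpu_idx=[1],
)
print(model_uid)  # expected: "qwen-chat-7b"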
The log output after running the launch command:

2024-08-20 15:30:17,353 xinference.core.worker 9939 DEBUG Enter launch_builtin_model, args: (<xinference.core.worker.WorkerActor object at 0x7f1a9535be90>,), kwargs: {'model_uid': 'qwen-chat-7b-1-0', 'model_name': 'qwen1.5-chat', 'model_size_in_billions': 7, 'model_format': 'pytorch', 'quantization': 'none', 'model_engine': 'sglang', 'model_type': 'LLM', 'n_gpu': 'auto', 'request_limits': None, 'peft_model_config': None, 'gpu_idx': [1], 'download_hub': None, 'trust_remote_code': True}
2024-08-20 15:30:17,356 xinference.core.worker 9939 INFO You specify to launch the model: qwen1.5-chat on GPU index: [1] of the worker: 172.22.227.21:9996, xinference will automatically ignore the n_gpu option.
2024-08-20 15:30:21,891 xinference.model.llm.core 9939 DEBUG Launching qwen-chat-7b-1-0 with SGLANGChatModel
2024-08-20 15:30:21,891 xinference.model.llm.llm_family 9939 INFO Caching from Modelscope: qwen/Qwen1.5-7B-Chat
2024-08-20 15:30:22,234 xinference.model.llm.llm_family 9939 INFO Cache /xxx/.xinference/cache/qwen1_5-chat-pytorch-7b exists
2024-08-20 15:30:22,268 xinference.model.llm.sglang.core 11021 INFO Loading qwen-chat-7b with following model config: {'trust_remote_code': True, 'tokenizer_mode': 'auto', 'tp_size': 1, 'mem_fraction_static': 0.88, 'log_level': 'info', 'attention_reduce_in_fp32': False}
server_args=ServerArgs(model_path='/xxx/.xinference/cache/qwen1_5-chat-pytorch-7b', tokenizer_path='/xxx/.xinference/cache/qwen1_5-chat-pytorch-7b', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', dtype='auto', trust_remote_code=True, context_length=None, quantization=None, served_model_name='/xxx/.xinference/cache/qwen1_5-chat-pytorch-7b', chat_template=None, host='127.0.0.1', port=30000, additional_ports=[30001, 30002, 30003, 30004], mem_fraction_static=0.88, max_running_requests=None, max_num_reqs=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=1.0, tp_size=1, stream_interval=1, random_seed=588843165, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, api_key=None, file_storage_pth='SGLang_storage', dp_size=1, load_balance_method='round_robin', disable_flashinfer=False, disable_flashinfer_sampling=False, disable_radix_cache=False, disable_regex_jump_forward=False, disable_cuda_graph=False, disable_disk_cache=False, enable_torch_compile=False, enable_p2p_check=False, enable_mla=False, attention_reduce_in_fp32=False, efficient_weight_load=False, nccl_init_addr=None, nnodes=1, node_rank=None)
[gpu=0] Init nccl begin.
[gpu=0] Load weight begin. avail mem=38.98 GB
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00, 1.16it/s]
[gpu=0] Load weight end. type=Qwen2ForCausalLM, dtype=torch.bfloat16, avail mem=24.50 GB
[gpu=0] Memory pool end. avail mem=4.38 GB
[gpu=0] Capture cuda graph begin. This can take up to several minutes.
[gpu=0] max_total_num_tokens=40602, max_prefill_tokens=16384, max_running_requests=2047, context_len=32768
INFO: Started server process [11198]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://127.0.0.1:30000 (Press CTRL+C to quit)
Exception in thread Thread-1 (_wait_and_warmup):
Traceback (most recent call last):
  File "/xxx/anaconda3/envs/xinference2/lib/python3.11/site-packages/requests/models.py", line 974, in json
    return complexjson.loads(self.text, **kwargs)
  File "/xxx/anaconda3/envs/xinference2/lib/python3.11/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "/xxx/anaconda3/envs/xinference2/lib/python3.11/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/xxx/anaconda3/envs/xinference2/lib/python3.11/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/xxx/anaconda3/envs/xinference2/lib/python3.11/threading.py", line 1045, in _bootstrap_inner
    self.run()
  File "/xxx/anaconda3/envs/xinference2/lib/python3.11/threading.py", line 982, in run
    self._target(*self._args, **self._kwargs)
  File "/xxx/anaconda3/envs/xinference2/lib/python3.11/site-packages/sglang/srt/server.py", line 410, in _wait_and_warmup
    model_info = res.json()
  File "/xxx/anaconda3/envs/xinference2/lib/python3.11/site-packages/requests/models.py", line 978, in json
    raise RequestsJSONDecodeError(e.msg, e.doc, e.pos)
requests.exceptions.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
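The failure is in sglang's warmup thread: _wait_and_warmup calls res.json() on a response whose body is not JSON ("Expecting value: line 1 column 1"). Probing the server that xinference started can show what the body actually contains. A small diagnostic sketch; /get_model_info is an assumption about which endpoint this sglang version polls during warmup (check around line 410 of sglang/srt/server.py to confirm):

import requests

# Ask the SGLang HTTP server (bound to 127.0.0.1:30000 per the log above)
# for its model info, and dump the raw body instead of assuming it is JSON.
res = requests.get("http://127.0.0.1:30000/get_model_info", timeout=5)
print(res.status_code)
print(res.text[:500])  # inspect what _wait_and_warmup actually received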
Launching SGLang on its own succeeds, and requests to it do not raise errors. SGLang launch command:

python -m sglang.launch_server --model-path /xxx/.cache/modelscope/hub/qwen/Qwen1_5-7B-Chat --host 0.0.0.0 --port 30000

Log output after launching:

server_args=ServerArgs(model_path='/xxx/.cache/modelscope/hub/qwen/Qwen1_5-7B-Chat', tokenizer_path='/xxx/.cache/modelscope/hub/qwen/Qwen1_5-7B-Chat', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', dtype='auto', trust_remote_code=False, context_length=None, quantization=None, served_model_name='/xxx/.cache/modelscope/hub/qwen/Qwen1_5-7B-Chat', chat_template=None, host='0.0.0.0', port=30000, additional_ports=[30001, 30002, 30003, 30004], mem_fraction_static=0.88, max_running_requests=None, max_num_reqs=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=1.0, tp_size=1, stream_interval=1, random_seed=914589571, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, api_key=None, file_storage_pth='SGLang_storage', dp_size=1, load_balance_method='round_robin', disable_flashinfer=False, disable_flashinfer_sampling=False, disable_radix_cache=False, disable_regex_jump_forward=False, disable_cuda_graph=False, disable_disk_cache=False, enable_torch_compile=False, enable_p2p_check=False, enable_mla=False, attention_reduce_in_fp32=False, efficient_weight_load=False, nccl_init_addr=None, nnodes=1, node_rank=None)
[gpu=0] Init nccl begin.
[gpu=0] Load weight begin. avail mem=38.98 GB
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:04<00:00, 1.20s/it]
[gpu=0] Load weight end. type=Qwen2ForCausalLM, dtype=torch.bfloat16, avail mem=24.50 GB
[gpu=0] Memory pool end. avail mem=4.38 GB
[gpu=0] Capture cuda graph begin. This can take up to several minutes.
[gpu=0] max_total_num_tokens=40602, max_prefill_tokens=16384, max_running_requests=2047, context_len=32768
INFO: Started server process [5491]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:30000 (Press CTRL+C to quit)
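The kind of request that succeeds against the standalone server looks roughly like the following; the OpenAI-compatible route and the model field (echoing the served model path) are assumptions based on this sglang version's defaults:

import requests

# End-to-end request against the standalone SGLang server started above.
# The model path is copied from the launch command; "/xxx" is elided in
# this report and stands in for the real home directory.
resp = requests.post(
    "http://127.0.0.1:30000/v1/chat/completions",
    json={
        "model": "/xxx/.cache/modelscope/hub/qwen/Qwen1_5-7B-Chat",
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 32,
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])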
Expected behavior
A model launched with the SGLang engine through Xinference should start and serve requests, just as the standalone SGLang server does.