xorbitsai / inference

Replace OpenAI GPT with another LLM in your app by changing a single line of code. Xinference gives you the freedom to use any LLM you need. With Xinference, you're empowered to run inference with any open-source language models, speech recognition models, and multimodal models, whether in the cloud, on-premises, or even on your laptop.
https://inference.readthedocs.io
Apache License 2.0

vLLM inference is slow, GPU KV cache usage: 0.1% #2227

Closed JinCheng666 closed 1 month ago

JinCheng666 commented 2 months ago

System Info / 系統信息

ubuntu 20.04 python 3.10

➜  ~ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0

➜  ~ nvidia-smi
Wed Sep  4 15:22:35 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                        On | 00000000:00:0D.0 Off |                    0 |
| N/A   40C    P8                9W /  70W|      9MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      1542      G   /usr/lib/xorg/Xorg                            4MiB |
+---------------------------------------------------------------------------------------+

All other related pip dependencies were installed with the following three commands:

pip install "xinference[vllm]"
pip install "xinference[transformers]"
pip install sentence-transformers

transformers                   4.43.4
transformers-stream-generator  0.0.5
vllm                           0.5.5
vllm-flash-attn                2.6.1
xinference                     0.14.4.post1

Running Xinference with Docker? / 是否使用 Docker 运行 Xinference?

Version info / 版本信息

xinference 0.14.4.post1

The command used to start Xinference / 用以启动 xinference 的命令

nohup xinference-local --host 0.0.0.0 --port 49640 --log-level DEBUG &

(screenshot omitted) In addition, the bge-large-zh-v1.5 and bge-reranker-v2-m3 models are running in CPU mode.

Reproduction / 复现过程

qwen2-instruct-gptq-7b-Int4 has always been served with the vLLM inference framework (GPTQ Int4), and it runs normally most of the time. When inference speed is normal, this machine averages roughly 25 tokens per second.
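For reference, the model is launched through the Xinference Python client roughly as follows. This is only a sketch: the endpoint comes from the startup command above, and the parameter names mirror the launch_builtin_model kwargs visible in the worker log below.

from xinference.client import Client

# Endpoint matches the xinference-local command above (port 49640).
client = Client("http://localhost:49640")

# Parameter names mirror the worker log's launch_builtin_model kwargs.
model_uid = client.launch_model(
    model_uid="qwen2:7b",
    model_name="qwen2-instruct",
    model_engine="vLLM",
    model_format="gptq",
    model_size_in_billions=7,
    quantization="Int4",
)
print(model_uid)  # e.g. "qwen2:7b"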

Occasionally, after the model is launched, inference stays abnormally slow: roughly 0.3 tokens per second on average, GPU KV cache usage sits at only 0.1%, and the GPU is essentially idle. In that state, launching the PyTorch version of the model with the transformers framework instead still reaches about 5 tokens per second.
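One way to reproduce the tokens-per-second measurement is to time a streamed response from the OpenAI-compatible /v1/chat/completions endpoint. The sketch below is only illustrative; the openai client usage, base_url, and model uid are assumptions based on the setup above.

import time
from openai import OpenAI

# Assumed endpoint and model uid, matching the startup command and launch log.
client = OpenAI(base_url="http://localhost:49640/v1", api_key="not-used")

start = time.time()
pieces = 0
stream = client.chat.completions.create(
    model="qwen2:7b",
    messages=[{"role": "user", "content": "hi"}],
    max_tokens=512,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        pieces += 1  # roughly one token per streamed delta
elapsed = time.time() - start
print(f"~{pieces / elapsed:.1f} tokens/s over {elapsed:.1f} s")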

Wed Sep  4 15:29:53 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                        On | 00000000:00:0D.0 Off |                    0 |
| N/A   49C    P0               28W /  70W|  12869MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      1542      G   /usr/lib/xorg/Xorg                            4MiB |
|    0   N/A  N/A   4009680      C   ...s/xinference0.14.4.post1/bin/python    12846MiB |
+---------------------------------------------------------------------------------------+
2024-09-04 15:02:08,756 xinference.core.supervisor 3657471 DEBUG    Enter list_model_registrations, args: (<xinference.core.supervisor.SupervisorActor object at 0x7fd753553290>, 'LLM'), kwargs: {'detailed': True}
2024-09-04 15:02:08,760 uvicorn.access 3657230 INFO     117.136.44.179:40412 - "GET /v1/cluster/devices HTTP/1.1" 200
2024-09-04 15:02:08,861 xinference.core.supervisor 3657471 DEBUG    Leave list_model_registrations, elapsed time: 0 s
2024-09-04 15:02:08,871 uvicorn.access 3657230 INFO     117.136.44.179:40411 - "GET /v1/model_registrations/LLM?detailed=true HTTP/1.1" 200
2024-09-04 15:02:12,202 xinference.core.supervisor 3657471 DEBUG    Enter query_engines_by_model_name, args: (<xinference.core.supervisor.SupervisorActor object at 0x7fd753553290>, 'qwen2-instruct'), kwargs: {}
2024-09-04 15:02:12,202 xinference.core.worker 3657471 DEBUG    Enter query_engines_by_model_name, args: (<xinference.core.worker.WorkerActor object at 0x7fd75c783ab0>, 'qwen2-instruct'), kwargs: {}
2024-09-04 15:02:12,203 xinference.core.worker 3657471 DEBUG    Leave query_engines_by_model_name, elapsed time: 0 s
2024-09-04 15:02:12,203 xinference.core.supervisor 3657471 DEBUG    Leave query_engines_by_model_name, elapsed time: 0 s
2024-09-04 15:02:12,204 uvicorn.access 3657230 INFO     117.136.44.179:40413 - "GET /v1/engines/qwen2-instruct HTTP/1.1" 200
2024-09-04 15:02:33,653 xinference.core.supervisor 3657471 DEBUG    Enter launch_builtin_model, model_uid: qwen2:7b, model_name: qwen2-instruct, model_size: 7, model_format: gptq, quantization: Int4, replica: 1, kwargs: {}
2024-09-04 15:02:33,654 xinference.core.worker 3657471 DEBUG    Enter get_model_count, args: (<xinference.core.worker.WorkerActor object at 0x7fd75c783ab0>,), kwargs: {}
2024-09-04 15:02:33,654 xinference.core.worker 3657471 DEBUG    Leave get_model_count, elapsed time: 0 s
2024-09-04 15:02:33,654 xinference.core.worker 3657471 DEBUG    Enter launch_builtin_model, args: (<xinference.core.worker.WorkerActor object at 0x7fd75c783ab0>,), kwargs: {'model_uid': 'qwen2:7b-1-0', 'model_name': 'qwen2-instruct', 'model_size_in_billions': 7, 'model_format': 'gptq', 'quantization': 'Int4', 'model_engine': 'vLLM', 'model_type': 'LLM', 'n_gpu': 1, 'request_limits': None, 'peft_model_config': None, 'gpu_idx': None, 'download_hub': 'modelscope', 'model_path': None}
2024-09-04 15:02:33,654 xinference.core.worker 3657471 DEBUG    GPU selected: [0] for model qwen2:7b-1-0
2024-09-04 15:02:41,707 xinference.model.llm.core 3657471 DEBUG    Launching qwen2:7b-1-0 with VLLMChatModel
2024-09-04 15:02:41,707 xinference.model.llm.llm_family 3657471 INFO     Caching from Modelscope: qwen/Qwen2-7B-Instruct-GPTQ-Int4
2024-09-04 15:02:41,730 xinference.model.llm.vllm.core 3980157 INFO     Loading qwen2:7b with following model config: {'tokenizer_mode': 'auto', 'trust_remote_code': True, 'tensor_parallel_size': 1, 'block_size': 16, 'swap_space': 4, 'gpu_memory_utilization': 0.9, 'max_num_seqs': 256, 'quantization': None, 'max_model_len': 4096}Enable lora: False. Lora count: 0.
2024-09-04 15:02:41,734 transformers.configuration_utils 3980157 INFO     loading configuration file /root/.xinference/cache/qwen2-instruct-gptq-7b-Int4/config.json
2024-09-04 15:02:41,735 transformers.configuration_utils 3980157 INFO     Model config Qwen2Config {
  "_name_or_path": "/root/.xinference/cache/qwen2-instruct-gptq-7b-Int4",
  "architectures": [
    "Qwen2ForCausalLM"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 151643,
  "eos_token_id": 151645,
  "hidden_act": "silu",
  "hidden_size": 3584,
  "initializer_range": 0.02,
  "intermediate_size": 18944,
  "max_position_embeddings": 32768,
  "max_window_layers": 28,
  "model_type": "qwen2",
  "num_attention_heads": 28,
  "num_hidden_layers": 28,
  "num_key_value_heads": 4,
  "quantization_config": {
    "batch_size": 1,
    "bits": 4,
    "block_name_to_quantize": null,
    "cache_block_outputs": true,
    "damp_percent": 0.1,
    "dataset": null,
    "desc_act": false,
    "exllama_config": {
      "version": 1
    },
    "group_size": 128,
    "max_input_length": null,
    "model_seqlen": null,
    "module_name_preceding_first_block": null,
    "modules_in_block_to_quantize": null,
    "pad_token_id": null,
    "quant_method": "gptq",
    "sym": true,
    "tokenizer": null,
    "true_sequential": true,
    "use_cuda_fp16": false,
    "use_exllama": true
  },
  "rms_norm_eps": 1e-06,
  "rope_theta": 1000000.0,
  "sliding_window": null,
  "tie_word_embeddings": false,
  "torch_dtype": "float16",
  "transformers_version": "4.43.4",
  "use_cache": true,
  "use_sliding_window": false,
  "vocab_size": 152064
}

2024-09-04 15:02:41,736 transformers.models.auto.image_processing_auto 3980157 INFO     Could not locate the image processor configuration file, will try to use the model config instead.
2024-09-04 15:02:41,751 vllm.config  3980157 WARNING  gptq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
2024-09-04 15:02:41,753 vllm.engine.llm_engine 3980157 INFO     Initializing an LLM engine (v0.5.5) with config: model='/root/.xinference/cache/qwen2-instruct-gptq-7b-Int4', speculative_config=None, tokenizer='/root/.xinference/cache/qwen2-instruct-gptq-7b-Int4', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=gptq, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/root/.xinference/cache/qwen2-instruct-gptq-7b-Int4, use_v2_block_manager=False, enable_prefix_caching=False)
2024-09-04 15:02:41,764 transformers.tokenization_utils_base 3980157 INFO     loading file vocab.json
2024-09-04 15:02:41,764 transformers.tokenization_utils_base 3980157 INFO     loading file merges.txt
2024-09-04 15:02:41,764 transformers.tokenization_utils_base 3980157 INFO     loading file tokenizer.json
2024-09-04 15:02:41,764 transformers.tokenization_utils_base 3980157 INFO     loading file added_tokens.json
2024-09-04 15:02:41,764 transformers.tokenization_utils_base 3980157 INFO     loading file special_tokens_map.json
2024-09-04 15:02:41,764 transformers.tokenization_utils_base 3980157 INFO     loading file tokenizer_config.json
2024-09-04 15:02:42,019 transformers.tokenization_utils_base 3980157 INFO     Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
2024-09-04 15:02:42,041 transformers.generation.configuration_utils 3980157 INFO     loading configuration file /root/.xinference/cache/qwen2-instruct-gptq-7b-Int4/generation_config.json
2024-09-04 15:02:42,042 transformers.generation.configuration_utils 3980157 INFO     Generate config GenerationConfig {
  "bos_token_id": 151643,
  "do_sample": true,
  "eos_token_id": [
    151645,
    151643
  ],
  "pad_token_id": 151643,
  "repetition_penalty": 1.05,
  "temperature": 0.7,
  "top_k": 20,
  "top_p": 0.8
}

2024-09-04 15:02:42,125 vllm.attention.selector 3980157 INFO     Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
2024-09-04 15:02:42,125 vllm.attention.selector 3980157 INFO     Using XFormers backend.
2024-09-04 15:02:45,315 vllm.distributed.parallel_state 3980157 DEBUG    world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://192.168.0.149:57053 backend=nccl
2024-09-04 15:02:45,396 vllm.worker.model_runner 3980157 INFO     Starting to load model /root/.xinference/cache/qwen2-instruct-gptq-7b-Int4...
2024-09-04 15:02:45,423 vllm.attention.selector 3980157 INFO     Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
2024-09-04 15:02:45,423 vllm.attention.selector 3980157 INFO     Using XFormers backend.
2024-09-04 15:03:07,027 vllm.worker.model_runner 3980157 INFO     Loading model weights took 5.2035 GB
2024-09-04 15:03:10,265 vllm.executor.gpu_executor 3980157 INFO     # GPU blocks: 5924, # CPU blocks: 4681
2024-09-04 15:03:13,679 vllm.worker.model_runner 3980157 INFO     Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
2024-09-04 15:03:13,680 vllm.worker.model_runner 3980157 INFO     CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
2024-09-04 15:03:37,337 vllm.worker.model_runner 3980157 INFO     Graph capturing finished in 24 secs.
2024-09-04 15:03:37,375 xinference.model.llm.vllm.core 3980157 DEBUG    Begin to check health of vLLM
2024-09-04 15:03:37,376 vllm.engine.async_llm_engine 3980157 DEBUG    Starting health check...
2024-09-04 15:03:37,376 vllm.engine.async_llm_engine 3980157 DEBUG    Health check took 0.000090s
2024-09-04 15:03:37,376 xinference.core.worker 3657471 DEBUG    Leave launch_builtin_model, elapsed time: 63 s
2024-09-04 15:03:37,377 uvicorn.access 3657230 INFO     117.136.44.179:40414 - "POST /v1/models HTTP/1.1" 200
2024-09-04 15:03:37,514 xinference.core.supervisor 3657471 DEBUG    Enter list_models, args: (<xinference.core.supervisor.SupervisorActor object at 0x7fd753553290>,), kwargs: {}
2024-09-04 15:03:37,514 xinference.core.worker 3657471 DEBUG    Enter list_models, args: (<xinference.core.worker.WorkerActor object at 0x7fd75c783ab0>,), kwargs: {}
2024-09-04 15:03:37,514 xinference.core.worker 3657471 DEBUG    Leave list_models, elapsed time: 0 s
2024-09-04 15:03:37,514 xinference.core.supervisor 3657471 DEBUG    Leave list_models, elapsed time: 0 s
2024-09-04 15:03:37,515 uvicorn.access 3657230 INFO     117.136.44.179:40415 - "GET /v1/models HTTP/1.1" 200
2024-09-04 15:04:07,377 vllm.engine.async_llm_engine 3980157 DEBUG    Starting health check...
2024-09-04 15:04:07,377 vllm.engine.async_llm_engine 3980157 DEBUG    Health check took 0.000223s
2024-09-04 15:04:26,391 uvicorn.access 3657230 INFO     117.136.44.179:40416 - "HEAD /qwen2%3A7b HTTP/1.1" 404
2024-09-04 15:04:26,922 uvicorn.access 3657230 INFO     117.136.44.179:40417 - "POST /v1/ui/qwen2%3A7b HTTP/1.1" 200
2024-09-04 15:04:27,160 uvicorn.access 3657230 INFO     117.136.44.179:40418 - "GET /qwen2%3A7b HTTP/1.1" 307
2024-09-04 15:04:27,172 xinference.core.supervisor 3657471 DEBUG    Enter list_models, args: (<xinference.core.supervisor.SupervisorActor object at 0x7fd753553290>,), kwargs: {}
2024-09-04 15:04:27,172 xinference.core.worker 3657471 DEBUG    Enter list_models, args: (<xinference.core.worker.WorkerActor object at 0x7fd75c783ab0>,), kwargs: {}
2024-09-04 15:04:27,172 xinference.core.worker 3657471 DEBUG    Leave list_models, elapsed time: 0 s
2024-09-04 15:04:27,172 xinference.core.supervisor 3657471 DEBUG    Leave list_models, elapsed time: 0 s
2024-09-04 15:04:27,173 uvicorn.access 3657230 INFO     117.136.44.179:40419 - "GET /v1/models HTTP/1.1" 200
2024-09-04 15:04:27,283 uvicorn.access 3657230 INFO     117.136.44.179:40420 - "GET /qwen2%3A7b/ HTTP/1.1" 200
2024-09-04 15:04:27,492 uvicorn.access 3657230 INFO     117.136.44.179:40421 - "GET /assets/index-1ebe9c14.css HTTP/1.1" 404
2024-09-04 15:04:27,600 uvicorn.access 3657230 INFO     117.136.44.179:40422 - "GET /qwen2%3A7b/info HTTP/1.1" 200
2024-09-04 15:04:27,707 uvicorn.access 3657230 INFO     117.136.44.179:40423 - "GET /qwen2%3A7b/theme.css HTTP/1.1" 200
2024-09-04 15:04:27,709 uvicorn.access 3657230 INFO     117.136.44.179:40424 - "GET /qwen2%3A7b/heartbeat/9e23nax0pmk HTTP/1.1" 200
2024-09-04 15:04:33,729 uvicorn.access 3657230 INFO     117.136.44.179:40425 - "POST /qwen2%3A7b/run/predict HTTP/1.1" 200
2024-09-04 15:04:34,097 uvicorn.access 3657230 INFO     117.136.44.179:41883 - "POST /qwen2%3A7b/run/predict HTTP/1.1" 200
2024-09-04 15:04:34,440 uvicorn.access 3657230 INFO     117.136.44.179:41884 - "POST /qwen2%3A7b/run/predict HTTP/1.1" 200
2024-09-04 15:04:34,564 uvicorn.access 3657230 INFO     117.136.44.179:41885 - "POST /qwen2%3A7b/queue/join HTTP/1.1" 200
2024-09-04 15:04:34,593 uvicorn.access 3657230 INFO     127.0.0.1:40702 - "GET /v1/cluster/auth HTTP/1.1" 200
2024-09-04 15:04:34,596 xinference.core.supervisor 3657471 DEBUG    Enter describe_model, args: (<xinference.core.supervisor.SupervisorActor object at 0x7fd753553290>, 'qwen2:7b'), kwargs: {}
2024-09-04 15:04:34,596 xinference.core.worker 3657471 DEBUG    Enter describe_model, args: (<xinference.core.worker.WorkerActor object at 0x7fd75c783ab0>,), kwargs: {'model_uid': 'qwen2:7b-1-0'}
2024-09-04 15:04:34,596 xinference.core.worker 3657471 DEBUG    Leave describe_model, elapsed time: 0 s
2024-09-04 15:04:34,597 xinference.core.supervisor 3657471 DEBUG    Leave describe_model, elapsed time: 0 s
2024-09-04 15:04:34,597 uvicorn.access 3657230 INFO     127.0.0.1:40718 - "GET /v1/models/qwen2%3A7b HTTP/1.1" 200
2024-09-04 15:04:34,599 xinference.core.supervisor 3657471 DEBUG    Enter get_model, args: (<xinference.core.supervisor.SupervisorActor object at 0x7fd753553290>, 'qwen2:7b'), kwargs: {}
2024-09-04 15:04:34,600 xinference.core.worker 3657471 DEBUG    Enter get_model, args: (<xinference.core.worker.WorkerActor object at 0x7fd75c783ab0>,), kwargs: {'model_uid': 'qwen2:7b-1-0'}
2024-09-04 15:04:34,600 xinference.core.worker 3657471 DEBUG    Leave get_model, elapsed time: 0 s
2024-09-04 15:04:34,600 xinference.core.supervisor 3657471 DEBUG    Leave get_model, elapsed time: 0 s
2024-09-04 15:04:34,600 xinference.core.supervisor 3657471 DEBUG    Enter describe_model, args: (<xinference.core.supervisor.SupervisorActor object at 0x7fd753553290>, 'qwen2:7b'), kwargs: {}
2024-09-04 15:04:34,600 xinference.core.worker 3657471 DEBUG    Enter describe_model, args: (<xinference.core.worker.WorkerActor object at 0x7fd75c783ab0>,), kwargs: {'model_uid': 'qwen2:7b-1-0'}
2024-09-04 15:04:34,601 xinference.core.worker 3657471 DEBUG    Leave describe_model, elapsed time: 0 s
2024-09-04 15:04:34,601 xinference.core.supervisor 3657471 DEBUG    Leave describe_model, elapsed time: 0 s
2024-09-04 15:04:34,601 uvicorn.access 3657230 INFO     127.0.0.1:40720 - "POST /v1/chat/completions HTTP/1.1" 200
2024-09-04 15:04:34,602 xinference.core.model 3980157 DEBUG    Enter wrapped_func, args: (<xinference.core.model.ModelActor object at 0x7fe06dbb3010>, 'hi', '', [], {'max_tokens': 512, 'temperature': 1.0, 'stream': True, 'lora_name': ''}), kwargs: {'raw_params': {'max_tokens': 512, 'temperature': 1, 'stream': True, 'lora_name': ''}}
2024-09-04 15:04:34,602 xinference.core.model 3980157 DEBUG    Request chat, current serve request count: 0, request limit: None for the model qwen2:7b
2024-09-04 15:04:34,603 xinference.model.llm.vllm.core 3980157 DEBUG    Enter generate, prompt: <|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
hi<|im_end|>
<|im_start|>assistant
, generate config: {'max_tokens': 512, 'temperature': 1.0, 'stream': True, 'lora_name': '', 'stop': ['<|endoftext|>', '<|im_start|>', '<|im_end|>'], 'stop_token_ids': [151643, 151644, 151645]}
2024-09-04 15:04:34,603 xinference.core.model 3980157 DEBUG    After request chat, current serve request count: 0 for the model qwen2:7b
2024-09-04 15:04:34,603 xinference.core.model 3980157 DEBUG    Leave wrapped_func, elapsed time: 0 s
2024-09-04 15:04:34,604 vllm.engine.async_llm_engine 3980157 INFO     Added request f0e0b76c-6a8b-11ef-9f2a-fa163e429a2f.
2024-09-04 15:04:34,604 vllm.engine.async_llm_engine 3980157 DEBUG    Waiting for new requests...
2024-09-04 15:04:34,605 vllm.engine.async_llm_engine 3980157 DEBUG    Got new requests!
2024-09-04 15:04:34,668 uvicorn.access 3657230 INFO     117.136.44.179:41886 - "GET /qwen2%3A7b/queue/data?session_hash=9e23nax0pmk HTTP/1.1" 200
2024-09-04 15:04:34,725 vllm.engine.metrics 3980157 INFO     Avg prompt throughput: 0.3 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
2024-09-04 15:04:34,922 vllm.engine.async_llm_engine 3980157 INFO     Finished request f0e0b76c-6a8b-11ef-9f2a-fa163e429a2f.
2024-09-04 15:04:34,922 vllm.engine.async_llm_engine 3980157 DEBUG    Waiting for new requests...
2024-09-04 15:04:37,378 vllm.engine.async_llm_engine 3980157 DEBUG    Starting health check...
2024-09-04 15:04:37,378 vllm.engine.async_llm_engine 3980157 DEBUG    Health check took 0.000199s
2024-09-04 15:04:39,564 uvicorn.access 3657230 INFO     117.136.44.179:41887 - "POST /qwen2%3A7b/run/predict HTTP/1.1" 200
2024-09-04 15:04:46,693 uvicorn.access 3657230 INFO     117.136.44.179:41889 - "POST /qwen2%3A7b/run/predict HTTP/1.1" 200
2024-09-04 15:04:47,060 uvicorn.access 3657230 INFO     117.136.44.179:41888 - "POST /qwen2%3A7b/run/predict HTTP/1.1" 200
2024-09-04 15:04:47,528 uvicorn.access 3657230 INFO     117.136.44.179:41890 - "POST /qwen2%3A7b/run/predict HTTP/1.1" 200
2024-09-04 15:04:47,653 uvicorn.access 3657230 INFO     117.136.44.179:41891 - "POST /qwen2%3A7b/queue/join HTTP/1.1" 200
2024-09-04 15:04:47,681 uvicorn.access 3657230 INFO     127.0.0.1:34488 - "GET /v1/cluster/auth HTTP/1.1" 200
2024-09-04 15:04:47,684 xinference.core.supervisor 3657471 DEBUG    Enter describe_model, args: (<xinference.core.supervisor.SupervisorActor object at 0x7fd753553290>, 'qwen2:7b'), kwargs: {}
2024-09-04 15:04:47,684 xinference.core.worker 3657471 DEBUG    Enter describe_model, args: (<xinference.core.worker.WorkerActor object at 0x7fd75c783ab0>,), kwargs: {'model_uid': 'qwen2:7b-1-0'}
2024-09-04 15:04:47,684 xinference.core.worker 3657471 DEBUG    Leave describe_model, elapsed time: 0 s
2024-09-04 15:04:47,684 xinference.core.supervisor 3657471 DEBUG    Leave describe_model, elapsed time: 0 s
2024-09-04 15:04:47,684 uvicorn.access 3657230 INFO     127.0.0.1:34494 - "GET /v1/models/qwen2%3A7b HTTP/1.1" 200
2024-09-04 15:04:47,687 xinference.core.supervisor 3657471 DEBUG    Enter get_model, args: (<xinference.core.supervisor.SupervisorActor object at 0x7fd753553290>, 'qwen2:7b'), kwargs: {}
2024-09-04 15:04:47,687 xinference.core.worker 3657471 DEBUG    Enter get_model, args: (<xinference.core.worker.WorkerActor object at 0x7fd75c783ab0>,), kwargs: {'model_uid': 'qwen2:7b-1-0'}
2024-09-04 15:04:47,687 xinference.core.worker 3657471 DEBUG    Leave get_model, elapsed time: 0 s
2024-09-04 15:04:47,687 xinference.core.supervisor 3657471 DEBUG    Leave get_model, elapsed time: 0 s
2024-09-04 15:04:47,688 xinference.core.supervisor 3657471 DEBUG    Enter describe_model, args: (<xinference.core.supervisor.SupervisorActor object at 0x7fd753553290>, 'qwen2:7b'), kwargs: {}
2024-09-04 15:04:47,688 xinference.core.worker 3657471 DEBUG    Enter describe_model, args: (<xinference.core.worker.WorkerActor object at 0x7fd75c783ab0>,), kwargs: {'model_uid': 'qwen2:7b-1-0'}
2024-09-04 15:04:47,688 xinference.core.worker 3657471 DEBUG    Leave describe_model, elapsed time: 0 s
2024-09-04 15:04:47,688 xinference.core.supervisor 3657471 DEBUG    Leave describe_model, elapsed time: 0 s
2024-09-04 15:04:47,689 uvicorn.access 3657230 INFO     127.0.0.1:34500 - "POST /v1/chat/completions HTTP/1.1" 200
2024-09-04 15:04:47,689 xinference.core.model 3980157 DEBUG    Enter wrapped_func, args: (<xinference.core.model.ModelActor object at 0x7fe06dbb3010>, '你是谁', '', [{'role': 'user', 'content': 'hi'}, {'role': 'assistant', 'content': 'Hello! How can I assist you today?'}], {'max_tokens': 512, 'temperature': 1.0, 'stream': True, 'lora_name': ''}), kwargs: {'raw_params': {'max_tokens': 512, 'temperature': 1, 'stream': True, 'lora_name': ''}}
2024-09-04 15:04:47,690 xinference.core.model 3980157 DEBUG    Request chat, current serve request count: 0, request limit: None for the model qwen2:7b
2024-09-04 15:04:47,690 xinference.model.llm.vllm.core 3980157 DEBUG    Enter generate, prompt: <|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
hi<|im_end|>
<|im_start|>assistant
Hello! How can I assist you today?<|im_end|>
<|im_start|>user
你是谁<|im_end|>
<|im_start|>assistant
, generate config: {'max_tokens': 512, 'temperature': 1.0, 'stream': True, 'lora_name': '', 'stop': ['<|endoftext|>', '<|im_start|>', '<|im_end|>'], 'stop_token_ids': [151643, 151644, 151645]}
2024-09-04 15:04:47,690 xinference.core.model 3980157 DEBUG    After request chat, current serve request count: 0 for the model qwen2:7b
2024-09-04 15:04:47,690 xinference.core.model 3980157 DEBUG    Leave wrapped_func, elapsed time: 0 s
2024-09-04 15:04:47,691 vllm.engine.async_llm_engine 3980157 INFO     Added request f8ada946-6a8b-11ef-9f2a-fa163e429a2f.
2024-09-04 15:04:47,692 vllm.engine.async_llm_engine 3980157 DEBUG    Got new requests!
2024-09-04 15:04:47,760 uvicorn.access 3657230 INFO     117.136.44.179:41892 - "GET /qwen2%3A7b/queue/data?session_hash=9e23nax0pmk HTTP/1.1" 200
2024-09-04 15:04:47,893 vllm.engine.metrics 3980157 INFO     Avg prompt throughput: 3.1 tokens/s, Avg generation throughput: 0.8 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
2024-09-04 15:04:48,738 vllm.engine.async_llm_engine 3980157 INFO     Finished request f8ada946-6a8b-11ef-9f2a-fa163e429a2f.
2024-09-04 15:04:48,738 vllm.engine.async_llm_engine 3980157 DEBUG    Waiting for new requests...
2024-09-04 15:05:07,379 vllm.engine.async_llm_engine 3980157 DEBUG    Starting health check...
2024-09-04 15:05:07,380 vllm.engine.async_llm_engine 3980157 DEBUG    Health check took 0.000216s
2024-09-04 15:05:16,315 uvicorn.access 3657230 INFO     117.136.44.179:41893 - "POST /qwen2%3A7b/run/predict HTTP/1.1" 200
2024-09-04 15:05:37,381 vllm.engine.async_llm_engine 3980157 DEBUG    Starting health check...
2024-09-04 15:05:37,381 vllm.engine.async_llm_engine 3980157 DEBUG    Health check took 0.000220s
2024-09-04 15:06:07,382 vllm.engine.async_llm_engine 3980157 DEBUG    Starting health check...
2024-09-04 15:06:07,382 vllm.engine.async_llm_engine 3980157 DEBUG    Health check took 0.000223s
2024-09-04 15:06:37,383 vllm.engine.async_llm_engine 3980157 DEBUG    Starting health check...
2024-09-04 15:06:37,383 vllm.engine.async_llm_engine 3980157 DEBUG    Health check took 0.000208s
2024-09-04 15:07:07,384 vllm.engine.async_llm_engine 3980157 DEBUG    Starting health check...
2024-09-04 15:07:07,384 vllm.engine.async_llm_engine 3980157 DEBUG    Health check took 0.000224s
2024-09-04 15:07:37,386 vllm.engine.async_llm_engine 3980157 DEBUG    Starting health check...
2024-09-04 15:07:37,386 vllm.engine.async_llm_engine 3980157 DEBUG    Health check took 0.000227s
2024-09-04 15:08:07,388 vllm.engine.async_llm_engine 3980157 DEBUG    Starting health check...
2024-09-04 15:08:07,388 vllm.engine.async_llm_engine 3980157 DEBUG    Health check took 0.000228s
2024-09-04 15:08:29,187 uvicorn.access 3657230 INFO     117.136.44.179:41894 - "POST /qwen2%3A7b/run/predict HTTP/1.1" 200
2024-09-04 15:08:29,543 uvicorn.access 3657230 INFO     117.136.44.179:41895 - "POST /qwen2%3A7b/run/predict HTTP/1.1" 200
2024-09-04 15:08:29,880 uvicorn.access 3657230 INFO     117.136.44.179:41896 - "POST /qwen2%3A7b/run/predict HTTP/1.1" 200
2024-09-04 15:08:29,996 uvicorn.access 3657230 INFO     117.136.44.179:41897 - "POST /qwen2%3A7b/queue/join HTTP/1.1" 200
2024-09-04 15:08:30,035 uvicorn.access 3657230 INFO     127.0.0.1:56418 - "GET /v1/cluster/auth HTTP/1.1" 200
2024-09-04 15:08:30,038 xinference.core.supervisor 3657471 DEBUG    Enter describe_model, args: (<xinference.core.supervisor.SupervisorActor object at 0x7fd753553290>, 'qwen2:7b'), kwargs: {}
2024-09-04 15:08:30,038 xinference.core.worker 3657471 DEBUG    Enter describe_model, args: (<xinference.core.worker.WorkerActor object at 0x7fd75c783ab0>,), kwargs: {'model_uid': 'qwen2:7b-1-0'}
2024-09-04 15:08:30,038 xinference.core.worker 3657471 DEBUG    Leave describe_model, elapsed time: 0 s
2024-09-04 15:08:30,038 xinference.core.supervisor 3657471 DEBUG    Leave describe_model, elapsed time: 0 s
2024-09-04 15:08:30,039 uvicorn.access 3657230 INFO     127.0.0.1:56422 - "GET /v1/models/qwen2%3A7b HTTP/1.1" 200
2024-09-04 15:08:30,041 xinference.core.supervisor 3657471 DEBUG    Enter get_model, args: (<xinference.core.supervisor.SupervisorActor object at 0x7fd753553290>, 'qwen2:7b'), kwargs: {}
2024-09-04 15:08:30,041 xinference.core.worker 3657471 DEBUG    Enter get_model, args: (<xinference.core.worker.WorkerActor object at 0x7fd75c783ab0>,), kwargs: {'model_uid': 'qwen2:7b-1-0'}
2024-09-04 15:08:30,041 xinference.core.worker 3657471 DEBUG    Leave get_model, elapsed time: 0 s
2024-09-04 15:08:30,041 xinference.core.supervisor 3657471 DEBUG    Leave get_model, elapsed time: 0 s
2024-09-04 15:08:30,042 xinference.core.supervisor 3657471 DEBUG    Enter describe_model, args: (<xinference.core.supervisor.SupervisorActor object at 0x7fd753553290>, 'qwen2:7b'), kwargs: {}
2024-09-04 15:08:30,042 xinference.core.worker 3657471 DEBUG    Enter describe_model, args: (<xinference.core.worker.WorkerActor object at 0x7fd75c783ab0>,), kwargs: {'model_uid': 'qwen2:7b-1-0'}
2024-09-04 15:08:30,042 xinference.core.worker 3657471 DEBUG    Leave describe_model, elapsed time: 0 s
2024-09-04 15:08:30,042 xinference.core.supervisor 3657471 DEBUG    Leave describe_model, elapsed time: 0 s
2024-09-04 15:08:30,043 uvicorn.access 3657230 INFO     127.0.0.1:56436 - "POST /v1/chat/completions HTTP/1.1" 200
2024-09-04 15:08:30,043 xinference.core.model 3980157 DEBUG    Enter wrapped_func, args: (<xinference.core.model.ModelActor object at 0x7fe06dbb3010>, '你是谁', '', [{'role': 'user', 'content': 'hi'}, {'role': 'assistant', 'content': 'Hello! How can I assist you today?'}], {'max_tokens': 512, 'temperature': 1.0, 'stream': True, 'lora_name': ''}), kwargs: {'raw_params': {'max_tokens': 512, 'temperature': 1, 'stream': True, 'lora_name': ''}}
2024-09-04 15:08:30,044 xinference.core.model 3980157 DEBUG    Request chat, current serve request count: 0, request limit: None for the model qwen2:7b
2024-09-04 15:08:30,044 xinference.model.llm.vllm.core 3980157 DEBUG    Enter generate, prompt: <|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
hi<|im_end|>
<|im_start|>assistant
Hello! How can I assist you today?<|im_end|>
<|im_start|>user
你是谁<|im_end|>
<|im_start|>assistant
, generate config: {'max_tokens': 512, 'temperature': 1.0, 'stream': True, 'lora_name': '', 'stop': ['<|endoftext|>', '<|im_start|>', '<|im_end|>'], 'stop_token_ids': [151643, 151644, 151645]}
2024-09-04 15:08:30,044 xinference.core.model 3980157 DEBUG    After request chat, current serve request count: 0 for the model qwen2:7b
2024-09-04 15:08:30,044 xinference.core.model 3980157 DEBUG    Leave wrapped_func, elapsed time: 0 s
2024-09-04 15:08:30,045 vllm.engine.async_llm_engine 3980157 INFO     Added request 7d363354-6a8c-11ef-9f2a-fa163e429a2f.
2024-09-04 15:08:30,046 vllm.engine.async_llm_engine 3980157 DEBUG    Got new requests!
2024-09-04 15:08:30,113 uvicorn.access 3657230 INFO     117.136.44.179:41898 - "GET /qwen2%3A7b/queue/data?session_hash=9e23nax0pmk HTTP/1.1" 200
2024-09-04 15:08:30,251 vllm.engine.metrics 3980157 INFO     Avg prompt throughput: 0.2 tokens/s, Avg generation throughput: 0.2 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
2024-09-04 15:08:31,140 vllm.engine.async_llm_engine 3980157 INFO     Finished request 7d363354-6a8c-11ef-9f2a-fa163e429a2f.
2024-09-04 15:08:31,140 vllm.engine.async_llm_engine 3980157 DEBUG    Waiting for new requests...
2024-09-04 15:08:37,389 vllm.engine.async_llm_engine 3980157 DEBUG    Starting health check...
2024-09-04 15:08:37,389 vllm.engine.async_llm_engine 3980157 DEBUG    Health check took 0.000205s
2024-09-04 15:08:46,570 uvicorn.access 3657230 INFO     117.136.44.179:41899 - "POST /qwen2%3A7b/run/predict HTTP/1.1" 200

Expected behavior / 期待表现

vLLM inference speed returns to normal.

linssbf commented 2 months ago

N-GPU=auto; with vllm use awq, with sglang use gptq
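Reading this suggestion as "set n_gpu to auto and prefer an AWQ build under vLLM (or GPTQ under SGLang)", a launch call might look like the sketch below. Whether an awq build of qwen2-instruct is registered for this setup, and its exact quantization label, are assumptions.

from xinference.client import Client

client = Client("http://localhost:49640")

# n_gpu="auto" per the suggestion above; model_format="awq" assumes an AWQ
# build of qwen2-instruct is available (the quantization label may differ).
client.launch_model(
    model_name="qwen2-instruct",
    model_engine="vLLM",
    model_format="awq",
    model_size_in_billions=7,
    quantization="Int4",
    n_gpu="auto",
)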

github-actions[bot] commented 1 month ago

This issue is stale because it has been open for 7 days with no activity.

github-actions[bot] commented 1 month ago

This issue was closed because it has been inactive for 5 days since being marked as stale.