xorbitsai / inference

Replace OpenAI GPT with another LLM in your app by changing a single line of code. Xinference gives you the freedom to use any LLM you need. With Xinference, you're empowered to run inference with any open-source language models, speech recognition models, and multimodal models, whether in the cloud, on-premises, or even on your laptop.
https://inference.readthedocs.io
Apache License 2.0

Upgrade vllm and sglang to new version and support gemma model correctly #1869

Closed vikrantrathore closed 4 months ago

vikrantrathore commented 4 months ago

Feature request / 功能建议

At present, the vLLM and SGLang versions used in Xinference are outdated. The latest SGLang supports Gemma 2 models out of the box, but the SGLang engine in Xinference does not. The same is true for vLLM: the current release, vLLM 0.5.1, also supports Gemma 2 out of the box. The current Gemma implementation in Xinference raises errors and does not stop token generation, probably because of the new 4096-token sliding-window attention in Gemma 2.

Motivation / 动机

Upgrade to the new versions of SGLang and vLLM.

Your contribution / 您的贡献

No PR yet, but I can create one.

qinxuye commented 4 months ago

OK, thanks, we will support it. Actually, it is not hard to add a model; are you interested in contributing?

Refer to the code below to add new models to vLLM.

https://github.com/xorbitsai/inference/blob/e80910d9ec159b04c950a47910c6630c3f16e27c/xinference/model/llm/vllm/core.py#L133-L149
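For orientation, here is a minimal sketch of what that kind of registration looks like, assuming the vLLM backend keeps flat VLLM_SUPPORTED_MODELS / VLLM_SUPPORTED_CHAT_MODELS lists analogous to the SGLang lists quoted further down (the list names, existing entries, and the version guard are assumptions, not the actual contents of the linked file):

# Hypothetical sketch of the supported-model registry in
# xinference/model/llm/vllm/core.py; the real file's list names, entries,
# and guards may differ from the linked code.
VLLM_SUPPORTED_MODELS = ["llama-2", "mistral-v0.1", "mixtral-v0.1"]
VLLM_SUPPORTED_CHAT_MODELS = [
    "llama-2-chat",
    "qwen-chat",
    "mistral-instruct-v0.1",
    "mixtral-instruct-v0.1",
]

# Adding Gemma support would amount to appending the new family names.
# Gemma 2 needs vllm >= 0.5.1, so a version guard is sensible.
try:
    import vllm
    from packaging.version import Version

    if Version(vllm.__version__) >= Version("0.5.1"):
        VLLM_SUPPORTED_CHAT_MODELS.extend(["gemma-it", "gemma-2-it"])
except ImportError:
    pass  # vLLM not installed; leave the lists unchanged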

vikrantrathore commented 4 months ago

I tried to add the code for the Gemma 2 model to SGLang, but the system still gives an error. Most likely we need to check how model_family is handled for custom LLMs; in FastChat it works without any errors using fastchat.serve.sglang_worker and does not need a prompt style, since it picks one automatically.

Gemma is registered with Xinference as a custom model because it is already on my server and I wanted to use it.

Following are the details of the changes and the custom LLM file.

https://github.com/xorbitsai/inference/blob/e80910d9ec159b04c950a47910c6630c3f16e27c/xinference/model/llm/sglang/core.py#L65-L75

was changed to:

SGLANG_SUPPORTED_MODELS = ["llama-2", "mistral-v0.1", "mixtral-v0.1"]
SGLANG_SUPPORTED_CHAT_MODELS = [
    "llama-2-chat",
    "qwen-chat",
    "qwen1.5-chat",
    "mistral-instruct-v0.1",
    "mistral-instruct-v0.2",
    "mixtral-instruct-v0.1",
    "gemma-it",
    "gemma-2-it",
    # "test-gemma-2-it",  # commented out as it is not working
]

Custom model registration file named gemma.json:

{
    "version": 1,
    "context_length": 8192,
    "model_name": "test-gemma-2-it",
    "model_lang": [
        "en",
        "zh"
    ],
    "model_ability": [
        "chat"
    ],
    "model_description": "Gemma is a family of lightweight, state-of-the-art open models from Google, built from the same research and technology used to create the Gemini models.",
    "model_family": "gemma-2-it",
    "model_specs": [
        {
            "model_format": "pytorch",
            "model_size_in_billions": 9,
            "quantizations": [
                "none",
                "4-bit",
                "8-bit"
            ],
            "model_id": "gemma-2-9b-it",
            "model_uri": "file:///home/ubuntu/projects/llm_models/gemma/gemma-2-9b-it"
        },
        {
            "model_format": "pytorch",
            "model_size_in_billions": 27,
            "quantizations": [
                "none",
                "4-bit",
                "8-bit"
            ],
            "model_id": "gemma-2-27b-it",
            "model_uri": "file:///home/ubuntu/projects/llm_models/gemma/gemma-2-27b-it"
        }
    ],
    "prompt_style": {
        "style_name": "gemma",
        "roles": [
            "user",
            "model"
        ],
        "stop": [
            "<end_of_turn>",
            "<start_of_turn>"
        ]
    }
}
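For completeness, one way to register this gemma.json with a running server is through the Python client; a minimal sketch, assuming the local endpoint and API key used in the launch command below (the api_key argument and the exact register_model signature may vary between Xinference versions):

# Sketch: register the custom model definition above with a running
# Xinference server. Endpoint, API key, and file path are assumptions.
from xinference.client import RESTfulClient

client = RESTfulClient("http://127.0.0.1:18888", api_key="sk-testapikey")

with open("gemma.json") as f:
    model_definition = f.read()  # the JSON document shown above

# persist=True keeps the registration across server restarts
client.register_model(model_type="LLM", model=model_definition, persist=True)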

Errors encountered while trying to launch after successful registration:

xinference launch --model-engine sglang -n test-gemma-2-it -u gemma-2-9b-it --max_model_len 8192 --gpu_memory_utilization 0.90 -e http://127.0.0.1:18888 --api-key "sk-testapikey"

2024-07-16 13:36:54,197 xinference.api.restful_api 205487 ERROR    [address=0.0.0.0:60601, pid=205563] Model test-gemma-2-it cannot be run on engine sglang.
Traceback (most recent call last):
  File "/home/ubuntu/projects/inference/xinference/api/restful_api.py", line 835, in launch_model
    model_uid = await (await self._get_supervisor_ref()).launch_builtin_model(
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/projects/testai_inference/.venv/lib/python3.11/site-packages/xoscar/backends/context.py", line 231, in send
    return self._process_result_message(result)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/projects/testai_inference/.venv/lib/python3.11/site-packages/xoscar/backends/context.py", line 102, in _process_result_message
    raise message.as_instanceof_cause()
  File "/home/ubuntu/projects/testai_inference/.venv/lib/python3.11/site-packages/xoscar/backends/pool.py", line 656, in
  File "/home/ubuntu/projects/inference/xinference/core/supervisor.py", line 988, in launch_builtin_model
    await _launch_model()
    ^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/projects/inference/xinference/core/supervisor.py", line 952, in _launch_model
    await _launch_one_model(rep_model_uid)
    ^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/projects/inference/xinference/core/supervisor.py", line 932, in _launch_one_model
    await worker_ref.launch_builtin_model(
    ^^^^^^^^^^^^^^^^^
  File "xoscar/core.pyx", line 284, in __pyx_actor_method_wrapper
    async with lock:
  File "xoscar/core.pyx", line 287, in xoscar.core.__pyx_actor_method_wrapper
    result = await result
    ^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/projects/inference/xinference/core/utils.py", line 45, in wrapped
    ret = await func(*args, **kwargs)
    ^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/projects/inference/xinference/core/worker.py", line 816, in launch_builtin_model
    model, model_description = await asyncio.to_thread(
    ^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/.pyenv/versions/3.11.2/lib/python3.11/asyncio/threads.py", line 25, in to_thread
    return await loop.run_in_executor(None, func_call)
      ^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/.pyenv/versions/3.11.2/lib/python3.11/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
    ^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/projects/inference/xinference/model/core.py", line 69, in create_model_instance
    return create_llm_model_instance(
    ^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/projects/inference/xinference/model/llm/core.py", line 215, in create_llm_model_instance
    llm_cls = check_engine_by_spec_parameters(
      ^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/projects/inference/xinference/model/llm/llm_family.py", line 1193, in check_engine_by_spec_parameters
    raise ValueError(f"Model {model_name} cannot be run on engine {model_engine}.")
    ^^^^^^^^^^^^^^^^^
ValueError: [address=0.0.0.0:60601, pid=205563] Model test-gemma-2-it cannot be run on engine sglang.
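As a side note for anyone hitting the same error: below is a simplified, illustrative sketch of the kind of check that raises this ValueError, consistent with the guess above that the engine match keys on the custom model's name rather than its model_family (the real check_engine_by_spec_parameters in llm_family.py also matches format, size, and quantization, so this is only an approximation, not the actual code):

# Illustrative only: not the actual check_engine_by_spec_parameters code.
SGLANG_SUPPORTED_CHAT_MODELS = ["llama-2-chat", "gemma-it", "gemma-2-it"]

def check_engine(model_name: str, model_engine: str, supported: list[str]) -> None:
    # A custom model registered as "test-gemma-2-it" fails a plain membership
    # test even though its model_family ("gemma-2-it") is in the list.
    if model_engine.lower() == "sglang" and model_name not in supported:
        raise ValueError(f"Model {model_name} cannot be run on engine {model_engine}.")

try:
    check_engine("test-gemma-2-it", "sglang", SGLANG_SUPPORTED_CHAT_MODELS)
except ValueError as e:
    print(e)  # Model test-gemma-2-it cannot be run on engine sglang.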
qinxuye commented 4 months ago

The environment variable XINFERENCE_ENABLE_SGLANG=1 now needs to be set to enable SGLang; did you add it?

vikrantrathore commented 4 months ago

The environment variable XINFERENCE_ENABLE_SGLANG=1 now needs to be set to enable SGLang; did you add it?

I have tried it, and after installing FlashInfer and updating SGLang, vLLM could run the Gemma 2 model. But it then gives another error after the model worker starts, related to some usage API called after the SGLang model worker starts. Also, inference does not work even though the SGLang worker is running. I will try some more experiments. But if you are upgrading vLLM to 0.5.2 and SGLang to 0.1.21, it might still give an error.

Following are the details of the error:

XINFERENCE_ENABLE_SGLANG=1 xinference-local --host 0.0.0.0 --port 18888 --auth-config custom_auth.json
2024-07-19 04:20:22,003 xinference.core.supervisor 17695 INFO     Xinference supervisor 0.0.0.0:36702 started
2024-07-19 04:20:22,037 xinference.model.image.core 17695 WARNING  Cannot find builtin image model spec: stable-diffusion-inpainting
2024-07-19 04:20:22,038 xinference.model.image.core 17695 WARNING  Cannot find builtin image model spec: stable-diffusion-2-inpainting
2024-07-19 04:20:22,172 xinference.core.worker 17695 INFO     Starting metrics export server at 0.0.0.0:None
2024-07-19 04:20:22,173 xinference.core.worker 17695 INFO     Checking metrics export server...
2024-07-19 04:20:24,678 xinference.core.worker 17695 INFO     Metrics server is started at: http://0.0.0.0:42793
2024-07-19 04:20:24,679 xinference.core.worker 17695 INFO     Xinference worker 0.0.0.0:36702 started
2024-07-19 04:20:24,679 xinference.core.worker 17695 INFO     Purge cache directory: /home/ubuntu/.xinference/cache
2024-07-19 04:20:28,055 xinference.api.restful_api 17622 INFO     Starting Xinference at endpoint: http://0.0.0.0:18888
2024-07-19 04:20:28,685 uvicorn.error 17622 INFO     Uvicorn running on http://0.0.0.0:18888 (Press CTRL+C to quit)
2024-07-19 04:20:53,625 xinference.model.llm.llm_family 17695 INFO     Caching from URI: file:///home/ubuntu/projects/llm_models/gemma/gemma-2-9b-it
2024-07-19 04:20:53,626 xinference.model.llm.llm_family 17695 INFO     Cache /home/ubuntu/projects/llm_models/gemma/gemma-2-9b-it exists
2024-07-19 04:20:53,642 xinference.model.llm.sglang.core 17934 INFO     Loading gemma-2-9b-it with following model config: {'trust_remote_code': True, 'tokenizer_mode': 'auto', 'tp_size': 1, 'mem_fraction_static': 0.9, 'log_level': 'info', 'attention_reduce_in_fp32': False}
[gpu_id=0] Init nccl begin.
[gpu_id=0] Load weight begin. avail mem=23.17 GB
WARNING 07-19 04:21:07 utils.py:558] Gemma 2 uses sliding window attention for every odd layer, which is currently not supported by vLLM. Disabling sliding window and capping the max length to the sliding window size (4096).
WARNING 07-19 04:21:08 interfaces.py:131] The model (<class 'sglang.srt.models.gemma2.Gemma2ForCausalLM'>) contains all LoRA-specific attributes, but does not set `supports_lora=True`.
[gpu_id=0] Load weight end. type=Gemma2ForCausalLM, dtype=torch.bfloat16, avail mem=5.72 GB
[gpu_id=0] Memory pool end. avail mem=2.26 GB
[gpu_id=0] Capture cuda graph begin.
[gpu_id=0] max_total_num_tokens=10611, max_prefill_tokens=16384, context_len=8192
[gpu_id=0] server_args: disable_flashinfer=False, attention_reduce_in_fp32=False, disable_radix_cache=False, disable_regex_jump_forward=False, disable_disk_cache=False,
INFO:     Started server process [18011]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://127.0.0.1:30000 (Press CTRL+C to quit)
INFO:     127.0.0.1:40156 - "GET /get_model_info HTTP/1.1" 200 OK
[gpu_id=0] Prefill batch. #new-seq: 1, #new-token: 7, #cached-token: 0, cache hit rate: 0.00%, #running-req: 0, #queue-req: 0
INFO:     127.0.0.1:40168 - "POST /generate HTTP/1.1" 200 OK
The server is fired up and ready to roll!
INFO:     127.0.0.1:40184 - "GET /get_model_info HTTP/1.1" 200 OK
INFO:     127.0.0.1:37770 - "POST /generate HTTP/1.1" 200 OK
[gpu_id=0] Prefill batch. #new-seq: 1, #new-token: 17, #cached-token: 1, cache hit rate: 4.00%, #running-req: 0, #queue-req: 0
2024-07-19 04:22:18,208 xinference.api.restful_api 17622 ERROR    Chat completion stream got an error: [address=0.0.0.0:36671, pid=17934] cannot access free variable 'include_usage' where it is not associated with a value in enclosing scope
Traceback (most recent call last):
File "/home/ubuntu/projects/testai/.venv/lib/python3.11/site-packages/xinference/api/restful_api.py", line 1660, in stream_results
  async for item in iterator:                                                               
  File "/home/ubuntu/projects/testai/.venv/lib/python3.11/site-packages/xoscar/api.py", line 340, in __anext__
    return await self._actor_ref.__xoscar_next__(self._uid)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/projects/testai/.venv/lib/python3.11/site-packages/xoscar/backends/context.py", line 231, in send
    return self._process_result_message(result)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/projects/testai/.venv/lib/python3.11/site-packages/xoscar/backends/context.py", line 102, in _process_result_message
    raise message.as_instanceof_cause()
  File "/home/ubuntu/projects/testai/.venv/lib/python3.11/site-packages/xoscar/backends/pool.py", line 656, in send
    result = await self._run_coro(message.message_id, coro)
    ^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/projects/testai/.venv/lib/python3.11/site-packages/xoscar/backends/pool.py", line 367, in _run_coro
    return await coro
  File "/home/ubuntu/projects/testai/.venv/lib/python3.11/site-packages/xoscar/api.py", line 384, in __on_receive__
    return await super().__on_receive__(message)  # type: ignore
    ^^^^^^^^^^^^^^^^^
  File "xoscar/core.pyx", line 558, in __on_receive__
    raise ex
  File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.__on_receive__
    async with self._lock:
    ^^^^^^^^^^^^^^^^^
  File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.__on_receive__
    with debug_async_timeout('actor_lock_timeout',
    ^^^^^^^^^^^^^^^^^
  File "xoscar/core.pyx", line 526, in xoscar.core._BaseActor.__on_receive__
    result = await result
    ^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/projects/testai/.venv/lib/python3.11/site-packages/xoscar/api.py", line 431, in __xoscar_next__
    raise e
  File "/home/ubuntu/projects/testai/.venv/lib/python3.11/site-packages/xoscar/api.py", line 419, in __xoscar_next__
    r = await asyncio.create_task(_async_wrapper(gen))
    ^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/projects/testai/.venv/lib/python3.11/site-packages/xoscar/api.py", line 409, in _async_wrapper
    return await _gen.__anext__()  # noqa: F821
    ^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/projects/testai/.venv/lib/python3.11/site-packages/xinference/core/model.py", line 355, in _to_async_gen
    async for v in gen:
    ^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/projects/testai/.venv/lib/python3.11/site-packages/xinference/model/llm/utils.py", line 574, in _async_to_chat_completion_chunks
    async for chunk in chunks:
    ^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/projects/testai/.venv/lib/python3.11/site-packages/xinference/model/llm/sglang/core.py", line 309, in stream_results
    if include_usage:
NameError: [address=0.0.0.0:36671, pid=17934] cannot access free variable 'include_usage' where it is not associated with a value in enclosing scope
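For background, this NameError is CPython's generic message for a nested generator reading a free variable that the enclosing scope only binds conditionally; a minimal standalone reproduction (not the actual sglang/core.py code) is:

# Minimal illustration of the error class above: the inner generator closes
# over include_usage, but the enclosing scope only binds it on a branch that
# never ran, so reading it raises the "cannot access free variable" NameError.
def make_stream(enable_usage_branch: bool):
    if enable_usage_branch:
        include_usage = True  # only bound on this branch

    def stream_results():
        if include_usage:  # free variable from the enclosing scope
            yield "usage chunk"
        yield "data chunk"

    return stream_results

try:
    for chunk in make_stream(enable_usage_branch=False)():
        print(chunk)
except NameError as e:
    print(e)
# cannot access free variable 'include_usage' where it is not associated
# with a value in enclosing scope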
qinxuye commented 4 months ago

@wxiwnd Can you take a look at this? Why does include_usage give an error?

qinxuye commented 4 months ago

Hi @vikrantrathore, can you try merging the main branch to see if you can move forward?

vikrantrathore commented 4 months ago

Hi @vikrantrathore, can you try merging the main branch to see if you can move forward?

Yes, the changes you made now work for SGLang, and its tokens/s is 2x vLLM's, even though both are using flashinfer-0.0.8. But I still face a few smaller issues:

  1. I am using sglang==0.1.21 and vllm==0.5.2, which upgrade other packages to torch-2.3.1, torchvision-0.18.1, triton-2.3.1, vllm-flash-attn-2.5.9.post1, xformers-0.0.27, and lm-format-enforcer-0.10.3. As a result, torch-2.3.1 is incompatible with torchaudio-2.3.0 (which requires torch 2.3.0).
  2. I face an issue with parallel generation, as it uses some ggml code for stream processing (logs and traceback below; see also the sketch after the traceback). My thinking is that making the SGLang code similar to FastChat's sglang_worker will solve this problem.
[gpu_id=0] Prefill batch. #new-seq: 1, #new-token: 41, #cached-token: 7, cache hit rate: 38.65%, #running-req: 0, #queue-req: 0
[gpu_id=0] Decode batch. #running-req: 1, #token: 60, token usage: 0.01, gen throughput (token/s): 1.18, #queue-req: 0
[gpu_id=0] Decode batch. #running-req: 1, #token: 100, token usage: 0.01, gen throughput (token/s): 43.61, #queue-req: 0
[gpu_id=0] Decode batch. #running-req: 1, #token: 140, token usage: 0.02, gen throughput (token/s): 43.27, #queue-req: 0
[gpu_id=0] Decode batch. #running-req: 1, #token: 180, token usage: 0.02, gen throughput (token/s): 43.12, #queue-req: 0
[gpu_id=0] Decode batch. #running-req: 1, #token: 220, token usage: 0.02, gen throughput (token/s): 43.06, #queue-req: 0
[gpu_id=0] Decode batch. #running-req: 1, #token: 260, token usage: 0.03, gen throughput (token/s): 42.99, #queue-req: 0
return await super().__on_receive__(message)  # type: ignore
    ^^^^^^^^^^^^^^^^^
  File "xoscar/core.pyx", line 558, in __on_receive__
    raise ex
  File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.__on_receive__
    async with self._lock:
    ^^^^^^^^^^^^^^^^^
  File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.__on_receive__
    with debug_async_timeout('actor_lock_timeout',
    ^^^^^^^^^^^^^^^^^
  File "xoscar/core.pyx", line 526, in xoscar.core._BaseActor.__on_receive__
    result = await result
    ^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/projects/testai/.venv/lib/python3.11/site-packages/xinference/core/utils.py", line 45, in wrapped
    ret = await func(*args, **kwargs)
    ^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/projects/testai/.venv/lib/python3.11/site-packages/xinference/core/model.py", line 90, in wrapped_func
    ret = await fn(self, *args, **kwargs)
    ^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/projects/testai/.venv/lib/python3.11/site-packages/xoscar/api.py", line 462, in _wrapper
    r = await func(self, *args, **kwargs)
    ^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/projects/testai/.venv/lib/python3.11/site-packages/xinference/core/model.py", line 528, in chat
    response = await self._call_wrapper_json(
    ^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/projects/testai/.venv/lib/python3.11/site-packages/xinference/core/model.py", line 393, in _call_wrapper_json
    return await self._call_wrapper("json", fn, *args, **kwargs)
    ^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/projects/testai/.venv/lib/python3.11/site-packages/xinference/core/model.py", line 114, in _async_wrapper
    return await fn(*args, **kwargs)
    ^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/projects/testai/.venv/lib/python3.11/site-packages/xinference/core/model.py", line 413, in _call_wrapper
    raise Exception("Parallel generation is not supported by ggml.")
    ^^^^^^^^^^^^^^^^^
Exception: [address=0.0.0.0:43267, pid=33178] Parallel generation is not supported by ggml.
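For reference, the concurrency problem in item 2 can be exercised by sending two chat requests at once against the OpenAI-compatible endpoint; a minimal sketch, assuming the endpoint, API key, and model UID from the commands above (whether the error reproduces depends on timing and version):

# Sketch: fire two chat completions concurrently against the Xinference
# OpenAI-compatible API; with the current stream handling, a second in-flight
# request can trip the "Parallel generation is not supported" error above.
import asyncio

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://127.0.0.1:18888/v1", api_key="sk-testapikey")

async def ask(prompt: str) -> str:
    resp = await client.chat.completions.create(
        model="gemma-2-9b-it",  # the model UID used when launching
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

async def main() -> None:
    results = await asyncio.gather(
        ask("Summarize the Gemma 2 architecture."),
        ask("Write a haiku about GPUs."),
        return_exceptions=True,  # surface the parallel-generation error, if any
    )
    for r in results:
        print(r)

asyncio.run(main())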
qinxuye commented 4 months ago

Do you have any update?

vikrantrathore commented 4 months ago

Do you have any update?

I have submitted pull request #1929, which addresses these changes. The updates work on my system and require CUDA 12.1, since FlashInfer is installed as a wheel built for CUDA 12.1. Additionally, the minimum Python version required for these changes is 3.11.
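Since the PR, as described above, assumes CUDA 12.1 (for the FlashInfer wheel) and Python 3.11+, a quick pre-install sanity check could look like this (a hedged sketch; the constraints are the ones stated in this comment, not read from the PR itself):

# Quick environment check for the prerequisites mentioned above:
# CUDA 12.1 (for the FlashInfer wheel) and Python >= 3.11.
import sys

import torch

assert sys.version_info >= (3, 11), f"Python 3.11+ required, found {sys.version}"
assert torch.version.cuda is not None, "PyTorch was not built with CUDA support"
assert torch.version.cuda.startswith("12.1"), (
    f"FlashInfer wheel expects CUDA 12.1, torch reports CUDA {torch.version.cuda}"
)
print("Environment matches the prerequisites for PR #1929.")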