xorbitsai / inference

Replace OpenAI GPT with another LLM in your app by changing a single line of code. Xinference gives you the freedom to use any LLM you need. With Xinference, you're empowered to run inference with any open-source language models, speech recognition models, and multimodal models, whether in the cloud, on-premises, or even on your laptop.
https://inference.readthedocs.io
Apache License 2.0

RuntimeError: [address=0.0.0.0:43266, pid=897] CUDA error: invalid argument CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1. Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions. #1220

Closed · qingjiaozyn closed this issue 2 weeks ago

qingjiaozyn commented 5 months ago

Xinference is running on a machine with two A800 GPUs (80 GB of VRAM each). Two models are launched: the first, a deepseek model, starts normally. The second is a qwen-14B model; when it is launched on GPU 1, the error below is raised:

```
allel_size': 1, 'block_size': 16, 'swap_space': 4, 'gpu_memory_utilization': 0.9, 'max_num_seqs': 256, 'quantization': None, 'max_model_len': 4096}
INFO 03-29 18:28:18 llm_engine.py:87] Initializing an LLM engine with config: model='/root/.xinference/cache/merge_qwen_ccb-pytorch-14b', tokenizer='/root/.xinference/cache/merge_qwen_ccb-pytorch-14b', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, seed=0)
WARNING 03-29 18:28:18 tokenizer.py:64] Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.
INFO 03-29 18:28:50 llm_engine.py:357] # GPU blocks: 3531, # CPU blocks: 327
2024-03-29 18:28:51,319 xinference.core.worker 75 ERROR Failed to load model merge_qwen_ccb-1-0
Traceback (most recent call last):
  File "/app/xinference/xinference/core/worker.py", line 569, in launch_builtin_model
    await model_ref.load()
  File "/opt/xinference/xinference_venv/lib/python3.10/site-packages/xoscar/backends/context.py", line 227, in send
    return self._process_result_message(result)
  File "/opt/xinference/xinference_venv/lib/python3.10/site-packages/xoscar/backends/context.py", line 102, in _process_result_message
    raise message.as_instanceof_cause()
  File "/opt/xinference/xinference_venv/lib/python3.10/site-packages/xoscar/backends/pool.py", line 659, in send
    result = await self._run_coro(message.message_id, coro)
  File "/opt/xinference/xinference_venv/lib/python3.10/site-packages/xoscar/backends/pool.py", line 370, in _run_coro
    return await coro
  File "/opt/xinference/xinference_venv/lib/python3.10/site-packages/xoscar/api.py", line 384, in on_receive
    return await super().on_receive(message)  # type: ignore
  File "xoscar/core.pyx", line 558, in __on_receive__
    raise ex
  File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.__on_receive__
    async with self._lock:
  File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.__on_receive__
    with debug_async_timeout('actor_lock_timeout',
  File "xoscar/core.pyx", line 524, in xoscar.core._BaseActor.__on_receive__
    result = func(*args, **kwargs)
  File "/app/xinference/xinference/core/model.py", line 239, in load
    self._model.load()
  File "/app/xinference/xinference/model/llm/vllm/core.py", line 147, in load
    self._engine = AsyncLLMEngine.from_engine_args(engine_args)
  File "/opt/xinference/xinference_venv/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 628, in from_engine_args
    engine = cls(parallel_config.worker_use_ray,
  File "/opt/xinference/xinference_venv/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 321, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/opt/xinference/xinference_venv/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 369, in _init_engine
    return engine_class(*args, **kwargs)
  File "/opt/xinference/xinference_venv/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 131, in __init__
    self._init_cache()
  File "/opt/xinference/xinference_venv/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 377, in _init_cache
    self._run_workers("init_cache_engine", cache_config=self.cache_config)
  File "/opt/xinference/xinference_venv/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 1041, in _run_workers
    driver_worker_output = getattr(self.driver_worker,
  File "/opt/xinference/xinference_venv/lib/python3.10/site-packages/vllm/worker/worker.py", line 150, in init_cache_engine
    self.cache_engine = CacheEngine(self.cache_config, self.model_config,
  File "/opt/xinference/xinference_venv/lib/python3.10/site-packages/vllm/worker/cache_engine.py", line 52, in __init__
    self.cpu_cache = self.allocate_cpu_cache()
  File "/opt/xinference/xinference_venv/lib/python3.10/site-packages/vllm/worker/cache_engine.py", line 106, in allocate_cpu_cache
    key_blocks = torch.empty(
RuntimeError: [address=0.0.0.0:43266, pid=897] CUDA error: invalid argument
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
```
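For context, the frame that actually fails is vLLM allocating the CPU (swap) KV cache in `allocate_cpu_cache`, which typically requests pinned host memory and therefore still goes through the CUDA driver even though the tensor lives on the CPU. A minimal sketch to check whether pinned allocation works at all in this environment; the shape and dtype below are illustrative placeholders, not the values vLLM derives from the model config and `swap_space`:

```python
# Hypothetical standalone check, run inside the same container/venv as the worker.
# If pinned host allocation is broken here, it should fail with the same
# "CUDA error: invalid argument" seen in the traceback above.
import torch

shape = (327, 2, 16, 40, 128)  # illustrative only; not the real cache dimensions
try:
    blocks = torch.empty(shape, dtype=torch.bfloat16, device="cpu", pin_memory=True)
    print("pinned CPU allocation OK:", blocks.is_pinned())
except RuntimeError as exc:
    print("pinned CPU allocation failed:", exc)
```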

The RESTful API reports the same error for the launch request:

```
2024-03-29 18:28:51,875 xinference.api.restful_api 8 ERROR [address=0.0.0.0:43266, pid=897] CUDA error: invalid argument
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
Traceback (most recent call last):
  File "/app/xinference/xinference/api/restful_api.py", line 793, in launch_model
    model_uid = await (
  File "/opt/xinference/xinference_venv/lib/python3.10/site-packages/xoscar/backends/context.py", line 227, in send
    return self._process_result_message(result)
  File "/opt/xinference/xinference_venv/lib/python3.10/site-packages/xoscar/backends/context.py", line 102, in _process_result_message
    raise message.as_instanceof_cause()
  File "/opt/xinference/xinference_venv/lib/python3.10/site-packages/xoscar/backends/pool.py", line 659, in send
    result = await self._run_coro(message.message_id, coro)
  File "/opt/xinference/xinference_venv/lib/python3.10/site-packages/xoscar/backends/pool.py", line 370, in _run_coro
    return await coro
  File "/opt/xinference/xinference_venv/lib/python3.10/site-packages/xoscar/api.py", line 384, in on_receive
    return await super().on_receive(message)  # type: ignore
  File "xoscar/core.pyx", line 558, in __on_receive__
    raise ex
  File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.__on_receive__
    async with self._lock:
  File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.__on_receive__
    with debug_async_timeout('actor_lock_timeout',
  File "xoscar/core.pyx", line 526, in xoscar.core._BaseActor.__on_receive__
    result = await result
  File "/app/xinference/xinference/core/supervisor.py", line 803, in launch_builtin_model
    await _launch_model()
  File "/app/xinference/xinference/core/supervisor.py", line 767, in _launch_model
    await _launch_one_model(rep_model_uid)
  File "/app/xinference/xinference/core/supervisor.py", line 748, in _launch_one_model
    await worker_ref.launch_builtin_model(
  File "xoscar/core.pyx", line 284, in pyx_actor_method_wrapper
    async with lock:
  File "xoscar/core.pyx", line 287, in xoscar.core.pyx_actor_method_wrapper
    result = await result
  File "/app/xinference/xinference/core/utils.py", line 45, in wrapped
    ret = await func(*args, **kwargs)
  File "/app/xinference/xinference/core/worker.py", line 569, in launch_builtin_model
    await model_ref.load()
  File "/opt/xinference/xinference_venv/lib/python3.10/site-packages/xoscar/backends/context.py", line 227, in send
    return self._process_result_message(result)
  File "/opt/xinference/xinference_venv/lib/python3.10/site-packages/xoscar/backends/context.py", line 102, in _process_result_message
    raise message.as_instanceof_cause()
  File "/opt/xinference/xinference_venv/lib/python3.10/site-packages/xoscar/backends/pool.py", line 659, in send
    result = await self._run_coro(message.message_id, coro)
  File "/opt/xinference/xinference_venv/lib/python3.10/site-packages/xoscar/backends/pool.py", line 370, in _run_coro
    return await coro
  File "/opt/xinference/xinference_venv/lib/python3.10/site-packages/xoscar/api.py", line 384, in on_receive
    return await super().on_receive(message)  # type: ignore
  File "xoscar/core.pyx", line 558, in __on_receive__
    raise ex
  File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.__on_receive__
    async with self._lock:
  File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.__on_receive__
    with debug_async_timeout('actor_lock_timeout',
  File "xoscar/core.pyx", line 524, in xoscar.core._BaseActor.__on_receive__
    result = func(*args, **kwargs)
  File "/app/xinference/xinference/core/model.py", line 239, in load
    self._model.load()
  File "/app/xinference/xinference/model/llm/vllm/core.py", line 147, in load
    self._engine = AsyncLLMEngine.from_engine_args(engine_args)
  File "/opt/xinference/xinference_venv/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 628, in from_engine_args
    engine = cls(parallel_config.worker_use_ray,
  File "/opt/xinference/xinference_venv/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 321, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/opt/xinference/xinference_venv/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 369, in _init_engine
    return engine_class(*args, **kwargs)
  File "/opt/xinference/xinference_venv/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 131, in __init__
    self._init_cache()
  File "/opt/xinference/xinference_venv/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 377, in _init_cache
    self._run_workers("init_cache_engine", cache_config=self.cache_config)
  File "/opt/xinference/xinference_venv/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 1041, in _run_workers
    driver_worker_output = getattr(self.driver_worker,
  File "/opt/xinference/xinference_venv/lib/python3.10/site-packages/vllm/worker/worker.py", line 150, in init_cache_engine
    self.cache_engine = CacheEngine(self.cache_config, self.model_config,
  File "/opt/xinference/xinference_venv/lib/python3.10/site-packages/vllm/worker/cache_engine.py", line 52, in __init__
    self.cpu_cache = self.allocate_cpu_cache()
  File "/opt/xinference/xinference_venv/lib/python3.10/site-packages/vllm/worker/cache_engine.py", line 106, in allocate_cpu_cache
    key_blocks = torch.empty(
RuntimeError: [address=0.0.0.0:43266, pid=897] CUDA error: invalid argument
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
```

While debugging: `self._lock` is `None`, `inspect.iscoroutinefunction(fn)` is `True`, and `inspect.isgenerator(ret)` is `False`.
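As the error text itself suggests, a trustworthy stack trace needs `CUDA_LAUNCH_BLOCKING=1` set before CUDA is initialized in the worker process. A sketch of what that looks like from Python; in this Docker deployment the variables would instead be passed as container environment variables, and `TORCH_USE_CUDA_DSA` only takes effect in a PyTorch build compiled with it:

```python
# Assumed debugging setup, not part of the original report: the variables must be
# set before torch initializes CUDA, otherwise they are ignored.
import os

os.environ["CUDA_LAUNCH_BLOCKING"] = "1"   # report kernel errors synchronously, at the real call site
os.environ["CUDA_VISIBLE_DEVICES"] = "1"   # restrict this process to the card the qwen-14B model targets

import torch  # import only after the variables are set

print(torch.cuda.device_count())        # expected 1: physical card 1 is now device 0 in this process
print(torch.cuda.get_device_name(0))
```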

mujin2 commented 4 months ago

@qingjiaozyn Could you provide the complete logs, including the logs for the first deepseek model?

qinxuye commented 4 months ago

Qwen 14B does not support running across two cards; this is a limitation of the model itself. I heard the latest Qwen 1.5 14B has recently fixed this, but you need to download the latest model files.
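For reference, a sketch of relaunching the custom model on a single card through the Python client once the updated model files are in place. The endpoint URL is a placeholder, the `n_gpu` argument depends on the Xinference version in use (check your version's `launch_model` signature), and `merge_qwen_ccb` is inferred from the registered model name in the logs above:

```python
# Sketch only: launch the custom qwen model on one GPU via the Python client.
from xinference.client import Client

client = Client("http://127.0.0.1:9997")  # adjust to your supervisor endpoint
model_uid = client.launch_model(
    model_name="merge_qwen_ccb",      # custom model name taken from the logs
    model_format="pytorch",
    model_size_in_billions=14,
    n_gpu=1,                          # keep the whole model on a single card
)
print("launched:", model_uid)
```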

github-actions[bot] commented 3 weeks ago

This issue is stale because it has been open for 7 days with no activity.

github-actions[bot] commented 2 weeks ago

This issue was closed because it has been inactive for 5 days since being marked as stale.