xorbitsai / inference

Replace OpenAI GPT with another LLM in your app by changing a single line of code. Xinference gives you the freedom to use any LLM you need. With Xinference, you're empowered to run inference with any open-source language models, speech recognition models, and multimodal models, whether in the cloud, on-premises, or even on your laptop.
https://inference.readthedocs.io
Apache License 2.0

Embedding model: overly long input causes CUDA out-of-memory #2220

Closed yushengliao closed 1 day ago

yushengliao commented 2 weeks ago

System Info

CUDA 12.4, CentOS 7.9

Running Xinference with Docker?

Version info

0.14.3

The command used to start Xinference

    version: '3.8'

    services:
      xinference:
        image: xprobe/xinference:v0.14.3
        container_name: xinference
        volumes:

Reproduction

1. Using the bge-m3 model; the issue has not been reproduced with other models.

2. Send a request with payload {"model":"bge-m3","input":[str1,str2,str3,...]}. If any single str is very long, e.g. 8K or 10K characters, GPU memory overflows (a minimal reproduction sketch follows).
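For reference, a minimal reproduction sketch; the host, port, and text length here are assumptions based on a default local deployment:

    import requests

    # Assumption: Xinference is listening on localhost:9997 with bge-m3 launched.
    long_text = "x" * 10_000  # a single ~10K-character string is enough to trigger the OOM
    payload = {"model": "bge-m3", "input": [long_text] * 32}  # many items at once makes it worse
    resp = requests.post("http://localhost:9997/v1/embeddings", json=payload)
    print(resp.status_code)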

        return self._call_impl(*args, **kwargs)
      File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
        return forward_call(*args, **kwargs)
      File "/usr/local/lib/python3.10/dist-packages/sentence_transformers/models/Transformer.py", line 118, in forward
        output_states = self.auto_model(**trans_features, return_dict=False)
      File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
        return self._call_impl(*args, **kwargs)
      File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
        return forward_call(*args, **kwargs)
      File "/usr/local/lib/python3.10/dist-packages/transformers/models/xlm_roberta/modeling_xlm_roberta.py", line 834, in forward
        encoder_outputs = self.encoder(
      File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
        return self._call_impl(*args, **kwargs)
      File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
        return forward_call(*args, **kwargs)
      File "/usr/local/lib/python3.10/dist-packages/transformers/models/xlm_roberta/modeling_xlm_roberta.py", line 522, in forward
        layer_outputs = layer_module(
      File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
        return self._call_impl(*args, **kwargs)
      File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
        return forward_call(*args, **kwargs)
      File "/usr/local/lib/python3.10/dist-packages/transformers/models/xlm_roberta/modeling_xlm_roberta.py", line 411, in forward
        self_attention_outputs = self.attention(
      File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
        return self._call_impl(*args, **kwargs)
      File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
        return forward_call(*args, **kwargs)
      File "/usr/local/lib/python3.10/dist-packages/transformers/models/xlm_roberta/modeling_xlm_roberta.py", line 338, in forward
        self_outputs = self.self(
      File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
        return self._call_impl(*args, **kwargs)
      File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
        return forward_call(*args, **kwargs)
      File "/usr/local/lib/python3.10/dist-packages/transformers/models/xlm_roberta/modeling_xlm_roberta.py", line 227, in forward
        attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))
    torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 6.39 GiB. GPU 0 has a total capacity of 21.98 GiB of which 6.29 GiB is free. Process 3860 has 1.02 GiB memory in use. Process 5179 has 6.97 GiB memory in use. Process 21641 has 2.94 GiB memory in use. Process 24170 has 4.73 GiB memory in use. Of the allocated memory 3.47 GiB is allocated by PyTorch, and 1010.19 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
    2024-09-03 09:33:43,410 xinference.api.restful_api 1 ERROR Remote server 0.0.0.0:39316 closed
    Traceback (most recent call last):
      File "/usr/local/lib/python3.10/dist-packages/xinference/api/restful_api.py", line 1189, in create_embedding
        embedding = await model.create_embedding(body.input, **kwargs)
      File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/context.py", line 230, in send
        result = await self._wait(future, actor_ref.address, send_message)  # type: ignore
      File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/context.py", line 115, in _wait
        return await future
      File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/core.py", line 84, in _listen
        raise ServerClosed(
    xoscar.errors.ServerClosed: Remote server 0.0.0.0:39316 closed
    2024-09-03 09:33:43,716 xinference.core.worker 44 WARNING Process 0.0.0.0:39316 is down.
    2024-09-03 09:33:43,718 xinference.core.worker 44 DEBUG Enter terminate_model, args: (<xinference.core.worker.WorkerActor object at 0x7f6f6d174d60>, 'bge-m3-1-0'), kwargs: {}
    2024-09-03 09:33:43,718 xinference.core.worker 44 DEBUG Destroy model actor failed, model uid: bge-m3-1-0, error: [Errno 111] Connection refused
    2024-09-03 09:33:43,718 xinference.core.worker 44 DEBUG Remove sub pool failed, model uid: bge-m3-1-0, error: '0.0.0.0:39316'
    2024-09-03 09:33:43,718 xinference.core.worker 44 DEBUG Leave terminate_model, elapsed time: 0 s
    2024-09-03 09:33:43,718 xinference.core.worker 44 WARNING Recreating model actor bge-m3-1-0 ...
    2024-09-03 09:33:43,719 xinference.core.worker 44 DEBUG Enter launch_builtin_model, args: (<xinference.core.worker.WorkerActor object at 0x7f6f6d174d60>,), kwargs: {'model_uid': 'bge-m3-1-0', 'model_name': 'bge-m3', 'model_size_in_billions': None, 'model_format': None, 'quantization': None, 'model_engine': None, 'model_type': 'embedding', 'n_gpu': 'auto', 'peft_model_config': None, 'request_limits': None, 'gpu_idx': None, 'download_hub': None, 'model_path': '/root/.xinference/cache/bge-m3'}
    2024-09-03 09:33:43,719 xinference.core.worker 44 DEBUG GPU selected: [0] for model bge-m3-1-0
    2024-09-03 09:33:50,057 xinference.model.embedding.core 44 DEBUG Embedding model bge-m3 found in ModelScope.
    2024-09-03 09:33:51,379 transformers.configuration_utils 1367 INFO loading configuration file /root/.xinference/cache/bge-m3/config.json
    2024-09-03 09:33:51,379 transformers.dynamic_module_utils 1367 INFO Patched resolve_trust_remote_code: (False, '/root/.xinference/cache/bge-m3', True, False) {}
    2024-09-03 09:33:51,380 transformers.configuration_utils 1367 INFO Model config XLMRobertaConfig {
      "_name_or_path": "/root/.xinference/cache/bge-m3",
      "architectures": ["XLMRobertaModel"],
      "attention_probs_dropout_prob": 0.1,
      "bos_token_id": 0,
      "classifier_dropout": null,
      "eos_token_id": 2,
      "hidden_act": "gelu",
      "hidden_dropout_prob": 0.1,
      "hidden_size": 1024,
      "initializer_range": 0.02,
      "intermediate_size": 4096,
      "layer_norm_eps": 1e-05,
      "max_position_embeddings": 8194,
      "model_type": "xlm-roberta",
      "num_attention_heads": 16,
      "num_hidden_layers": 24,
      "output_past": true,
      "pad_token_id": 1,
      "position_embedding_type": "absolute",
      "torch_dtype": "float32",
      "transformers_version": "4.43.4",
      "type_vocab_size": 1,
      "use_cache": true,
      "vocab_size": 250002
    }
    2024-09-03 09:33:51,438 transformers.dynamic_module_utils 1367 INFO Patched resolve_trust_remote_code: (False, '/root/.xinference/cache/bge-m3', True, False) {}
    2024-09-03 09:33:51,441 transformers.modeling_utils 1367 INFO loading weights file /root/.xinference/cache/bge-m3/model.safetensors
    2024-09-03 09:33:51,553 transformers.modeling_utils 1367 INFO All model checkpoint weights were used when initializing XLMRobertaModel.
    2024-09-03 09:33:51,553 transformers.modeling_utils 1367 INFO All the weights of XLMRobertaModel were initialized from the model checkpoint at /root/.xinference/cache/bge-m3. If your task is similar to the task the model of the checkpoint was trained on, you can already use XLMRobertaModel for predictions without further training.
    2024-09-03 09:33:51,557 transformers.dynamic_module_utils 1367 INFO Patched resolve_trust_remote_code: (False, '/root/.xinference/cache/bge-m3', True, False) {}
    2024-09-03 09:33:51,557 transformers.tokenization_utils_base 1367 INFO loading file sentencepiece.bpe.model
    2024-09-03 09:33:51,557 transformers.tokenization_utils_base 1367 INFO loading file tokenizer.json
    2024-09-03 09:33:51,558 transformers.tokenization_utils_base 1367 INFO loading file added_tokens.json
    2024-09-03 09:33:51,558 transformers.tokenization_utils_base 1367 INFO loading file special_tokens_map.json
    2024-09-03 09:33:51,558 transformers.tokenization_utils_base 1367 INFO loading file tokenizer_config.json
    2024-09-03 09:33:53,016 xinference.core.worker 44 DEBUG Leave launch_builtin_model, elapsed time: 9 s
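The failing line (attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))) materializes a [batch, heads, seq, seq] tensor, so memory grows quadratically with input length. A back-of-the-envelope estimate from the XLMRobertaConfig above shows why a single ~8K-token item is already enormous:

    # Rough size of one attention-score tensor for bge-m3
    # (per the config above: 16 attention heads, float32, positions up to ~8K).
    seq_len = 8192
    num_heads = 16
    bytes_per_float = 4  # torch_dtype is float32
    scores_bytes = num_heads * seq_len * seq_len * bytes_per_float
    print(f"{scores_bytes / 2**30:.1f} GiB")  # -> 4.0 GiB for a single sequence

With a few long items padded into one batch, allocations on the scale of the reported 6.39 GiB follow quickly.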


Expected behavior

GPU memory should not overflow.

yushengliao commented 2 weeks ago

To add: after more testing, the cause appears to be passing too many items in the input array at once. When I split the request into batches on my side, the error no longer occurs.

Suggestion: could xinference process the input array in batches when it receives a very large one, or is there some other optimization? (A sketch of the client-side workaround follows.)
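A minimal sketch of such client-side batching; the helper name, batch size, and endpoint are illustrative choices, not xinference functionality:

    import requests

    def embed_in_batches(texts, batch_size=16, url="http://localhost:9997/v1/embeddings"):
        """Split the input list into small batches so each request stays within GPU memory."""
        embeddings = []
        for i in range(0, len(texts), batch_size):
            batch = texts[i : i + batch_size]
            resp = requests.post(url, json={"model": "bge-m3", "input": batch})
            resp.raise_for_status()
            # OpenAI-compatible response shape: {"data": [{"embedding": [...]}, ...]}
            embeddings.extend(item["embedding"] for item in resp.json()["data"])
        return embeddings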

Valdanitooooo commented 2 weeks ago

How are you calling it? If you are using OpenAIEmbeddings, you could try adjusting chunk_size:

    # Assumes the langchain-openai package; in older LangChain versions the
    # import path is langchain.embeddings instead.
    from langchain_openai import OpenAIEmbeddings

    OpenAIEmbeddings(
        openai_api_base="http://xxxx:9997/v1",
        openai_api_key="xxx",
        chunk_size=1000,
        model="xxx",
    )
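In LangChain's OpenAIEmbeddings, chunk_size is the maximum number of texts sent to the embeddings endpoint per request, so it effectively batches a large input array on the client side.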
yushengliao commented 1 week ago

> How are you calling it? If you are using OpenAIEmbeddings, you could try adjusting chunk_size: OpenAIEmbeddings(openai_api_base=..., chunk_size=1000, ...)

I tried it, but there is no chunk_size parameter.

Also, when a request contains a lot of content, xinference allocates a large amount of GPU memory without regard for how much free memory remains, which causes the OOM. Ollama does not behave this way.
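Another client-side mitigation, independent of the server, is to truncate each text to a token budget well below the model's 8194-position limit before sending it. A sketch using the Hugging Face tokenizer for bge-m3 (the 2048-token budget is an illustrative choice):

    from transformers import AutoTokenizer

    # bge-m3 is an XLM-RoBERTa model; attention memory grows quadratically
    # with sequence length, so capping the token count bounds the allocation.
    tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-m3")

    def truncate(text: str, max_tokens: int = 2048) -> str:
        ids = tokenizer.encode(text, truncation=True, max_length=max_tokens)
        return tokenizer.decode(ids, skip_special_tokens=True)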

github-actions[bot] commented 6 days ago

This issue is stale because it has been open for 7 days with no activity.

github-actions[bot] commented 1 day ago

This issue was closed because it has been inactive for 5 days since being marked as stale.

linqingxu commented 1 hour ago

> To add: after more testing, the cause appears to be passing too many items in the input array at once. When I split the request into batches on my side, the error no longer occurs.

How exactly did you implement the batching? I am running into this issue as well.