xorbitsai / inference

Replace OpenAI GPT with another LLM in your app by changing a single line of code. Xinference gives you the freedom to use any LLM you need. With Xinference, you're empowered to run inference with any open-source language models, speech recognition models, and multimodal models, whether in the cloud, on-premises, or even on your laptop.
https://inference.readthedocs.io
Apache License 2.0

xinference launch does not load the model onto the GPU #1817

Closed zhaozhizhuo closed 2 months ago

zhaozhizhuo commented 2 months ago

xinference register --model-type LLM --file Qwen1.5-7B-Chat.json --persist
xinference launch --model-name qwen0.5b-langchain --model-format pytorch --model-engine Transformers

After running these commands, the model shows up normally on the localhost:9997 web page and chat conversations work, but the model is not loaded onto a GPU for inference.
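For reference, the same registered model can also be launched from Python. The following is a rough, untested sketch; it assumes Xinference is running at http://localhost:9997 and that the client accepts the n_gpu / gpu_idx arguments (they mirror the worker-side kwargs visible in the logs and may differ between xinference versions).

# Rough sketch (assumptions: Xinference at localhost:9997, the custom model is
# already registered; argument names mirror the worker-side kwargs and may
# vary between xinference versions).
from xinference.client import Client

client = Client("http://localhost:9997")
model_uid = client.launch_model(
    model_name="qwen2-7b-langchain",
    model_engine="Transformers",
    model_format="pytorch",
    quantization="none",
    n_gpu="auto",   # number of GPUs to use, or "auto"
    gpu_idx=0,      # pin to a specific GPU index if automatic placement fails
)
print(model_uid)

If placement succeeds, nvidia-smi should show memory allocated on the selected card.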

ChengjieLi28 commented 2 months ago

@zhaozhizhuo Please share some more information: what does the registered JSON look like, and what do the logs show during the launch?

zhaozhizhuo commented 2 months ago

The registered JSON is:

{
  "version": 1,
  "context_length": 2048,
  "model_name": "qwen2-7b-langchain",
  "model_lang": ["en", "zh"],
  "model_ability": ["chat", "tools"],
  "model_family": "qwen2-instruct",
  "model_specs": [
    {
      "model_format": "pytorch",
      "model_size_in_billions": 7,
      "quantizations": ["4-bit", "8-bit", "none"],
      "model_id": "Qwen/Qwen2-7B-Instruct",
      "model_uri": "/copydata2/zhaozhizhuo/qwen2/"
    }
  ]
}

and xinference.log shows:

2024-07-09 09:54:41,139 xinference.core.supervisor 2412279 INFO Xinference supervisor 0.0.0.0:21093 started
2024-07-09 09:54:41,388 xinference.core.worker 2412279 INFO Starting metrics export server at 0.0.0.0:None
2024-07-09 09:54:41,391 xinference.core.worker 2412279 INFO Checking metrics export server...
2024-07-09 09:54:44,220 xinference.core.worker 2412279 INFO Metrics server is started at: http://0.0.0.0:46071
2024-07-09 09:54:44,221 xinference.core.worker 2412279 INFO Xinference worker 0.0.0.0:21093 started
2024-07-09 09:54:44,222 xinference.core.worker 2412279 INFO Purge cache directory: /copydata2/zhaozhizhuo/.xinference/cache
2024-07-09 09:54:50,426 xinference.api.restful_api 2411620 INFO Starting Xinference at endpoint: http://0.0.0.0:9997
2024-07-09 09:54:50,608 uvicorn.error 2411620 INFO Started server process [2411620]
2024-07-09 09:54:50,608 uvicorn.error 2411620 INFO Waiting for application startup.
2024-07-09 09:54:50,608 uvicorn.error 2411620 INFO Application startup complete.
2024-07-09 09:54:50,609 uvicorn.error 2411620 INFO Uvicorn running on http://0.0.0.0:9997 (Press CTRL+C to quit)
2024-07-09 09:56:06,337 uvicorn.access 2411620 INFO 127.0.0.1:50482 - "GET /v1/cluster/auth HTTP/1.1" 200
2024-07-09 09:56:09,254 xinference.model.llm.llm_family 2412279 INFO Caching from URI: /copydata2/zhaozhizhuo/qwen2/
2024-07-09 09:56:09,255 xinference.model.llm.llm_family 2412279 INFO Cache /copydata2/zhaozhizhuo/qwen2 exists
2024-07-09 09:56:09,462 transformers.tokenization_utils_base 2414138 INFO loading file vocab.json
2024-07-09 09:56:09,462 transformers.tokenization_utils_base 2414138 INFO loading file merges.txt
2024-07-09 09:56:09,463 transformers.tokenization_utils_base 2414138 INFO loading file tokenizer.json
2024-07-09 09:56:09,463 transformers.tokenization_utils_base 2414138 INFO loading file added_tokens.json
2024-07-09 09:56:09,463 transformers.tokenization_utils_base 2414138 INFO loading file special_tokens_map.json
2024-07-09 09:56:09,463 transformers.tokenization_utils_base 2414138 INFO loading file tokenizer_config.json
2024-07-09 09:56:09,740 transformers.tokenization_utils_base 2414138 WARNING Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
2024-07-09 09:56:09,741 transformers.configuration_utils 2414138 INFO loading configuration file /copydata2/zhaozhizhuo/qwen2/config.json 2024-07-09 09:56:09,760 transformers.configuration_utils 2414138 INFO Model config Qwen2Config { "_name_or_path": "/copydata2/zhaozhizhuo/qwen2", "architectures": [ "Qwen2ForCausalLM" ], "attention_dropout": 0.0, "bos_token_id": 151643, "eos_token_id": 151643, "hidden_act": "silu", "hidden_size": 3584, "initializer_range": 0.02, "intermediate_size": 18944, "max_position_embeddings": 131072, "max_window_layers": 28, "model_type": "qwen2", "num_attention_heads": 28, "num_hidden_layers": 28, "num_key_value_heads": 4, "rms_norm_eps": 1e-06, "rope_theta": 1000000.0, "sliding_window": 131072, "tie_word_embeddings": false, "torch_dtype": "float32", "transformers_version": "4.41.2", "use_cache": true, "use_sliding_window": false, "vocab_size": 152064 }

2024-07-09 09:56:11,142 transformers.modeling_utils 2414138 INFO loading weights file /copydata2/zhaozhizhuo/qwen2/model.safetensors.index.json 2024-07-09 09:56:11,143 transformers.modeling_utils 2414138 INFO Instantiating Qwen2ForCausalLM model under default dtype torch.float32. 2024-07-09 09:56:11,145 transformers.generation.configuration_utils 2414138 INFO Generate config GenerationConfig { "bos_token_id": 151643, "eos_token_id": 151643 }

2024-07-09 09:56:14,950 transformers.modeling_utils 2414138 INFO All model checkpoint weights were used when initializing Qwen2ForCausalLM.

2024-07-09 09:56:14,950 transformers.modeling_utils 2414138 INFO All the weights of Qwen2ForCausalLM were initialized from the model checkpoint at /copydata2/zhaozhizhuo/qwen2. If your task is similar to the task the model of the checkpoint was trained on, you can already use Qwen2ForCausalLM for predictions without further training. 2024-07-09 09:56:14,953 transformers.generation.configuration_utils 2414138 INFO loading configuration file /copydata2/zhaozhizhuo/qwen2/generation_config.json 2024-07-09 09:56:14,954 transformers.generation.configuration_utils 2414138 INFO Generate config GenerationConfig { "bos_token_id": 151643, "eos_token_id": 151643, "max_new_tokens": 2048 }

2024-07-09 09:56:14,961 uvicorn.access 2411620 INFO 127.0.0.1:50496 - "POST /v1/models HTTP/1.1" 200 2024-07-09 10:27:48,044 uvicorn.access 2411620 INFO 127.0.0.1:53936 - "GET /v1/cluster/auth HTTP/1.1" 200 2024-07-09 10:27:50,915 xinference.model.utils 2412279 INFO Model caching from URI: /copydata2/zhaozhizhuo/bgemodel/ 2024-07-09 10:27:50,915 xinference.model.utils 2412279 INFO cache /copydata2/zhaozhizhuo/bgemodel exists 2024-07-09 10:28:00,152 transformers.configuration_utils 2452991 INFO loading configuration file /copydata2/zhaozhizhuo/bgemodel/config.json 2024-07-09 10:28:00,152 transformers.dynamic_module_utils 2452991 INFO Patched resolve_trust_remote_code: (False, '/copydata2/zhaozhizhuo/bgemodel', True, False) {} 2024-07-09 10:28:00,154 transformers.configuration_utils 2452991 INFO Model config BertConfig { "_name_or_path": "/copydata2/zhaozhizhuo/bgemodel", "architectures": [ "BertModel" ], "attention_probs_dropout_prob": 0.1, "bos_token_id": 0, "classifier_dropout": null, "directionality": "bidi", "eos_token_id": 2, "hidden_act": "gelu", "hidden_dropout_prob": 0.1, "hidden_size": 1024, "id2label": { "0": "LABEL_0" }, "initializer_range": 0.02, "intermediate_size": 4096, "label2id": { "LABEL_0": 0 }, "layer_norm_eps": 1e-12, "max_position_embeddings": 512, "model_type": "bert", "num_attention_heads": 16, "num_hidden_layers": 24, "output_past": true, "pad_token_id": 0, "pooler_fc_size": 768, "pooler_num_attention_heads": 12, "pooler_num_fc_layers": 3, "pooler_size_per_head": 128, "pooler_type": "first_token_transform", "position_embedding_type": "absolute", "torch_dtype": "float32", "transformers_version": "4.41.2", "type_vocab_size": 2, "use_cache": true, "vocab_size": 21128 }

2024-07-09 10:28:01,212 transformers.dynamic_module_utils 2452991 INFO Patched resolve_trust_remote_code: (False, '/copydata2/zhaozhizhuo/bgemodel', True, False) {} 2024-07-09 10:28:01,235 transformers.modeling_utils 2452991 INFO loading weights file /copydata2/zhaozhizhuo/bgemodel/pytorch_model.bin 2024-07-09 10:28:08,973 transformers.modeling_utils 2452991 INFO All model checkpoint weights were used when initializing BertModel.

2024-07-09 10:28:08,974 transformers.modeling_utils 2452991 INFO All the weights of BertModel were initialized from the model checkpoint at /copydata2/zhaozhizhuo/bgemodel. If your task is similar to the task the model of the checkpoint was trained on, you can already use BertModel for predictions without further training. 2024-07-09 10:28:09,065 transformers.dynamic_module_utils 2452991 INFO Patched resolve_trust_remote_code: (False, '/copydata2/zhaozhizhuo/bgemodel', True, False) {} 2024-07-09 10:28:09,066 transformers.tokenization_utils_base 2452991 INFO loading file vocab.txt 2024-07-09 10:28:09,066 transformers.tokenization_utils_base 2452991 INFO loading file tokenizer.json 2024-07-09 10:28:09,066 transformers.tokenization_utils_base 2452991 INFO loading file added_tokens.json 2024-07-09 10:28:09,066 transformers.tokenization_utils_base 2452991 INFO loading file special_tokens_map.json 2024-07-09 10:28:09,066 transformers.tokenization_utils_base 2452991 INFO loading file tokenizer_config.json 2024-07-09 10:28:09,136 uvicorn.access 2411620 INFO 127.0.0.1:53944 - "POST /v1/models HTTP/1.1" 200 2024-07-09 10:30:32,739 uvicorn.access 2411620 INFO 127.0.0.1:55422 - "POST /v1/chat/completions HTTP/1.1" 200 2024-07-09 10:31:26,890 xinference.model.llm.pytorch.utils 2414138 INFO Average generation speed: 1.22 tokens/s. 2024-07-09 10:32:59,746 uvicorn.access 2411620 INFO 127.0.0.1:54866 - "POST /v1/chat/completions HTTP/1.1" 200 2024-07-09 10:33:19,966 xinference.model.llm.pytorch.utils 2414138 INFO Average generation speed: 1.34 tokens/s. 2024-07-09 10:56:17,257 uvicorn.access 2411620 INFO 127.0.0.1:54770 - "POST /v1/chat/completions HTTP/1.1" 200 2024-07-09 10:56:56,609 xinference.model.llm.pytorch.utils 2414138 INFO Average generation speed: 1.60 tokens/s.

ChengjieLi28 commented 2 months ago

@zhaozhizhuo Please enable debug logging and post the logs again.

zhaozhizhuo commented 2 months ago

How do I view the debug logs? Do I run xinference-local --log-level debug and then load the model again?

ChengjieLi28 commented 2 months ago

How do I view the debug logs? Do I run xinference-local --log-level debug and then load the model again?

Yes, restart xinference that way and then launch the model once more.

zhaozhizhuo commented 2 months ago

Hello, here are the logs after enabling debug:

2024-07-09 11:20:08,972 xinference.core.supervisor 2506636 INFO Xinference supervisor 127.0.0.1:26637 started
2024-07-09 11:20:09,013 xinference.core.worker 2506636 INFO Starting metrics export server at 127.0.0.1:None
2024-07-09 11:20:09,016 xinference.core.worker 2506636 INFO Checking metrics export server...
2024-07-09 11:20:11,620 xinference.core.worker 2506636 INFO Metrics server is started at: http://127.0.0.1:37473
2024-07-09 11:20:11,621 xinference.core.supervisor 2506636 DEBUG Enter add_worker, args: (<xinference.core.supervisor.SupervisorActor object at 0x7f41efd16a20>, '127.0.0.1:26637'), kwargs: {}
2024-07-09 11:20:11,621 xinference.core.supervisor 2506636 DEBUG Worker 127.0.0.1:26637 has been added successfully
2024-07-09 11:20:11,622 xinference.core.supervisor 2506636 DEBUG Leave add_worker, elapsed time: 0 s
2024-07-09 11:20:11,623 xinference.core.worker 2506636 INFO Xinference worker 127.0.0.1:26637 started
2024-07-09 11:20:11,625 xinference.core.worker 2506636 INFO Purge cache directory: /copydata2/zhaozhizhuo/.xinference/cache
2024-07-09 11:20:11,776 xinference.core.supervisor 2506636 DEBUG Worker 127.0.0.1:26637 resources: {'cpu': ResourceStatus(usage=0.0, total=56, memory_used=31152201728, memory_available=236104704000, memory_total=270351077376), 'gpu-0': GPUStatus(mem_total=25447170048, mem_free=25443368960, mem_used=3801088), 'gpu-1': GPUStatus(mem_total=25447170048, mem_free=10455023616, mem_used=14992146432), 'gpu-2': GPUStatus(mem_total=25447170048, mem_free=22837657600, mem_used=2609512448), 'gpu-3': GPUStatus(mem_total=25447170048, mem_free=23250796544, mem_used=2196373504), 'gpu-4': GPUStatus(mem_total=25447170048, mem_free=25443368960, mem_used=3801088), 'gpu-5': GPUStatus(mem_total=25447170048, mem_free=25443368960, mem_used=3801088), 'gpu-6': GPUStatus(mem_total=25447170048, mem_free=25443368960, mem_used=3801088), 'gpu-7': GPUStatus(mem_total=25447170048, mem_free=25443368960, mem_used=3801088), 'gpu-8': GPUStatus(mem_total=25447170048, mem_free=10599727104, mem_used=14847442944), 'gpu-9': GPUStatus(mem_total=25447170048, mem_free=25443368960, mem_used=3801088)}
2024-07-09 11:20:13,970 xinference.core.supervisor 2506636 DEBUG Enter get_status, args: (<xinference.core.supervisor.SupervisorActor object at 0x7f41efd16a20>,), kwargs: {}
2024-07-09 11:20:13,971 xinference.core.supervisor 2506636 DEBUG Leave get_status, elapsed time: 0 s
2024-07-09 11:20:15,796 xinference.api.restful_api 2506450 INFO Starting Xinference at endpoint: http://127.0.0.1:9997
2024-07-09 11:20:15,980 uvicorn.error 2506450 INFO Started server process [2506450]
2024-07-09 11:20:15,980 uvicorn.error 2506450 INFO Waiting for application startup.
2024-07-09 11:20:15,980 uvicorn.error 2506450 INFO Application startup complete.
2024-07-09 11:20:15,981 uvicorn.error 2506450 INFO Uvicorn running on http://127.0.0.1:9997 (Press CTRL+C to quit) 2024-07-09 11:20:48,245 uvicorn.access 2506450 INFO 127.0.0.1:59930 - "GET /v1/cluster/auth HTTP/1.1" 200 2024-07-09 11:20:48,252 xinference.core.supervisor 2506636 DEBUG Enter launch_builtin_model, model_uid: qwen2-7b-langchain, model_name: qwen2-7b-langchain, model_size: , model_format: pytorch, quantization: None, replica: 1, kwargs: {'trust_remote_code': True} 2024-07-09 11:20:48,253 xinference.core.worker 2506636 DEBUG Enter get_model_count, args: (<xinference.core.worker.WorkerActor object at 0x7f41efd96e80>,), kwargs: {} 2024-07-09 11:20:48,253 xinference.core.worker 2506636 DEBUG Leave get_model_count, elapsed time: 0 s 2024-07-09 11:20:48,254 xinference.core.worker 2506636 DEBUG Enter launch_builtin_model, args: (<xinference.core.worker.WorkerActor object at 0x7f41efd96e80>,), kwargs: {'model_uid': 'qwen2-7b-langchain-1-0', 'model_name': 'qwen2-7b-langchain', 'model_size_in_billions': None, 'model_format': 'pytorch', 'quantization': None, 'model_engine': 'Transformers', 'model_type': 'LLM', 'n_gpu': 'auto', 'request_limits': None, 'peft_model_config': None, 'gpu_idx': None, 'trust_remote_code': True} 2024-07-09 11:20:51,315 xinference.model.llm.core 2506636 DEBUG Launching qwen2-7b-langchain-1-0 with PytorchChatModel 2024-07-09 11:20:51,316 xinference.model.llm.llm_family 2506636 INFO Caching from URI: /copydata2/zhaozhizhuo/qwen2/ 2024-07-09 11:20:51,316 xinference.model.llm.llm_family 2506636 INFO Cache /copydata2/zhaozhizhuo/qwen2 exists 2024-07-09 11:20:51,469 transformers.tokenization_utils_base 2507562 INFO loading file vocab.json 2024-07-09 11:20:51,469 transformers.tokenization_utils_base 2507562 INFO loading file merges.txt 2024-07-09 11:20:51,469 transformers.tokenization_utils_base 2507562 INFO loading file tokenizer.json 2024-07-09 11:20:51,470 transformers.tokenization_utils_base 2507562 INFO loading file added_tokens.json 2024-07-09 11:20:51,470 transformers.tokenization_utils_base 2507562 INFO loading file special_tokens_map.json 2024-07-09 11:20:51,470 transformers.tokenization_utils_base 2507562 INFO loading file tokenizer_config.json 2024-07-09 11:20:51,689 transformers.tokenization_utils_base 2507562 WARNING Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. 2024-07-09 11:20:51,712 transformers.configuration_utils 2507562 INFO loading configuration file /copydata2/zhaozhizhuo/qwen2/config.json 2024-07-09 11:20:51,713 transformers.configuration_utils 2507562 INFO Model config Qwen2Config { "_name_or_path": "/copydata2/zhaozhizhuo/qwen2", "architectures": [ "Qwen2ForCausalLM" ], "attention_dropout": 0.0, "bos_token_id": 151643, "eos_token_id": 151643, "hidden_act": "silu", "hidden_size": 3584, "initializer_range": 0.02, "intermediate_size": 18944, "max_position_embeddings": 131072, "max_window_layers": 28, "model_type": "qwen2", "num_attention_heads": 28, "num_hidden_layers": 28, "num_key_value_heads": 4, "rms_norm_eps": 1e-06, "rope_theta": 1000000.0, "sliding_window": 131072, "tie_word_embeddings": false, "torch_dtype": "float32", "transformers_version": "4.41.2", "use_cache": true, "use_sliding_window": false, "vocab_size": 152064 }

2024-07-09 11:20:51,877 transformers.modeling_utils 2507562 INFO loading weights file /copydata2/zhaozhizhuo/qwen2/model.safetensors.index.json 2024-07-09 11:20:51,877 transformers.modeling_utils 2507562 INFO Instantiating Qwen2ForCausalLM model under default dtype torch.float32. 2024-07-09 11:20:51,878 transformers.generation.configuration_utils 2507562 INFO Generate config GenerationConfig { "bos_token_id": 151643, "eos_token_id": 151643 }

2024-07-09 11:20:55,552 transformers.modeling_utils 2507562 INFO All model checkpoint weights were used when initializing Qwen2ForCausalLM.

2024-07-09 11:20:55,552 transformers.modeling_utils 2507562 INFO All the weights of Qwen2ForCausalLM were initialized from the model checkpoint at /copydata2/zhaozhizhuo/qwen2. If your task is similar to the task the model of the checkpoint was trained on, you can already use Qwen2ForCausalLM for predictions without further training. 2024-07-09 11:20:55,555 transformers.generation.configuration_utils 2507562 INFO loading configuration file /copydata2/zhaozhizhuo/qwen2/generation_config.json 2024-07-09 11:20:55,555 transformers.generation.configuration_utils 2507562 INFO Generate config GenerationConfig { "bos_token_id": 151643, "eos_token_id": 151643, "max_new_tokens": 2048 }

2024-07-09 11:20:55,561 xinference.model.llm.pytorch.core 2507562 DEBUG Model Memory: 34220569600 2024-07-09 11:20:55,562 xinference.core.worker 2506636 DEBUG Leave launch_builtin_model, elapsed time: 7 s 2024-07-09 11:20:55,564 uvicorn.access 2506450 INFO 127.0.0.1:59932 - "POST /v1/models HTTP/1.1" 200 2024-07-09 11:21:06,976 uvicorn.access 2506450 INFO 127.0.0.1:60038 - "GET /v1/cluster/auth HTTP/1.1" 200 2024-07-09 11:21:06,982 xinference.core.supervisor 2506636 DEBUG Enter launch_builtin_model, model_uid: bge-embedding-model, model_name: bge-embedding-model, model_size: , model_format: None, quantization: None, replica: 1, kwargs: {'trust_remote_code': True} 2024-07-09 11:21:06,984 xinference.core.worker 2506636 DEBUG Enter get_model_count, args: (<xinference.core.worker.WorkerActor object at 0x7f41efd96e80>,), kwargs: {} 2024-07-09 11:21:06,985 xinference.core.worker 2506636 DEBUG Leave get_model_count, elapsed time: 0 s 2024-07-09 11:21:06,985 xinference.core.worker 2506636 DEBUG Enter launch_builtin_model, args: (<xinference.core.worker.WorkerActor object at 0x7f41efd96e80>,), kwargs: {'model_uid': 'bge-embedding-model-1-0', 'model_name': 'bge-embedding-model', 'model_size_in_billions': None, 'model_format': None, 'quantization': None, 'model_engine': None, 'model_type': 'embedding', 'n_gpu': 'auto', 'request_limits': None, 'peft_model_config': None, 'gpu_idx': None, 'trust_remote_code': True} 2024-07-09 11:21:09,958 xinference.model.utils 2506636 INFO Model caching from URI: /copydata2/zhaozhizhuo/bgemodel/ 2024-07-09 11:21:09,958 xinference.model.utils 2506636 INFO cache /copydata2/zhaozhizhuo/bgemodel exists 2024-07-09 11:21:11,280 transformers.configuration_utils 2508196 INFO loading configuration file /copydata2/zhaozhizhuo/bgemodel/config.json 2024-07-09 11:21:11,280 transformers.dynamic_module_utils 2508196 INFO Patched resolve_trust_remote_code: (False, '/copydata2/zhaozhizhuo/bgemodel', True, False) {} 2024-07-09 11:21:11,281 transformers.configuration_utils 2508196 INFO Model config BertConfig { "_name_or_path": "/copydata2/zhaozhizhuo/bgemodel", "architectures": [ "BertModel" ], "attention_probs_dropout_prob": 0.1, "bos_token_id": 0, "classifier_dropout": null, "directionality": "bidi", "eos_token_id": 2, "hidden_act": "gelu", "hidden_dropout_prob": 0.1, "hidden_size": 1024, "id2label": { "0": "LABEL_0" }, "initializer_range": 0.02, "intermediate_size": 4096, "label2id": { "LABEL_0": 0 }, "layer_norm_eps": 1e-12, "max_position_embeddings": 512, "model_type": "bert", "num_attention_heads": 16, "num_hidden_layers": 24, "output_past": true, "pad_token_id": 0, "pooler_fc_size": 768, "pooler_num_attention_heads": 12, "pooler_num_fc_layers": 3, "pooler_size_per_head": 128, "pooler_type": "first_token_transform", "position_embedding_type": "absolute", "torch_dtype": "float32", "transformers_version": "4.41.2", "type_vocab_size": 2, "use_cache": true, "vocab_size": 21128 }

2024-07-09 11:21:11,352 transformers.dynamic_module_utils 2508196 INFO Patched resolve_trust_remote_code: (False, '/copydata2/zhaozhizhuo/bgemodel', True, False) {} 2024-07-09 11:21:11,356 transformers.modeling_utils 2508196 INFO loading weights file /copydata2/zhaozhizhuo/bgemodel/pytorch_model.bin 2024-07-09 11:21:11,657 transformers.modeling_utils 2508196 INFO All model checkpoint weights were used when initializing BertModel.

2024-07-09 11:21:11,657 transformers.modeling_utils 2508196 INFO All the weights of BertModel were initialized from the model checkpoint at /copydata2/zhaozhizhuo/bgemodel. If your task is similar to the task the model of the checkpoint was trained on, you can already use BertModel for predictions without further training. 2024-07-09 11:21:11,729 transformers.dynamic_module_utils 2508196 INFO Patched resolve_trust_remote_code: (False, '/copydata2/zhaozhizhuo/bgemodel', True, False) {} 2024-07-09 11:21:11,730 transformers.tokenization_utils_base 2508196 INFO loading file vocab.txt 2024-07-09 11:21:11,730 transformers.tokenization_utils_base 2508196 INFO loading file tokenizer.json 2024-07-09 11:21:11,730 transformers.tokenization_utils_base 2508196 INFO loading file added_tokens.json 2024-07-09 11:21:11,730 transformers.tokenization_utils_base 2508196 INFO loading file special_tokens_map.json 2024-07-09 11:21:11,730 transformers.tokenization_utils_base 2508196 INFO loading file tokenizer_config.json 2024-07-09 11:21:11,759 xinference.core.worker 2506636 DEBUG Leave launch_builtin_model, elapsed time: 4 s 2024-07-09 11:21:11,760 uvicorn.access 2506450 INFO 127.0.0.1:60044 - "POST /v1/models HTTP/1.1" 200

ChengjieLi28 commented 2 months ago

(Quoting the debug log from the previous comment; the relevant excerpt follows.)

2024-07-09 11:20:48,252 xinference.core.supervisor 2506636 DEBUG Enter launch_builtin_model, model_uid: qwen2-7b-langchain, model_name: qwen2-7b-langchain, model_size: , model_format: pytorch, quantization: None, replica: 1, kwargs: {'trust_remote_code': True}
2024-07-09 11:20:48,253 xinference.core.worker 2506636 DEBUG Enter get_model_count, args: (<xinference.core.worker.WorkerActor object at 0x7f41efd96e80>,), kwargs: {}
2024-07-09 11:20:48,253 xinference.core.worker 2506636 DEBUG Leave get_model_count, elapsed time: 0 s
2024-07-09 11:20:48,254 xinference.core.worker 2506636 DEBUG Enter launch_builtin_model, args: (<xinference.core.worker.WorkerActor object at 0x7f41efd96e80>,), kwargs: {'model_uid': 'qwen2-7b-langchain-1-0', 'model_name': 'qwen2-7b-langchain', 'model_size_in_billions': None, 'model_format': 'pytorch', 'quantization': None, 'model_engine': 'Transformers', 'model_type': 'LLM', 'n_gpu': 'auto', 'request_limits': None, 'peft_model_config': None, 'gpu_idx': None, 'trust_remote_code': True}
2024-07-09 11:20:51,315 xinference.model.llm.core 2506636 DEBUG Launching qwen2-7b-langchain-1-0 with PytorchChatModel

Part of the log is missing here: the n_gpu parameter passed to the worker is auto, but the log lines that should start with "GPU selected" never appear. The same goes for the embedding model below. It looks as if the installed project files may have been modified. How did you install xinference, and how did you start it?

zhaozhizhuo commented 2 months ago

xinference was installed with pip install "xinference[transformers]" -i https://pypi.tuna.tsinghua.edu.cn/simple, and started with xinference-local -H 0.0.0.0.

ChengjieLi28 commented 2 months ago

Does torch work normally with CUDA for you?

import torch
torch.cuda.is_available()
torch.cuda.device_count()

I don't know what is special about your environment; I cannot reproduce this, and the logs that should be printed during launch are not there. While the model is in use, does nvidia-smi show any GPU memory usage?
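A quick way to capture that from Python while the model is loaded is to query nvidia-smi directly; this is only an illustrative sketch and assumes nvidia-smi is on the PATH.

# Illustrative sketch: print per-GPU memory usage (index, used, total)
# using standard nvidia-smi query flags.
import subprocess

result = subprocess.run(
    ["nvidia-smi", "--query-gpu=index,memory.used,memory.total", "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
)
print(result.stdout)  # e.g. "0, 312 MiB, 24268 MiB" per line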

ChengjieLi28 commented 2 months ago

If nothing looks wrong there, I suggest reinstalling xinference in a brand-new conda environment and trying again.

zhaozhizhuo commented 2 months ago

I have already tried deploying in a brand-new conda environment, but unfortunately the problem still exists. I am using Python 3.10 and my CUDA version is 11.4. Are there any other packages that need specific versions?

ChengjieLi28 commented 2 months ago

I have already tried deploying in a brand-new conda environment, but unfortunately the problem still exists. I am using Python 3.10 and my CUDA version is 11.4. Are there any other packages that need specific versions?

I suggest upgrading CUDA; 11.4 is too old. Can you run the following three lines and get results?

import torch
torch.cuda.is_available()
torch.cuda.device_count()

zhaozhizhuo commented 2 months ago

Yes, they run and produce output. Is there any way other than upgrading CUDA? The server is shared rather than mine, and upgrading would cause a lot of knock-on problems.

Python 3.10.14 (main, May 6 2024, 19:42:50) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
/copydata2/zhaozhizhuo/anaconda3/envs/langchain-xinference/lib/python3.10/site-packages/torch/cuda/__init__.py:141: UserWarning: CUDA initialization: The NVIDIA driver on your system is too old (found version 11040). Please update your GPU driver by downloading and installing a new version from the URL: http://www.nvidia.com/Download/index.aspx Alternatively, go to: https://pytorch.org to install a PyTorch version that has been compiled with your version of the CUDA driver. (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
  return torch._C._cuda_getDeviceCount() > 0
False
>>> torch.cuda.device_count()
10

ChengjieLi28 commented 2 months ago

(Quoting the torch check output from the previous comment.)

torch.cuda.is_available() returns False, which means torch itself cannot use the GPU; at that point this is no longer an xinference issue. CUDA 11.4 really is too old, and the torch version that xinference requires is not compatible with that CUDA/driver version.
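For anyone hitting the same warning, a minimal sketch (nothing xinference-specific) for comparing the CUDA version torch was built against with what the driver can actually initialize:

# Minimal sketch: show the CUDA toolkit version this torch wheel was compiled
# against versus whether the installed driver can initialize CUDA at all.
import torch

print("torch:", torch.__version__, "built for CUDA:", torch.version.cuda)  # e.g. "12.1"
# False here, despite GPUs being visible, usually means the driver is older
# than what the wheel's CUDA runtime requires.
print("cuda available:", torch.cuda.is_available())

If torch.version.cuda is newer than what the 11.4-era driver supports, the options are the ones the warning already lists: upgrade the driver, or install a torch wheel built for the older CUDA runtime.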

zhaozhizhuo commented 2 months ago

OK, thanks a lot!

pimooook commented 2 weeks ago

Multiple CUDA versions can coexist on the same machine. As long as your NVIDIA driver is at least version 550, you can download CUDA 12.4 yourself and just point your environment variables at it; this will not affect other users.