OpenAI-style API for open large language models: use open LLMs just as you would ChatGPT. Supports LLaMA, LLaMA-2, BLOOM, Falcon, Baichuan, Qwen, Xverse, SqlCoder, CodeLLaMA, ChatGLM, ChatGLM2, ChatGLM3, etc. | A unified backend interface for open-source large language models
提交前必须检查以下项目 | The following items must be checked before submission
[X] 请确保使用的是仓库最新代码(git pull),一些问题已被解决和修复。 | Make sure you are using the latest code from the repository (git pull), some issues have already been addressed and fixed.
[X] 我已阅读项目文档和FAQ章节并且已在Issue中对问题进行了搜索,没有找到相似问题和解决方案 | I have searched the existing issues / discussions
问题类型 | Type of problem
其他问题 | Other issues
操作系统 | Operating system
None
详细描述问题 | Detailed description of the problem
Thanks for the excellent work!
However, I can't find the icon: requests to the server root and `/favicon.ico` all return 404.
lc@lc-ConceptD-CT500-51A:~/work/api-for-open-llm$ python3 server.py
2024-04-19 16:18:05.953 | DEBUG | api.config::338 - SETTINGS: {
"embedding_name": "/home/lc/work/QAnything/netease-youdao/bce-embedding-base_v1",
"rerank_name": "/home/lc/work/QAnything/netease-youdao/bce-reranker-base_v1",
"embedding_size": -1,
"embedding_device": "cuda:0",
"rerank_device": "cuda:0",
"model_name": "qwen",
"model_path": "/media/lc/lc/Qwen-1_8B-Chat",
"dtype": "half",
"load_in_8bit": false,
"load_in_4bit": false,
"context_length": -1,
"chat_template": null,
"rope_scaling": null,
"flash_attn": false,
"use_streamer_v2": false,
"interrupt_requests": true,
"host": "0.0.0.0",
"port": 8090,
"api_prefix": "/v1",
"engine": "default",
"tasks": [
"llm",
"rag"
],
"device_map": "auto",
"gpus": "0",
"num_gpus": 1,
"activate_inference": true,
"model_names": [
"qwen",
"bce-embedding-base_v1",
"bce-reranker-base_v1"
],
"api_keys": null
}
/usr/local/lib/python3.10/dist-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
return self.fget.__get__(instance, owner)()
2024-04-19 16:18:10.411 | INFO | api.rag.models.rerank:__init__:51 - Loading from /home/lc/work/QAnything/netease-youdao/bce-reranker-base_v1.
2024-04-19 16:18:10.546 | INFO | api.rag.models.rerank:__init__:77 - Execute device: cuda:0; gpu num: 1; use fp16: False
2024-04-19 16:18:10.818 | INFO | api.adapter.patcher:patch_tokenizer:119 - Add eos token: <|endoftext|>
2024-04-19 16:18:10.819 | INFO | api.adapter.patcher:patch_tokenizer:126 - Add pad token: <|endoftext|>
Try importing flash-attention for faster inference...
Warning: import flash_attn rotary fail, please install FlashAttention rotary to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/rotary
Warning: import flash_attn rms_norm fail, please install FlashAttention layer_norm to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/layer_norm
Warning: import flash_attn fail, please install FlashAttention to get higher efficiency https://github.com/Dao-AILab/flash-attention
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 2.20it/s]
2024-04-19 16:18:12.005 | INFO | api.models:create_hf_llm:81 - Using default engine
2024-04-19 16:18:12.006 | INFO | api.core.default:_check_construct_prompt:126 - Using Qwen Model for Chat!
INFO: Started server process [14744]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8090 (Press CTRL+C to quit)
INFO: 127.0.0.1:45580 - "GET /v1 HTTP/1.1" 404 Not Found
INFO: 127.0.0.1:45580 - "GET /favicon.ico HTTP/1.1" 404 Not Found
INFO: 127.0.0.1:45580 - "GET /v1 HTTP/1.1" 404 Not Found
INFO: 127.0.0.1:45580 - "GET /favicon.ico HTTP/1.1" 404 Not Found
INFO: 127.0.0.1:45580 - "GET /v1 HTTP/1.1" 404 Not Found
INFO: 127.0.0.1:45580 - "GET /favicon.ico HTTP/1.1" 404 Not Found
INFO: 127.0.0.1:45580 - "GET / HTTP/1.1" 404 Not Found
INFO: 127.0.0.1:45580 - "GET /favicon.ico HTTP/1.1" 404 Not Found
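The 404s above are expected behavior rather than a startup failure: the SETTINGS dump shows `api_prefix: "/v1"`, so the server only registers API routes under that prefix and serves no HTML page or favicon at `/`. A minimal sketch of where requests should go instead, assuming the project follows the OpenAI chat-completions route convention (verify the exact route against this project's docs):

```python
# The SETTINGS dump shows host "0.0.0.0", port 8090 and api_prefix "/v1".
# A browser GET on "/" or "/v1" returns 404 because only API routes are
# registered, and the browser's automatic /favicon.ico request 404s too:
# the server ships no icon. Chat requests go to an OpenAI-style route
# (route name assumed from the OpenAI API convention, not confirmed here).

def chat_endpoint(host: str, port: int, api_prefix: str) -> str:
    """Build the chat-completions URL from the server settings."""
    return f"http://{host}:{port}{api_prefix}/chat/completions"

print(chat_endpoint("127.0.0.1", 8090, "/v1"))
```

With the server running, POSTing a chat request to that URL (or listing models via `/v1/models`, if the project exposes it) should return JSON instead of 404.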
Dependencies
运行日志或截图 | Runtime logs or screenshots
![Screenshot from 2024-04-19 16-21-15](https://github.com/xusenlinzy/api-for-open-llm/assets/3146209/5bed9a45-b7e9-42f1-a383-37237600c894)