Hi, how does your FastAPI service handle multiple requests?
When I send 2 requests to 0.0.0.0:8001/v1/completions at the same time, the service goes down ((((
Error:
Llama.generate: prefix-match hit
Llama.generate: prefix-match hit
CUDA error: invalid argument
current device: 0, in function ggml_cuda_op_flatten at /tmp/pip-install-j25d97x1/llama-cpp-python_a5fa4aaf042f4d4abc70877705337ac9/vendor/llama.cpp/ggml-cuda.cu:8814
ggml_cuda_cpy_tensor_2d(src1_ddf, src1, 0, 0, 0, nrows1, main_stream)
GGML_ASSERT: /tmp/pip-install-j25d97x1/llama-cpp-python_a5fa4aaf042f4d4abc70877705337ac9/vendor/llama.cpp/ggml-cuda.cu:237: !"CUDA error"
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using `tokenizers` before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
make: *** [Makefile:36: run] Aborted (core dumped)
Looking at your UI code (ui.py), I noticed that the get_ui_blocks method handles requests in a queue, one by one. Is it possible to handle multiple requests at the same time? (A rough sketch of what I mean is below.)
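For illustration only, here is a minimal sketch (not this repo's actual code) of one way to keep a single llama-cpp-python model from being hit by two generations at once: serialize access behind an asyncio.Lock and run the blocking call in a worker thread. The names llm, CompletionRequest, and the model path are assumptions, not taken from this project.

import asyncio

from fastapi import FastAPI
from pydantic import BaseModel
from llama_cpp import Llama

app = FastAPI()
llm = Llama(model_path="model.gguf")  # hypothetical model path
generate_lock = asyncio.Lock()        # only one request touches the model at a time


class CompletionRequest(BaseModel):
    prompt: str
    max_tokens: int = 128


@app.post("/v1/completions")
async def completions(req: CompletionRequest):
    # A single llama.cpp context is not safe for concurrent generation,
    # so concurrent requests wait here instead of crashing the CUDA backend.
    async with generate_lock:
        # Run the blocking generation in a thread so the event loop stays responsive.
        result = await asyncio.to_thread(llm, req.prompt, max_tokens=req.max_tokens)
    return result

With something like this, two simultaneous POSTs to /v1/completions would be answered one after the other instead of aborting the process; true parallelism would need multiple model instances or a batching layer.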