Hi, how does your FastAPI service handle multiple requests?
When I send 2 requests to 0.0.0.0:8001/v1/completions at the same time, the service goes down ((((
Error:
Llama.generate: prefix-match hit
Llama.generate: prefix-match hit
CUDA error: invalid argument
current device: 0, in function ggml_cuda_op_flatten at /tmp/pip-install-j25d97x1/llama-cpp-python_a5fa4aaf042f4d4abc70877705337ac9/vendor/llama.cpp/ggml-cuda.cu:8814
ggml_cuda_cpy_tensor_2d(src1_ddf, src1, 0, 0, 0, nrows1, main_stream)
GGML_ASSERT: /tmp/pip-install-j25d97x1/llama-cpp-python_a5fa4aaf042f4d4abc70877705337ac9/vendor/llama.cpp/ggml-cuda.cu:237: !"CUDA error"
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using `tokenizers` before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
make: *** [Makefile:36: run] Aborted (core dumped)
Looking at your UI code (ui.py), I noticed that the get_ui_blocks method handles requests in a queue, one by one. Is it possible to handle multiple requests at the same time? (A rough sketch of what I mean is below.)
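For illustration only, here is a minimal sketch (not this repo's actual code) of one way to keep a single llama-cpp-python model from being hit by two generations at once: serialize access behind an asyncio.Lock and run the blocking call in a worker thread. The names llm, CompletionRequest, and the model path are assumptions, not taken from this project.

import asyncio

from fastapi import FastAPI
from pydantic import BaseModel
from llama_cpp import Llama

app = FastAPI()
llm = Llama(model_path="model.gguf")  # hypothetical model path
generate_lock = asyncio.Lock()        # only one request touches the model at a time


class CompletionRequest(BaseModel):
    prompt: str
    max_tokens: int = 128


@app.post("/v1/completions")
async def completions(req: CompletionRequest):
    # A single llama.cpp context is not safe for concurrent generation,
    # so concurrent requests wait here instead of crashing the CUDA backend.
    async with generate_lock:
        # Run the blocking generation in a thread so the event loop stays responsive.
        result = await asyncio.to_thread(llm, req.prompt, max_tokens=req.max_tokens)
    return result

With something like this, two simultaneous POSTs to /v1/completions would be answered one after the other instead of aborting the process; true parallelism would need multiple model instances or a batching layer.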