Maybe we could just run uvicorn with multiple workers: https://www.uvicorn.org/settings/#production
Running multiple workers right now will instantiate multiple models, which is great for speed but will increase GPU memory usage; see my nvidia-smi output with --workers 4 below:
alex@hq3 ~> nvidia-smi
Fri Apr 21 11:12:54 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:08:00.0  On |                  N/A |
| 51%   72C    P2   343W / 350W |   9962MiB / 24576MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      3291      G   /usr/lib/xorg/Xorg                442MiB |
|    0   N/A  N/A      3442      G   /usr/bin/gnome-shell               66MiB |
|    0   N/A  N/A      4425      G   ...veSuggestionsOnlyOnDemand       82MiB |
|    0   N/A  N/A      6918      G   ...RendererForSitePerProcess      122MiB |
|    0   N/A  N/A      8204      G   ...RendererForSitePerProcess       26MiB |
|    0   N/A  N/A      9777      G   ...9/usr/lib/firefox/firefox      232MiB |
|    0   N/A  N/A     50710      C   ...rs-models/env/bin/python3     2158MiB |
|    0   N/A  N/A     50711      C   ...rs-models/env/bin/python3     2224MiB |
|    0   N/A  N/A     50712      C   ...rs-models/env/bin/python3     2292MiB |
|    0   N/A  N/A     50713      C   ...rs-models/env/bin/python3     2308MiB |
+-----------------------------------------------------------------------------+
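The four ~2.2GiB python3 processes at the bottom are the uvicorn workers, each holding its own copy of the model. For context, here's a minimal sketch of that kind of multi-worker launch, equivalent to passing --workers 4 on the uvicorn command line; the app:app import path and port are assumptions, not necessarily the project's actual entrypoint:

```python
# Hypothetical launch script: start the FastAPI app with four uvicorn worker
# processes. Each worker is a separate process that loads its own copy of the
# model onto the GPU, which is why four ~2.2GiB python3 entries show up in
# the nvidia-smi output above.
import uvicorn

if __name__ == "__main__":
    # workers > 1 requires the app to be given as an import string
    uvicorn.run("app:app", host="0.0.0.0", port=8080, workers=4)
```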
Over the past few months I've noticed that my k8s-deployed Weaviate instance (relying on t2v-transformers) has randomly been responding with 500s. After some digging I found that the transformers-inference pod was being killed due to failing liveness and readiness probes, which caused it to be restarted. After taking a look at the code, it seems the "async" vectorizer was not actually asynchronous, and each vectorization request could block the entire app. To fix this, I've wrapped each vectorization request in a ThreadPoolExecutor future. This should strictly improve the availability of the liveness and readiness probes and alleviate the restarting issue. On top of that, the executor also lets tasks overlap, minimizing I/O wait between them, which gives about a 3x speed boost on my machine :rocket:
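For anyone skimming, here's a minimal sketch of the pattern the change applies; the route paths, VectorInput model, and vectorize function are placeholders standing in for the real app code, not copied from it:

```python
# Sketch: offload the blocking vectorization work to a thread pool so the
# asyncio event loop stays free to answer liveness/readiness probes.
import asyncio
from concurrent.futures import ThreadPoolExecutor

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
executor = ThreadPoolExecutor()  # shared pool; submissions return futures


class VectorInput(BaseModel):
    text: str


def vectorize(text: str) -> list[float]:
    # Placeholder for the model forward pass: the blocking, GPU-bound work
    # that previously ran directly inside the async request handler.
    return [0.0] * 384  # dummy vector


@app.get("/.well-known/live")
@app.get("/.well-known/ready")
async def probes():
    # These stay responsive even while vectorization requests are in flight,
    # so k8s no longer kills the pod under load.
    return {}


@app.post("/vectors")
async def vectors(item: VectorInput):
    loop = asyncio.get_running_loop()
    # Run the blocking call on the executor instead of awaiting it inline,
    # so it no longer blocks the event loop.
    vector = await loop.run_in_executor(executor, vectorize, item.text)
    return {"text": item.text, "vector": vector}
```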
A screenshot of my stress test before:
And after:
And here is a link to the stress test I used, in case anyone wants to try it out themselves.
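The link itself didn't survive the copy here, but the shape of such a stress test is simple enough to sketch: fire a batch of concurrent requests at the vectorizer and time it. The URL, route, and payload below are assumptions about the service, not taken from the actual script.

```python
# Rough stand-in for a stress test: send CONCURRENCY requests at once and
# time the whole batch. Endpoint and payload shape are assumed.
import asyncio
import time

import httpx

URL = "http://localhost:8080/vectors"
CONCURRENCY = 64


async def one_request(client: httpx.AsyncClient, i: int) -> None:
    resp = await client.post(URL, json={"text": f"stress test sentence {i}"})
    resp.raise_for_status()


async def main() -> None:
    start = time.perf_counter()
    async with httpx.AsyncClient(timeout=60.0) as client:
        await asyncio.gather(*(one_request(client, i) for i in range(CONCURRENCY)))
    print(f"{CONCURRENCY} requests in {time.perf_counter() - start:.2f}s")


if __name__ == "__main__":
    asyncio.run(main())
```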