Maybe we could just run uvicorn with multiple workers: https://www.uvicorn.org/settings/#production
Running multiple workers right now will instantiate multiple models, which is great for speed but will increase GPU memory usage; see my nvidia-smi output with --workers 4 below:
alex@hq3 ~> nvidia-smi
Fri Apr 21 11:12:54 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:08:00.0  On |                  N/A |
| 51%   72C    P2   343W / 350W |   9962MiB / 24576MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      3291      G   /usr/lib/xorg/Xorg                442MiB |
|    0   N/A  N/A      3442      G   /usr/bin/gnome-shell               66MiB |
|    0   N/A  N/A      4425      G   ...veSuggestionsOnlyOnDemand       82MiB |
|    0   N/A  N/A      6918      G   ...RendererForSitePerProcess      122MiB |
|    0   N/A  N/A      8204      G   ...RendererForSitePerProcess       26MiB |
|    0   N/A  N/A      9777      G   ...9/usr/lib/firefox/firefox      232MiB |
|    0   N/A  N/A     50710      C   ...rs-models/env/bin/python3     2158MiB |
|    0   N/A  N/A     50711      C   ...rs-models/env/bin/python3     2224MiB |
|    0   N/A  N/A     50712      C   ...rs-models/env/bin/python3     2292MiB |
|    0   N/A  N/A     50713      C   ...rs-models/env/bin/python3     2308MiB |
+-----------------------------------------------------------------------------+
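The four ~2.2GiB python3 processes at the bottom are the uvicorn workers, each holding its own copy of the model. For context, here's a minimal sketch of that kind of multi-worker launch, equivalent to passing --workers 4 on the uvicorn command line; the app:app import path and port are assumptions, not necessarily the project's actual entrypoint:

```python
# Hypothetical launch script: start the FastAPI app with four uvicorn worker
# processes. Each worker is a separate process that loads its own copy of the
# model onto the GPU, which is why four ~2.2GiB python3 entries show up in
# the nvidia-smi output above.
import uvicorn

if __name__ == "__main__":
    # workers > 1 requires the app to be given as an import string
    uvicorn.run("app:app", host="0.0.0.0", port=8080, workers=4)
```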
Over the past few months I've noticed that my k8s-deployed Weaviate instance (relying on t2v-transformers) has randomly been responding with 500s. After some digging I found that the transformers-inference pod was being killed due to failing liveness and readiness probes, which caused it to be restarted. After taking a look at the code, it seems the "async" vectorizer was not actually asynchronous, and each vectorization request could block the entire app. To fix this, I've wrapped each vectorization request in a ThreadPoolExecutor future. This should strictly improve the availability of the liveness and readiness probes and alleviate the restarting issue. On top of that, the executor also lets tasks overlap, minimizing I/O wait between them, which gives about a 3x speed boost on my machine :rocket:
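For anyone skimming, here's a minimal sketch of the pattern the change applies; the route paths, VectorInput model, and vectorize function are placeholders standing in for the real app code, not copied from it:

```python
# Sketch: offload the blocking vectorization work to a thread pool so the
# asyncio event loop stays free to answer liveness/readiness probes.
import asyncio
from concurrent.futures import ThreadPoolExecutor

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
executor = ThreadPoolExecutor()  # shared pool; submissions return futures


class VectorInput(BaseModel):
    text: str


def vectorize(text: str) -> list[float]:
    # Placeholder for the model forward pass: the blocking, GPU-bound work
    # that previously ran directly inside the async request handler.
    return [0.0] * 384  # dummy vector


@app.get("/.well-known/live")
@app.get("/.well-known/ready")
async def probes():
    # These stay responsive even while vectorization requests are in flight,
    # so k8s no longer kills the pod under load.
    return {}


@app.post("/vectors")
async def vectors(item: VectorInput):
    loop = asyncio.get_running_loop()
    # Run the blocking call on the executor instead of awaiting it inline,
    # so it no longer blocks the event loop.
    vector = await loop.run_in_executor(executor, vectorize, item.text)
    return {"text": item.text, "vector": vector}
```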
A screenshot of my stress test before:
And after:
And here is a link to the stress test I used, in case anyone wants to try it out themselves.
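The link itself didn't survive the copy here, but the shape of such a stress test is simple enough to sketch: fire a batch of concurrent requests at the vectorizer and time it. The URL, route, and payload below are assumptions about the service, not taken from the actual script.

```python
# Rough stand-in for a stress test: send CONCURRENCY requests at once and
# time the whole batch. Endpoint and payload shape are assumed.
import asyncio
import time

import httpx

URL = "http://localhost:8080/vectors"
CONCURRENCY = 64


async def one_request(client: httpx.AsyncClient, i: int) -> None:
    resp = await client.post(URL, json={"text": f"stress test sentence {i}"})
    resp.raise_for_status()


async def main() -> None:
    start = time.perf_counter()
    async with httpx.AsyncClient(timeout=60.0) as client:
        await asyncio.gather(*(one_request(client, i) for i in range(CONCURRENCY)))
    print(f"{CONCURRENCY} requests in {time.perf_counter() - start:.2f}s")


if __name__ == "__main__":
    asyncio.run(main())
```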