winstxnhdw / nllb-api

A performant high-throughput CPU-based API for Meta's No Language Left Behind (NLLB) using CTranslate2, hosted on Hugging Face Spaces.
https://huggingface.co/spaces/winstxnhdw/nllb-api

cuda: Flash attention 2 is not supported #220

Closed · JulianKropp closed this issue 2 months ago

JulianKropp commented 2 months ago

I just ran the commands below using the Docker CUDA version. The server started, but when I try to translate text using the example in Swagger, it throws the error below. Language detection works fine.

docker build -f Dockerfile.cuda-build -t nllb-api .
docker run --rm --gpus all \
  -e SERVER_PORT=7860 \
  -p 7860:7860 \
  nllb-api
$ docker run --rm --gpus all   -e SERVER_PORT=7860   -p 7860:7860   nllb-api
The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.
0it [00:00, ?it/s]
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
[INFO] Starting granian (main PID: 1)
[INFO] Listening at: http://0.0.0.0:7860
[INFO] Spawning worker-1 with pid: 23
Fetching 9 files: 100%|██████████| 9/9 [00:00<00:00, 41165.47it/s]
/usr/local/lib/python3.12/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
  warnings.warn(
Fetching 9 files: 100%|██████████| 9/9 [00:00<00:00, 25540.42it/s]
[INFO] Started worker-1
[INFO] Started worker-1 runtime-1
[2024-09-13 18:22:32 +0000] 200 "GET /schema/swagger HTTP/1.1" 10.0.3.1 in 7.29 ms
Application Exception
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/site-packages/litestar/middleware/_internal/exceptions/middleware.py", line 159, in __call__
    await self.app(scope, receive, capture_response_started)
  File "/usr/local/lib/python3.12/site-packages/litestar/routes/http.py", line 80, in handle
    response = await self._get_response_for_request(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/litestar/routes/http.py", line 132, in _get_response_for_request
    return await self._call_handler_function(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/litestar/routes/http.py", line 152, in _call_handler_function
    response_data, cleanup_group = await self._get_response_data(
                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/litestar/routes/http.py", line 205, in _get_response_data
    data = await route_handler.fn(**parsed_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/app/server/api/v3/translate.py", line 57, in translate_get
    return Translated(result=await TranslatorPool.translate(text, source, target))
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/app/server/features/translator.py", line 130, in translate
    return await wrap_future(
           ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/app/server/features/translator.py", line 69, in translate
    results = self.translator.translate_batch(
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Flash attention 2 is not supported
[2024-09-13 18:22:55 +0000] 500 "GET /v3/translate HTTP/1.1" 10.0.3.1 in 159.08 ms
$ curl 'http://127.0.0.1:7860/api/v3/translate?text=Hello&source=eng_Latn&target=spa_Latn'
{"detail":"Internal Server Error"}
winstxnhdw commented 2 months ago

Right... CTranslate2 just removed Flash Attention. I’ll fix it in a bit.
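
If you need a workaround before the fix lands, dropping the flag should be enough, since flash_attention defaults to False in the Translator constructor. A sketch, not the actual server code:

import ctranslate2

# Omit flash_attention so CTranslate2 falls back to its default
# attention kernels on CUDA.
translator = ctranslate2.Translator("nllb-ct2", device="cuda")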

winstxnhdw commented 2 months ago

Fixed in latest.

JulianKropp commented 2 months ago

You are fast. Thanks!