weaviate / t2v-transformers-models

This is the repo for the container that holds the models for the text2vec-transformers module
BSD 3-Clause "New" or "Revised" License
40 stars 27 forks

arm64 not able to detect CUDA #77

Closed yuliyantsvetkov closed 7 months ago

yuliyantsvetkov commented 7 months ago

While trying to run the weaviate helm chart with the text2vec-transformers on Jetson Xavier NX with the latest JetPack I got this from the nvidia-container engine:

k logs -f transformers-inference-845fd6bf68-cznmv
INFO:     Started server process [19]
INFO:     Waiting for application startup.
INFO:     CUDA_PER_PROCESS_MEMORY_FRACTION set to 1.0
INFO:     CUDA_CORE set to cuda:0
ERROR:    Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/starlette/routing.py", line 734, in lifespan
    async with self.lifespan_context(app) as maybe_state:
  File "/usr/local/lib/python3.11/site-packages/starlette/routing.py", line 610, in __aenter__
    await self._router.startup()
  File "/usr/local/lib/python3.11/site-packages/starlette/routing.py", line 713, in startup
    handler()
  File "/app/app.py", line 75, in startup_event
    vec = Vectorizer(model_dir, cuda_support, cuda_core, cuda_per_process_memory_fraction,
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/vectorizer.py", line 49, in __init__
    self.vectorizer = HuggingFaceVectorizer(model_path, cuda_support, cuda_core, cuda_per_process_memory_fraction, model_type, architecture, direct_tokenize)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/vectorizer.py", line 121, in __init__
    self.model.to(self.cuda_core)
  File "/usr/local/lib/python3.11/site-packages/transformers/modeling_utils.py", line 2556, in to
    return super().to(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1145, in to
    return self._apply(convert)
           ^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
  File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
  File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 820, in _apply
    param_applied = fn(param)
                    ^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1143, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/torch/cuda/__init__.py", line 239, in _lazy_init
    raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled

ERROR:    Application startup failed. Exiting.
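
The AssertionError is raised by torch itself: a CPU-only torch wheel reports `torch.version.cuda` as None regardless of what the node or device plugin exposes, which is exactly the "Torch not compiled with CUDA enabled" situation. A small diagnostic sketch (the helper function name is mine, not part of the module) separates the two possible failure modes:

```python
def cuda_build_status(cuda_version, cuda_available):
    """Classify a torch install from torch.version.cuda and
    torch.cuda.is_available()."""
    if cuda_version is None:
        # Wheel was built without CUDA; no runtime config can fix this.
        return "cpu-only build: reinstall a CUDA-enabled torch wheel"
    if not cuda_available:
        # Wheel supports CUDA, but driver/GPU is not usable at runtime.
        return f"built for CUDA {cuda_version}, but no usable GPU/driver at runtime"
    return f"CUDA {cuda_version} ready"

if __name__ == "__main__":
    try:
        import torch
        print(cuda_build_status(torch.version.cuda, torch.cuda.is_available()))
    except ImportError:
        print("torch not installed")
```

Run inside the failing container, this tells you whether to fix the image (CPU-only wheel) or the node/runtime configuration.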

That k3s node is otherwise working fine; I suspect the torch build in the image simply cannot use the Tegra libraries.

I tried a couple of LD library path configurations, but without luck:

    envconfig:
      # enable for CUDA support. Your K8s cluster needs to be configured
      # accordingly and you need to explicitly set GPU requests & limits below
      enable_cuda: true

      # only used when CUDA is enabled
      nvidia_visible_devices: all
      nvidia_driver_capabilities: compute,utility

      # only used when CUDA is enabled
      #ld_library_path: /usr/local/nvidia/lib64
      #ld_library_path: /usr/local/cuda/lib64
      ld_library_path: /usr/lib/aarch64-linux-gnu/tegra
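
Independent of the torch build question, you can sanity-check whether any of these ld_library_path values actually makes the CUDA libraries visible inside the pod. A quick sketch (run it via kubectl exec in the transformers container; library names are the standard CUDA runtime and NVML ones):

```python
import ctypes.util

# See whether the CUDA runtime and NVML are reachable via the library
# search path (which includes LD_LIBRARY_PATH). Note: even when they
# are found, a CPU-only torch wheel will still fail with "Torch not
# compiled with CUDA enabled" -- the two problems are independent.
for lib in ("cudart", "nvidia-ml"):
    path = ctypes.util.find_library(lib)
    print(f"lib{lib}: {path or 'not found on library search path'}")
```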

nvdp-nvidia-device-plugin is running without issues:

I0326 05:19:27.039676       1 main.go:154] Starting FS watcher.
I0326 05:19:27.040261       1 main.go:161] Starting OS watcher.
I0326 05:19:27.042011       1 main.go:176] Starting Plugins.
I0326 05:19:27.042112       1 main.go:234] Loading configuration.
I0326 05:19:27.042985       1 main.go:242] Updating config with default resource matching patterns.
I0326 05:19:27.044077       1 main.go:253]
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": true,
    "nvidiaDriverRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "plugin": {
      "passDeviceSpecs": false,
      "deviceListStrategy": [
        "envvar"
      ],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": "cdi.k8s.io/",
      "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
      "containerDriverRoot": "/driver-root"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {}
  }
}
I0326 05:19:27.044153       1 main.go:256] Retreiving plugins.
W0326 05:19:27.046699       1 factory.go:31] No valid resources detected, creating a null CDI handler
I0326 05:19:27.047010       1 factory.go:107] Detected non-NVML platform: could not load NVML library: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
I0326 05:19:27.048690       1 factory.go:107] Detected Tegra platform: /sys/devices/soc0/family has 'tegra' prefix
I0326 05:19:27.050143       1 server.go:165] Starting GRPC server for 'nvidia.com/gpu'
I0326 05:19:27.057261       1 server.go:117] Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
I0326 05:19:27.072596       1 server.go:125] Registered device plugin for 'nvidia.com/gpu' with Kubelet

Shall I build a new image with JetPack included so that torch can detect CUDA?

yuliyantsvetkov commented 7 months ago

Fixed by building a new arm64 image based on NVIDIA's L4T PyTorch image, nvcr.io/nvidia/l4t-pytorch:r35.2.1-pth2.0-py3:

INFO:     Started server process [20]
INFO:     Waiting for application startup.
INFO:     CUDA_PER_PROCESS_MEMORY_FRACTION set to 1.0
INFO:     CUDA_CORE set to cuda:0
/usr/local/lib/python3.8/dist-packages/torch/storage.py:315: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.
  warnings.warn(message, UserWarning)
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)
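
The fix above can be sketched as a Dockerfile. This is a sketch under assumptions, not the repo's actual build: file names, dependencies, and the entrypoint may differ in t2v-transformers-models. The essential point is the L4T base image, which ships a torch build compiled against Jetson's CUDA/Tegra libraries.

```dockerfile
# Assumption-laden sketch of an arm64/Jetson build of the inference container.
FROM nvcr.io/nvidia/l4t-pytorch:r35.2.1-pth2.0-py3

WORKDIR /app

# Install the app's remaining dependencies. torch must NOT appear in
# requirements.txt, or pip may replace the preinstalled Jetson build
# with a CPU-only wheel and reintroduce the original error.
COPY requirements.txt .
RUN pip3 install -r requirements.txt

COPY . .

ENTRYPOINT ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8080"]
```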

I can create a branch if anyone is interested in GPU inference for weaviate on Jetson.