xorbitsai / xorbits

Scalable Python DS & ML, in an API compatible & lightning fast way.
https://xorbits.readthedocs.io
Apache License 2.0

BUG: GPU init error in my conda environment #680

Closed luweizheng closed 11 months ago

luweizheng commented 1 year ago

Describe the bug

I created an environment with conda and installed the requirements. When I run xorbits.init() on a GPU node, I get the following error.

Python: 3.8.16
xorbits: 0.5.2
cupy: 11.6.0
cudf: 23.04.01
numba: 0.56.4

Traceback (most recent call last):
  File "/fs/fast/u20200002/envs/ucxcu/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py", line 247, in ensure_initialized
    self.cuInit(0)
  File "/fs/fast/u20200002/envs/ucxcu/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py", line 320, in safe_cuda_api_call
    self._check_ctypes_error(fname, retcode)
  File "/fs/fast/u20200002/envs/ucxcu/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py", line 388, in _check_ctypes_error
    raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [100] Call to cuInit results in CUDA_ERROR_NO_DEVICE

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/fs/fast/u20200002/envs/ucxcu/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/fs/fast/u20200002/envs/ucxcu/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/fs/fast/u20200002/envs/ucxcu/lib/python3.8/site-packages/xoscar/backends/indigen/pool.py", line 256, in _start_sub_pool
    asyncio.run(coro)
  File "/fs/fast/u20200002/envs/ucxcu/lib/python3.8/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/fs/fast/u20200002/envs/ucxcu/lib/python3.8/asyncio/base_events.py", line 616, in run_until_complete
    return future.result()
  File "/fs/fast/u20200002/envs/ucxcu/lib/python3.8/site-packages/xoscar/backends/indigen/pool.py", line 271, in _create_sub_pool
    pool = await SubActorPool.create(
  File "/fs/fast/u20200002/envs/ucxcu/lib/python3.8/site-packages/xoscar/backends/pool.py", line 793, in create
    TypeDispatcher.reload_all_lazy_handlers()
  File "xoscar/_utils.pyx", line 110, in xoscar._utils.TypeDispatcher.reload_all_lazy_handlers
  File "xoscar/_utils.pyx", line 69, in xoscar._utils.TypeDispatcher._reload_lazy_handlers
  File "xoscar/_utils.pyx", line 74, in xoscar._utils.TypeDispatcher._reload_lazy_handlers
  File "/fs/fast/u20200002/envs/ucxcu/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 671, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 843, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/fs/fast/u20200002/envs/ucxcu/lib/python3.8/site-packages/cudf/__init__.py", line 21, in <module>
    from cudf.core.algorithms import factorize
  File "/fs/fast/u20200002/envs/ucxcu/lib/python3.8/site-packages/cudf/core/algorithms.py", line 9, in <module>
    from cudf.core.indexed_frame import IndexedFrame
  File "/fs/fast/u20200002/envs/ucxcu/lib/python3.8/site-packages/cudf/core/indexed_frame.py", line 57, in <module>
    from cudf.core.groupby.groupby import GroupBy
  File "/fs/fast/u20200002/envs/ucxcu/lib/python3.8/site-packages/cudf/core/groupby/__init__.py", line 3, in <module>
    from cudf.core.groupby.groupby import GroupBy, Grouper
  File "/fs/fast/u20200002/envs/ucxcu/lib/python3.8/site-packages/cudf/core/groupby/groupby.py", line 28, in <module>
    from cudf.core.udf.groupby_utils import jit_groupby_apply
  File "/fs/fast/u20200002/envs/ucxcu/lib/python3.8/site-packages/cudf/core/udf/groupby_utils.py", line 11, in <module>
    import cudf.core.udf.utils
  File "/fs/fast/u20200002/envs/ucxcu/lib/python3.8/site-packages/cudf/core/udf/utils.py", line 121, in <module>
    _PTX_FILE = _get_ptx_file(os.path.dirname(__file__), "shim_")
  File "/fs/fast/u20200002/envs/ucxcu/lib/python3.8/site-packages/cudf/core/udf/utils.py", line 87, in _get_ptx_file
    dev = cuda.get_current_device()
  File "/fs/fast/u20200002/envs/ucxcu/lib/python3.8/site-packages/numba/cuda/api.py", line 435, in get_current_device
    return current_context().device
  File "/fs/fast/u20200002/envs/ucxcu/lib/python3.8/site-packages/numba/cuda/cudadrv/devices.py", line 220, in get_context
    return _runtime.get_or_create_context(devnum)
  File "/fs/fast/u20200002/envs/ucxcu/lib/python3.8/site-packages/numba/cuda/cudadrv/devices.py", line 138, in get_or_create_context
    return self._get_or_create_context_uncached(devnum)
  File "/fs/fast/u20200002/envs/ucxcu/lib/python3.8/site-packages/numba/cuda/cudadrv/devices.py", line 153, in _get_or_create_context_uncached
    with driver.get_active_context() as ac:
  File "/fs/fast/u20200002/envs/ucxcu/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py", line 488, in __enter__
    driver.cuCtxGetCurrent(byref(hctx))
  File "/fs/fast/u20200002/envs/ucxcu/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py", line 285, in __getattr__
    self.ensure_initialized()
  File "/fs/fast/u20200002/envs/ucxcu/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py", line 251, in ensure_initialized
    raise CudaSupportError(f"Error at driver init: {description}")
numba.cuda.cudadrv.error.CudaSupportError: Error at driver init: Call to cuInit results in CUDA_ERROR_NO_DEVICE (100)

Expected behavior

The GPU program should run.
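The traceback above shows cuInit failing inside a worker subprocess while cudf is being imported. A minimal sketch for checking whether numba can initialise CUDA in a fresh child process is given below. The helper name `probe_cuda_in_child` is hypothetical, and reporting `CUDA_VISIBLE_DEVICES` is only an assumption about what may be worth inspecting; nothing in this report confirms it as the cause.

```python
import os
import subprocess
import sys

# Code executed in the child interpreter. It only asks numba whether a CUDA
# device can be initialised and prints a one-line report.
_CHILD_CODE = """
try:
    from numba import cuda
    print("available" if cuda.is_available() else "unavailable")
except Exception as exc:
    print(f"error: {type(exc).__name__}: {exc}")
"""


def probe_cuda_in_child(env_overrides=None):
    """Run the numba CUDA check in a separate Python process.

    env_overrides is an optional dict of environment variables applied on
    top of the current environment (hypothetical convenience parameter).
    """
    env = dict(os.environ)
    env.update(env_overrides or {})
    proc = subprocess.run(
        [sys.executable, "-c", _CHILD_CODE],
        env=env,
        capture_output=True,
        text=True,
        timeout=120,
    )
    return {
        "CUDA_VISIBLE_DEVICES": env.get("CUDA_VISIBLE_DEVICES"),
        "returncode": proc.returncode,
        "report": proc.stdout.strip() or proc.stderr.strip(),
    }


if __name__ == "__main__":
    print(probe_cuda_in_child())
```

If the report is "unavailable" or an error in the child but the same check succeeds in the parent interpreter, the two processes see different environments, which narrows down where to look.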

ChengjieLi28 commented 12 months ago

Hi @luweizheng. Since RAPIDS no longer supports Python 3.8 (https://docs.rapids.ai/install#selector), could you please try Python 3.9? Here is my configuration, which works on our machines.

cupy 11.6.0
cudf 22.10.1+2.gca9a422da9
numba 0.56.3

I recommend following the official RAPIDS installation page (https://docs.rapids.ai/install#selector) to select your CUDA version and install it.
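To compare an environment against the versions listed in this thread, a small sketch like the following collects the installed versions. The helper name `installed_versions` is hypothetical, and the distribution names are an assumption: they match a conda install, while pip wheels may be published under other names (for example with a CUDA suffix), in which case the lookup returns None.

```python
from importlib import metadata

# Distribution names as they appear in this thread (assumed to match the
# names registered in the environment's package metadata).
PACKAGES = ("xorbits", "cupy", "cudf", "numba")


def installed_versions(names=PACKAGES):
    """Return a dict mapping each package name to its version, or None."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Not installed, or installed under a different distribution name.
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version}")
```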

luweizheng commented 12 months ago

@ChengjieLi28 I created a new environment and installed Python 3.10 and cudf 23.08.

>>> import cupy as cp
>>> x_gpu = cp.array([1, 2, 3])
>>> l2_gpu = cp.linalg.norm(x_gpu)
>>> import cudf
>>> s = cudf.Series([1, 2, 3, None, 4])
>>> s
0       1
1       2
2       3
3    <NA>
4       4
dtype: int64

These cupy and cudf functions work, but the following does not:

import xorbits
xorbits.init()

Here is the full log: xorbits_cuda.log.

luweizheng commented 12 months ago

I installed rapids 22.10 and now Xorbits works.

mamba install -c rapidsai -c conda-forge -c nvidia cudf=22.10 python=3.10 cuda-version=11.7