GPU not utilized correctly, slower than CPU

rrydbirk commented 1 year ago

I'm running the same data side-by-side on 32 CPU node and a 12 CPU / 1 A100 GPU node. It seems the GPU node is ~1 s/it slower than the CPU node. Could you advise me on what I'm doing wrong?

For the GPU node:

/work/01_notebooks via 🅒 velovae 
[ 09:23:39 ] ➜  pip freeze
anndata==0.9.2
anyio==4.0.0
argon2-cffi==23.1.0
argon2-cffi-bindings==21.2.0
arrow==1.2.3
asttokens==2.4.0
async-lru==2.0.4
attrs==23.1.0
Babel==2.12.1
backcall==0.2.0
beautifulsoup4==4.12.2
bleach==6.0.0
certifi==2023.7.22
cffi==1.16.0
charset-normalizer==3.2.0
click==8.1.7
cmake==3.27.5
comm==0.1.4
contourpy==1.1.1
cycler==0.11.0
debugpy==1.8.0
decorator==5.1.1
defusedxml==0.7.1
executing==1.2.0
fastjsonschema==2.18.0
filelock==3.12.4
fonttools==4.42.1
fqdn==1.5.1
h5py==3.9.0
hnswlib==0.7.0
idna==3.4
igraph==0.10.8
ipykernel==6.25.2
ipython==8.15.0
ipython-genutils==0.2.0
ipywidgets==8.1.1
isoduration==20.11.0
jedi==0.19.0
Jinja2==3.1.2
joblib==1.3.2
json5==0.9.14
jsonpointer==2.4
jsonschema==4.19.1
jsonschema-specifications==2023.7.1
jupyter-contrib-core==0.4.2
jupyter-contrib-nbextensions==0.7.0
jupyter-events==0.7.0
jupyter-highlight-selected-word==0.2.0
jupyter-lsp==2.2.0
jupyter-nbextensions-configurator==0.6.3
jupyter_client==8.3.1
jupyter_core==5.3.2
jupyter_server==2.7.3
jupyter_server_terminals==0.4.4
jupyterlab==4.0.6
jupyterlab-pygments==0.2.2
jupyterlab-widgets==3.0.9
jupyterlab_server==2.25.0
kiwisolver==1.4.5
lit==17.0.1
llvmlite==0.41.0
loess==2.1.2
loompy==3.0.7
lxml==4.9.3
MarkupSafe==2.1.3
matplotlib==3.5.1
matplotlib-inline==0.1.6
mistune==3.0.1
mpmath==1.3.0
natsort==8.4.0
nbclient==0.8.0
nbconvert==7.8.0
nbformat==5.9.2
nest-asyncio==1.5.8
networkx==3.1
notebook==7.0.4
notebook_shim==0.2.3
numba==0.58.0
numpy==1.25.2
numpy-groupies==0.10.1
nvidia-cublas-cu11==11.10.3.66
nvidia-cuda-cupti-cu11==11.7.101
nvidia-cuda-nvrtc-cu11==11.7.99
nvidia-cuda-runtime-cu11==11.7.99
nvidia-cudnn-cu11==8.5.0.96
nvidia-cufft-cu11==10.9.0.58
nvidia-curand-cu11==10.2.10.91
nvidia-cusolver-cu11==11.4.0.1
nvidia-cusparse-cu11==11.7.4.91
nvidia-nccl-cu11==2.14.3
nvidia-nvtx-cu11==11.7.91
overrides==7.4.0
packaging==23.1
pandas==2.1.1
pandocfilters==1.5.0
parso==0.8.3
patsy==0.5.3
pexpect==4.8.0
pickleshare==0.7.5
Pillow==10.0.1
platformdirs==3.10.0
plotbin==3.1.5
prometheus-client==0.17.1
prompt-toolkit==3.0.39
psutil==5.9.5
ptyprocess==0.7.0
pure-eval==0.2.2
pycparser==2.21
Pygments==2.16.1
pynndescent==0.5.10
pyparsing==3.1.1
pyspark==3.5.0
python-dateutil==2.8.2
python-json-logger==2.0.7
pytz==2023.3.post1
PyYAML==6.0.1
pyzmq==25.1.1
referencing==0.30.2
requests==2.31.0
rfc3339-validator==0.1.4
rfc3986-validator==0.1.1
rpds-py==0.10.3
scanpy==1.9.5
scikit-learn==1.3.1
scipy==1.11.3
scvelo==0.2.5
seaborn==0.12.2
Send2Trash==1.8.2
session-info==1.0.0
six==1.16.0
sniffio==1.3.0
soupsieve==2.5
stack-data==0.6.2
statsmodels==0.14.0
stdlib-list==0.9.0
sympy==1.12
tbb==2021.10.0
tensorly==0.8.1
terminado==0.17.1
texttable==1.6.7
threadpoolctl==3.2.0
tinycss2==1.2.1
torch==2.0.1
tornado==6.3.3
tqdm==4.62.3
traitlets==5.10.1
triton==2.0.0
typing_extensions==4.8.0
tzdata==2023.3
umap-learn==0.5.4
uri-template==1.3.0
urllib3==2.0.5
velovae @ file:///work/02_data/VeloVAE
wcwidth==0.2.6
webcolors==1.13
webencodings==0.5.1
websocket-client==1.6.3
widgetsnbextension==4.0.9

[ 09:24:53 ] ➜  python --version
Python 3.11.5

I'm running:

import anndata as ad
import scvelo as scv
import velovae as vv

n_gene = 2000
vv.preprocess(adata, n_gene)

use_gpu = True
if use_gpu:
    import tensorly as tl
    tl.set_backend("pytorch")

vae = vv.VAE(adata, 
             tmax=20, 
             dim_z=5, 
             device='cuda:0')

config = {
}

vae.train(adata,
          config=config,
          plot=True, 
          cluster_key = "celltype",
          gene_plot = ["CD3E", "MRC1"],
          figure_path="/work/02_data/VeloVAE/figures/",
          embed='umap')

g-yichen commented 1 year ago

Hello, I tested VeloVAE on cpu(Intel Xeon Gold 6154, 4 nodes, 32 cores per node), spgpu(Nvidia A40) and gpu (Nvidia V100). Using GPUs should give you a 3-5x speed up. For example, for the pancreas dataset shown in the example notebook, CPU training took about 23 minutes, while for both spgpu and gpu training took about 5-6 minutes. The difference is quite clear even without using a time profiler.

It seems you might have a cuda issue. Could you provide more details?

rrydbirk commented 1 year ago

@g-yichen I'd be happy to provide more details, I'm just not sure what to provide :-)

You have my full pip freeze above and my notebook snippets. There's no warning about "GPU not found" which occurs on a non-GPU node. Using nvidia-smi, I can see GPU usage bounce up and down, but nothing overwhelming.

welch-lab / VeloVAE

GPU not utilized correctly, slower than CPU #3