yang-song / score_sde_pytorch

PyTorch implementation for Score-Based Generative Modeling through Stochastic Differential Equations (ICLR 2021, Oral)
https://arxiv.org/abs/2011.13456
Apache License 2.0

CUDA out of memory with batch size=2 (V100 32G) #14

Open JunMa11 opened 2 years ago

JunMa11 commented 2 years ago

Dear @yang-song ,

Thanks for the great work.

I keep running into an OOM error even after reducing the batch size to 2. This is the command that I run:

python main.py --config configs/vp/cifar10_ddpmpp.py --mode train --workdir ./workdir

and this is the error output:

I0317 15:24:53.050663 47465039521600 run_lib.py:126] Starting training loop at step 0.
terminate called after throwing an instance of 'c10::CUDAOutOfMemoryError'
  what():  CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 31.75 GiB total capacity; 1.00 GiB already allocated; 4.00 MiB free; 1.06 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Exception raised from malloc at /tmp/coulombc/pytorch_build_2021-11-09_14-57-01/avx2/python3.8/pytorch/c10/cuda/CUDACachingAllocator.cpp:513 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x55 (0x2b2c1d81f905 in /home/jma/codes/score/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x295bf (0x2b2c1d7c15bf in /home/jma/codes/score/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #2: <unknown function> + 0x2a2c5 (0x2b2c1d7c22c5 in /home/jma/codes/score/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x2a7d2 (0x2b2c1d7c27d2 in /home/jma/codes/score/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #4: THCStorage_resizeBytes(THCState*, c10::StorageImpl*, long) + 0x84 (0x2b2c047bb894 in /home/jma/codes/score/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #5: <unknown function> + 0x1c9d961 (0x2b2c03178961 in /home/jma/codes/score/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #6: at::native::empty_strided_cuda(c10::ArrayRef<long>, c10::ArrayRef<long>, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>) + 0x66 (0x2b2c04506346 in /home/jma/codes/score/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0x3176efa (0x2b2c04651efa in /home/jma/codes/score/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #8: <unknown function> + 0x3176f70 (0x2b2c04651f70 in /home/jma/codes/score/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #9: <unknown function> + 0x1e56e88 (0x2b2bf8d78e88 in /home/jma/codes/score/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)

Fatal Python error: Aborted

How can I train the model on a single GPU (NVIDIA V100, 32 GB)?

Best regards, Jun
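
A note on the error message itself: the max_split_size_mb hint targets allocator fragmentation, but the log shows only about 1 GiB allocated/reserved by PyTorch on a 31.75 GiB card with 4 MiB free, so most of the memory is being held by something other than PyTorch (see the TensorFlow fix in the comments below). If you still want to try the allocator knob, it is set through the PYTORCH_CUDA_ALLOC_CONF environment variable; the placement and the value 128 below are just an example, not something from this repo:

import os

# Must be set before PyTorch initializes its CUDA caching allocator,
# e.g. at the very top of main.py, before anything imports torch.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"  # example value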

Newbeeer commented 2 years ago

I'm facing the same issue here. Adding the following code solved the problem for me:

import tensorflow as tf

# Stop TensorFlow from preallocating the whole GPU so that PyTorch has memory left to work with.
gpus = tf.config.list_physical_devices('GPU')
if gpus:
  try:
    # Currently, memory growth needs to be the same across GPUs
    for gpu in gpus:
      tf.config.experimental.set_memory_growth(gpu, True)
    logical_gpus = tf.config.list_logical_devices('GPU')
    print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
  except RuntimeError as e:
    # Memory growth must be set before GPUs have been initialized
    print(e)
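
For context: by default TensorFlow maps essentially all free GPU memory as soon as it touches a GPU, and this repo pulls in TensorFlow for the tensorflow_datasets input pipeline, so TF grabs the V100 before the PyTorch model allocates anything; set_memory_growth makes TF allocate incrementally instead. If TensorFlow is only needed for data loading in your run, another option that should work (untested here, my own suggestion rather than something from the repo) is to hide the GPUs from TensorFlow entirely; note that TF-based evaluation, e.g. the Inception/FID computation, would then fall back to CPU:

import tensorflow as tf

# Keep TensorFlow off the GPU altogether; tf.data pipelines run fine on CPU.
# Must be called before TensorFlow initializes the GPUs.
tf.config.set_visible_devices([], 'GPU')
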
DveloperY0115 commented 2 years ago

> I'm facing the same issue here. Adding the following code solved the problem for me: [...]

Thanks, it worked for me. FYI, I'm using 4 RTX 3090s (24 GB each) with batch size 28 (7 samples per GPU) and 128 x 128 images.
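
For anyone who would rather not edit the code: TensorFlow also honors the TF_FORCE_GPU_ALLOW_GROWTH environment variable, which should have the same effect as the snippet above as long as it is set before TensorFlow initializes the GPU, e.g. exported in the shell before launching main.py. A minimal sketch of setting it from Python (setting it before TensorFlow touches the GPU is the important part):

import os

# Equivalent to calling set_memory_growth on every GPU; must be set before
# TensorFlow is imported / the GPUs are initialized.
os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"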