wgcban / ddpm-cd

Remote Sensing Change Detection using Denoising Diffusion Probabilistic Models
https://www.wgcban.com/research#h.ar24vwqlm021
MIT License

OutOfMemoryError #14

Closed zyhhh123 closed 1 year ago

zyhhh123 commented 1 year ago

When fine-tuning, I'm wondering whether the change-detection training code is fully complete. Training on LEVIR-CD, even after reducing the images to 256×256 and setting the batch size to 1, a single 24 GB GPU still runs out of memory. Is it because the U-Net of the DDPM has too many parameters? In particular, after the log below was printed, memory usage held at about 15 GB and then overflowed.


Creating [train] change-detection dataloader.
23-02-08 01:27:18.810 - INFO: Dataset [CDDataset - LEVIR-CD-256 - train] is created.
Creating [val] change-detection dataloader.
23-02-08 01:27:18.814 - INFO: Dataset [CDDataset - LEVIR-CD-256 - val] is created.
23-02-08 01:27:18.814 - INFO: Initial Dataset Finished
23-02-08 01:27:24.129 - INFO: Initialization method [orthogonal]
23-02-08 01:27:35.712 - INFO: Loading pretrained model for G [/root/autodl-tmp/diffusion-model-I190000_E97] ...
23-02-08 01:27:41.615 - INFO: Model [DDPM] is created.
23-02-08 01:27:41.616 - INFO: Initial Diffusion Model Finished
23-02-08 01:27:42.054 - INFO: Initialization method [orthogonal]
23-02-08 01:27:43.104 - INFO: Cd Model [CD] is created.
23-02-08 01:27:43.105 - INFO: lr: 0.0001000

23-02-08 01:27:55.545 - INFO: [Training CD]. epoch: [0/119]. Itter: [0/445], CD_loss: 0.80819, running_mf1: 0.04829

23-02-08 01:30:22.239 - INFO: [Training CD (epoch summary)]: epoch: [0/119]. epoch_mF1=0.48607 l_cd: 2.1025e-01 running_acc: 4.8673e-01 epoch_acc: 4.8607e-01 acc: 9.4579e-01 miou: 4.7289e-01 mf1: 4.8607e-01 iou_0: 9.4579e-01 iou_1: 0.0000e+00 F1_0: 9.7214e-01 F1_1: 0.0000e+00 precision_0: 9.4781e-01 precision_1: 0.0000e+00 recall_0: 9.9775e-01 recall_1: 0.0000e+00

Traceback (most recent call last):
  File "/home/ddpm-cd/ddpm_cd.py", line 225, in <module>
    fe_A_t, fd_A_t, fe_B_t, fd_B_t = diffusion.get_feats(t=t) #np.random.randint(low=2, high=8)
  File "/home/ddpm-cd/model/model.py", line 91, in get_feats
    fe_B, fd_B = self.netG.feats(B, t)
  File "/root/miniconda3/envs/ddpm-cd/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/ddpm-cd/model/sr3_modules/diffusion.py", line 269, in feats
    fe, fd = self.denoise_fn(x_noisy, continuous_sqrt_alpha_cumprod, feat_need=True)
  File "/root/miniconda3/envs/ddpm-cd/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ddpm-cd/model/sr3_modules/unet.py", line 271, in forward
    x = layer(torch.cat((x, feats.pop()), dim=1), t)
  File "/root/miniconda3/envs/ddpm-cd/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ddpm-cd/model/sr3_modules/unet.py", line 155, in forward
    x = self.res_block(x, time_emb)
  File "/root/miniconda3/envs/ddpm-cd/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ddpm-cd/model/sr3_modules/unet.py", line 107, in forward
    h = self.block1(x)
  File "/root/miniconda3/envs/ddpm-cd/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ddpm-cd/model/sr3_modules/unet.py", line 91, in forward
    return self.block(x)
  File "/root/miniconda3/envs/ddpm-cd/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/miniconda3/envs/ddpm-cd/lib/python3.9/site-packages/torch/nn/modules/container.py", line 204, in forward
    input = module(input)
  File "/root/miniconda3/envs/ddpm-cd/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ddpm-cd/model/sr3_modules/unet.py", line 55, in forward
    return x * torch.sigmoid(x)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.50 GiB (GPU 0; 23.70 GiB total capacity; 19.31 GiB already allocated; 384.56 MiB free; 22.93 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
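The error message itself suggests trying max_split_size_mb via PYTORCH_CUDA_ALLOC_CONF. Below is a minimal sketch of acting on that hint and of logging allocation around the call that fails; the 128 MiB value and the placement around get_feats are only illustrative, not part of the repo:

```python
# Sketch only: apply the allocator hint from the error message before any CUDA
# tensor is created, and log allocation around the feature-extraction call.
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"  # 128 is a guess; tune it
# (equivalently: export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128 before launching)

import torch

def report(tag: str) -> None:
    # torch.cuda.memory_allocated / max_memory_allocated are standard PyTorch APIs
    print(f"{tag}: allocated={torch.cuda.memory_allocated() / 2**30:.2f} GiB, "
          f"peak={torch.cuda.max_memory_allocated() / 2**30:.2f} GiB")

# Around the call that fails in the traceback (variable names come from ddpm_cd.py):
# report("before get_feats")
# fe_A_t, fd_A_t, fe_B_t, fd_B_t = diffusion.get_feats(t=t)
# report("after get_feats")
```

This only reduces fragmentation and shows where the peak occurs; it does not shrink the feature maps themselves.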

wgcban commented 1 year ago

@zyhhh123 Hi, for all experiments, I used a 48GB GPU and batch_size=8 works well without any out-of-memory error. Can you please check the number of timesteps you are using for the fine-tuning? https://github.com/wgcban/ddpm-cd/blob/b0213c0049bab215e470326d97499ae69416a843/config/levir.json#L63
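For anyone hitting the same limit: memory grows with the number of timesteps configured for fine-tuning, since get_feats returns a full set of U-Net feature maps for both image A and image B at each timestep, and all of them stay on the GPU while the CD head trains. A rough sketch of that scaling is below; it assumes the loop in ddpm_cd.py iterates over the configured timestep list, and the helper name and the exact config key are assumptions (see the linked config line for the real name):

```python
# Illustrative sketch only: one set of encoder/decoder feature maps is extracted
# for both images at every timestep, so peak GPU memory grows roughly linearly
# with the number of timesteps in the fine-tuning config.
def collect_feats(diffusion, timesteps):
    # `timesteps` would be the list configured for fine-tuning
    # (key name in config/levir.json assumed; check the linked line).
    feats_A, feats_B = [], []
    for t in timesteps:
        fe_A, fd_A, fe_B, fd_B = diffusion.get_feats(t=t)  # same call as in the traceback above
        feats_A.append(fd_A)  # decoder features kept per timestep
        feats_B.append(fd_B)
    return feats_A, feats_B
```

Under that assumption, trimming the timestep list is the most direct way to fit fine-tuning on a 24 GB card.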

TranquilChan commented 3 months ago

Hello, I have the same issue. How did you solve this problem?