zyhhh123 closed this issue 1 year ago
@zyhhh123 Hi, for all experiments I used a 48 GB GPU, and batch_size=8 ran without any out-of-memory error. Could you please check how many timesteps you are using for fine-tuning? https://github.com/wgcban/ddpm-cd/blob/b0213c0049bab215e470326d97499ae69416a843/config/levir.json#L63
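As a rough illustration of why the timestep count matters: the traceback later in this thread shows features being extracted per timestep through `diffusion.get_feats(t=t)`, so if the features for every listed timestep are held at once, feature memory would grow roughly linearly with the length of that list. The sketch below trims such a list in a config dictionary. The key names `model_cd` and `t`, the helper name `trim_timesteps`, and the example values are all assumptions for illustration; check them against the linked line in `config/levir.json`.

```python
import copy


def trim_timesteps(config, max_steps):
    """Return a copy of config with at most max_steps timesteps kept.

    Assumes (hypothetically) that the timesteps used for feature
    extraction live at config["model_cd"]["t"].
    """
    if max_steps < 1:
        raise ValueError("max_steps must be at least 1")
    trimmed = copy.deepcopy(config)  # leave the caller's config untouched
    trimmed["model_cd"]["t"] = list(config["model_cd"]["t"])[:max_steps]
    return trimmed


# Illustrative values only; not confirmed against the repository.
example_config = {"model_cd": {"t": [50, 100, 400]}}
smaller = trim_timesteps(example_config, max_steps=1)
print(smaller["model_cd"]["t"])
```

Fewer timesteps means fewer feature maps fed to the change detection head, which may trade some accuracy for memory.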
Hello, I have the same issue. How did you solve this problem?
When fine-tuning, I wonder whether the code for the change detection dataset is incomplete. When training on LEVIR-CD, even after reducing the images to 256 and setting the batch size to 1, a single 24 GB card still runs out of memory. Is it because the DDPM UNet is too heavy? In particular, after the message below was printed, memory usage held at about 15 GB, and then the out-of-memory error occurred.
l_cd: 2.1025e-01 running_acc: 4.8673e-01 epoch_acc: 4.8607e-01 acc: 9.4579e-01 miou: 4.7289e-01 mf1: 4.8607e-01 iou_0: 9.4579e-01 iou_1: 0.0000e+00 F1_0: 9.7214e-01 F1_1: 0.0000e+00 precision_0: 9.4781e-01 precision_1: 0.0000e+00 recall_0: 9.9775e-01 recall_1: 0.0000e+00
Creating [train] change-detection dataloader.
23-02-08 01:27:18.810 - INFO: Dataset [CDDataset - LEVIR-CD-256 - train] is created.
Creating [val] change-detection dataloader.
23-02-08 01:27:18.814 - INFO: Dataset [CDDataset - LEVIR-CD-256 - val] is created.
23-02-08 01:27:18.814 - INFO: Initial Dataset Finished
23-02-08 01:27:24.129 - INFO: Initialization method [orthogonal]
23-02-08 01:27:35.712 - INFO: Loading pretrained model for G [/root/autodl-tmp/diffusion-model-I190000_E97] ...
23-02-08 01:27:41.615 - INFO: Model [DDPM] is created.
23-02-08 01:27:41.616 - INFO: Initial Diffusion Model Finished
23-02-08 01:27:42.054 - INFO: Initialization method [orthogonal]
23-02-08 01:27:43.104 - INFO: Cd Model [CD] is created.
23-02-08 01:27:43.105 - INFO: lr: 0.0001000
23-02-08 01:27:55.545 - INFO: [Training CD]. epoch: [0/119]. Itter: [0/445], CD_loss: 0.80819, running_mf1: 0.04829
23-02-08 01:30:22.239 - INFO: [Training CD (epoch summary)]: epoch: [0/119]. epoch_mF1=0.48607
l_cd: 2.1025e-01 running_acc: 4.8673e-01 epoch_acc: 4.8607e-01 acc: 9.4579e-01 miou: 4.7289e-01 mf1: 4.8607e-01 iou_0: 9.4579e-01 iou_1: 0.0000e+00 F1_0: 9.7214e-01 F1_1: 0.0000e+00 precision_0: 9.4781e-01 precision_1: 0.0000e+00 recall_0: 9.9775e-01 recall_1: 0.0000e+00
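The epoch-0 summary metrics can be cross-checked by hand, which also shows what they mean: the change class (index 1) has zero IoU, precision, and recall, so after the first epoch the model is predicting "no change" almost everywhere. The sketch below recomputes `F1_0`, `miou`, and `mf1` from the logged per-class values. It assumes the usual definitions (F1 as the harmonic mean of precision and recall, and the mean metrics as unweighted class averages); these definitions are an assumption, not read from the repository's code.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Per-class values copied from the epoch-0 log line (class 0, class 1).
precision = [9.4781e-01, 0.0]
recall = [9.9775e-01, 0.0]
iou = [9.4579e-01, 0.0]

f1 = [f1_score(p, r) for p, r in zip(precision, recall)]
mean_f1 = sum(f1) / len(f1)     # unweighted average over the two classes
mean_iou = sum(iou) / len(iou)

print(f"F1_0={f1[0]:.4f} mf1={mean_f1:.4f} miou={mean_iou:.4f}")
```

Under these assumed definitions the recomputed values agree with the logged `F1_0`, `mf1`, and `miou` to about four decimal places, so the low mean scores come entirely from the empty change class rather than from the memory problem.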
Traceback (most recent call last):
File "/home/ddpm-cd/ddpm_cd.py", line 225, in <module>
fe_A_t, fd_A_t, fe_B_t, fd_B_t = diffusion.get_feats(t=t) #np.random.randint(low=2, high=8)
File "/home/ddpm-cd/model/model.py", line 91, in get_feats
fe_B, fd_B = self.netG.feats(B, t)
File "/root/miniconda3/envs/ddpm-cd/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/home/ddpm-cd/model/sr3_modules/diffusion.py", line 269, in feats
fe, fd = self.denoise_fn(x_noisy, continuous_sqrt_alpha_cumprod, feat_need=True)
File "/root/miniconda3/envs/ddpm-cd/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ddpm-cd/model/sr3_modules/unet.py", line 271, in forward
x = layer(torch.cat((x, feats.pop()), dim=1), t)
File "/root/miniconda3/envs/ddpm-cd/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ddpm-cd/model/sr3_modules/unet.py", line 155, in forward
x = self.res_block(x, time_emb)
File "/root/miniconda3/envs/ddpm-cd/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ddpm-cd/model/sr3_modules/unet.py", line 107, in forward
h = self.block1(x)
File "/root/miniconda3/envs/ddpm-cd/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ddpm-cd/model/sr3_modules/unet.py", line 91, in forward
return self.block(x)
File "/root/miniconda3/envs/ddpm-cd/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/root/miniconda3/envs/ddpm-cd/lib/python3.9/site-packages/torch/nn/modules/container.py", line 204, in forward
input = module(input)
File "/root/miniconda3/envs/ddpm-cd/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ddpm-cd/model/sr3_modules/unet.py", line 55, in forward
return x * torch.sigmoid(x)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.50 GiB (GPU 0; 23.70 GiB total capacity; 19.31 GiB already allocated; 384.56 MiB free; 22.93 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
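The error message itself carries enough numbers to check the fragmentation hint. Here is a minimal sketch, using only the standard library, that pulls the figures out of the message and reports how much memory PyTorch has reserved but not allocated. The helper name `parse_oom_message` is hypothetical, and the regular expressions assume the message format shown above.

```python
import re

_TO_GIB = {"MiB": 1 / 1024, "GiB": 1.0}  # convert every size to GiB


def parse_oom_message(message):
    """Extract the sizes (in GiB) reported by a CUDA out-of-memory message."""
    patterns = {
        "requested": r"Tried to allocate ([\d.]+) (MiB|GiB)",
        "total": r"([\d.]+) (MiB|GiB) total capacity",
        "allocated": r"([\d.]+) (MiB|GiB) already allocated",
        "free": r"([\d.]+) (MiB|GiB) free",
        "reserved": r"([\d.]+) (MiB|GiB) reserved in total",
    }
    sizes = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, message)
        if match is None:
            raise ValueError(f"could not find '{name}' in message")
        sizes[name] = float(match.group(1)) * _TO_GIB[match.group(2)]
    return sizes


message = (
    "CUDA out of memory. Tried to allocate 1.50 GiB (GPU 0; 23.70 GiB total "
    "capacity; 19.31 GiB already allocated; 384.56 MiB free; 22.93 GiB "
    "reserved in total by PyTorch)"
)
sizes = parse_oom_message(message)
cached_unused = sizes["reserved"] - sizes["allocated"]
print(f"reserved but unallocated: {cached_unused:.2f} GiB")
print(f"request fits in free memory: {sizes['requested'] <= sizes['free']}")
```

The gap between reserved and allocated memory here is larger than the 1.50 GiB request, so fragmentation may be contributing, and the message's suggestion of setting `max_split_size_mb` through the `PYTORCH_CUDA_ALLOC_CONF` environment variable before launching could be worth trying (the value is a tuning choice). With 19.31 GiB genuinely allocated on a 23.70 GiB card, though, the run is close to the limit regardless, so reducing the actual demand is likely still needed.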