While training the code on a 4-GPU system, memory utilization suddenly exploded after 5 epochs, killing the process. I was training on a university HPC with the following specification:
24 cores
128 GB RAM
4× Nvidia Quadro RTX 8000 GPUs
You could try reducing the batch size. It rarely happens, but a single batch can contain input images that are both very tall and very wide; when GPU memory is already ~99% occupied, such a batch can push it over the limit and trigger an out-of-memory error.
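A minimal sketch of what that change looks like, assuming a standard PyTorch training loop (the dataset shapes, batch size values, and epoch count below are illustrative, not taken from your code). Logging peak GPU memory per epoch can also help confirm whether usage really grows over time:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy stand-in for the real dataset; shapes are illustrative only.
train_dataset = TensorDataset(
    torch.randn(256, 3, 224, 224),
    torch.randint(0, 10, (256,)),
)

# Reducing batch_size (e.g. from 16 to 8) roughly halves
# peak activation memory per training step.
train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True)

if torch.cuda.is_available():
    for epoch in range(5):
        for images, labels in train_loader:
            images = images.cuda(non_blocking=True)
            # ... forward / backward / optimizer step go here ...
        # Report the peak memory seen this epoch, then reset the counter,
        # so a gradual climb across epochs becomes visible before it OOMs.
        peak_gb = torch.cuda.max_memory_allocated() / 1e9
        print(f"epoch {epoch}: {peak_gb:.2f} GB peak")
        torch.cuda.reset_peak_memory_stats()
```

If the peak stays flat for the first epochs and then jumps, the cause is likely an unusually large batch as described above; if it climbs steadily every epoch, you may instead be accumulating tensors (e.g. keeping losses with their graphs) and should look for a leak in the loop.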