yuval-alaluf / stylegan3-editing

Official Implementation of "Third Time's the Charm? Image and Video Editing with StyleGAN3" (AIM ECCVW 2022) https://arxiv.org/abs/2201.13433
https://yuval-alaluf.github.io/stylegan3-editing/
MIT License
654 stars, 73 forks

GPU RAM grows to OOM during inference #37

Closed: ikcla closed this issue 2 years ago

ikcla commented 2 years ago

I tried to run inversion/video/inference_on_video.py on a test video. On a GPU with 11 GB of memory, the process gets through up to 599 frames and then runs out of GPU memory. Looking at the top level of the code, it processes one frame at a time, so I wonder why GPU memory keeps accumulating during the for loop. I tried using del, gc.collect(), and torch.cuda.empty_cache() inside the for loop, but it does not help. Do you have any pointers for this issue? Thanks.

yuval-alaluf commented 2 years ago

The memory keeps growing because we store the results in the following line: https://github.com/yuval-alaluf/stylegan3-editing/blob/ab01a5d90b8ba67e0da0e1388f0931482601006c/inversion/video/inference_on_video.py#L121

One option is to change the following lines: https://github.com/yuval-alaluf/stylegan3-editing/blob/ab01a5d90b8ba67e0da0e1388f0931482601006c/inversion/video/inference_on_video.py#L142-L143

You could try using latents[0][-1].cpu() and result_batch[0][-1] there. You may need to move them back to the GPU later, but this way it should be able to run without accumulating more memory.
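A minimal sketch of the idea, not the repository's actual code: each per-frame output is copied to host memory before it is stored, so the list of results no longer holds references to GPU tensors. `ToyInverter`, `run_inference`, and the dictionary keys are hypothetical stand-ins for the real encoder/generator and result containers, and the script falls back to the CPU when no GPU is available.

```python
import torch


class ToyInverter(torch.nn.Module):
    """Hypothetical stand-in for the real encoder/generator pair."""

    def __init__(self):
        super().__init__()
        self.encode = torch.nn.Linear(3 * 8 * 8, 16)
        self.decode = torch.nn.Linear(16, 3 * 8 * 8)

    def forward(self, x):
        latent = self.encode(x.flatten(1))
        image = self.decode(latent).view(-1, 3, 8, 8)
        return image, latent


def run_inference(frames, model, device):
    """Process frames one at a time, storing only CPU copies of the outputs."""
    results = {"result_images": [], "result_latents": []}
    with torch.no_grad():  # inference only, so no autograd graph is kept
        for frame in frames:
            image, latent = model(frame.unsqueeze(0).to(device))
            # .cpu() copies the data to host memory. The GPU tensors are
            # rebound on the next iteration, so their memory can be reused.
            results["result_images"].append(image[0].cpu())
            results["result_latents"].append(latent[0].cpu())
    return results


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = ToyInverter().to(device).eval()
frames = [torch.rand(3, 8, 8) for _ in range(5)]
results = run_inference(frames, model, device)
```

With this pattern GPU usage should stay roughly constant per frame, while host memory still grows with the number of frames stored.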

Hope this helps!

ikcla commented 2 years ago

@yuval-alaluf Thank you for your help. I profiled the code and found that GPU memory grows mainly because of the result images rather than the latents, since an image stored as a tensor is much larger than a latent. The code above keeps the result images as GPU tensors referenced from a dictionary, which is a problem for my low-memory GPU. I had to move them to the CPU and save them to disk in order to process a 5-minute video sequentially. Profiling also showed that everything else is kept in CPU memory (aligned images, cropped images, and result images), which eventually causes an OOM on the CPU side as well, since Image objects are large (my system has 128 GB of RAM but cannot handle a 5-minute video). I had to save these to disk and lazily load them, one iteration at a time.
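The save-to-disk and lazy-load pattern described above might look roughly like the sketch below. It is an illustration under assumptions, not the code I actually ran: `iter_frames`, `process_to_disk`, and the file layout are hypothetical, and `pickle` with small lists stands in for real image I/O so that the example runs with the standard library only.

```python
import pickle
import tempfile
from pathlib import Path


def iter_frames(frame_paths):
    """Lazily yield (name, frame) pairs so only one frame is loaded at a time."""
    for path in frame_paths:
        with open(path, "rb") as f:
            yield path.stem, pickle.load(f)


def process_to_disk(frame_paths, process_fn, out_dir):
    """Process each frame and write its result to disk right away."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for name, frame in iter_frames(frame_paths):
        result = process_fn(frame)
        out_path = out_dir / f"{name}.pkl"
        with open(out_path, "wb") as f:
            pickle.dump(result, f)
        # Only the output path is kept; `frame` and `result` are rebound
        # on the next iteration, so memory use does not grow with video length.
        written.append(out_path)
    return written


# Demo: write four dummy "frames", then process them one by one.
work_dir = Path(tempfile.mkdtemp())
in_dir = work_dir / "frames"
in_dir.mkdir()
frame_paths = []
for i in range(4):
    path = in_dir / f"frame_{i:04d}.pkl"
    with open(path, "wb") as f:
        pickle.dump([i] * 3, f)
    frame_paths.append(path)

written = process_to_disk(
    frame_paths, lambda frame: [v * 2 for v in frame], work_dir / "results"
)
```

Later stages can then iterate over the returned paths and load each result only when it is needed.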

You did great work, and I learned a lot from it. Do you have any experience with the FFHQ-U pretrained model in this experiment?