Refacing operation does not seem to fully use GPU

ooofest commented 1 year ago

Hello, I tried the GPU installation instructions on Windows, but within a venv. It complained about the lack of torch libraries, so I reused the roop repo's GPU installation steps for refacer (not all items might have been necessary for refacer):

install cuda 11.7 (https://developer.nvidia.com/cuda-11-7-0-download-archive)
download cudnn 8.9.1 for cuda 11.x https://developer.nvidia.com/rdp/cudnn-archive
unpack cudnn over C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.7 with replacement
install python 3.10.x (any 3.10)
download the latest version of refacer
pip install virtualenv
virtualenv venv
venv\scripts\activate.bat
pip install torch torchvision torchaudio --force-reinstall --index-url https://download.pytorch.org/whl/cu117
pip install -r requirements-GPU.txt
Add ...\refacer-main\venv\Lib\site-packages\torch\lib to PATH (because refacer complained that one of the libraries was not visible during run time)
SET CUDA_VISIBLE_DEVICES=1 (optional, uses my second GPU)
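
A quick way to check that the steps above left a CUDA-capable setup inside the venv is to ask both runtimes what they can see. This is only a minimal sketch: `gpu_report` is a hypothetical helper name, and it assumes refacer's GPU path goes through PyTorch and/or ONNX Runtime being installed in the active environment.

```python
import importlib
import importlib.util


def _try_import(name):
    """Import a module by name, returning None if it is missing or fails to load."""
    if importlib.util.find_spec(name) is None:
        return None
    try:
        return importlib.import_module(name)
    except (ImportError, OSError):
        # e.g. a DLL that is not on PATH at import time
        return None


def gpu_report():
    """Report what PyTorch and ONNX Runtime can see; None means 'not importable'."""
    report = {"torch_cuda": None, "onnx_providers": None}

    torch = _try_import("torch")
    if torch is not None:
        report["torch_cuda"] = torch.cuda.is_available()

    ort = _try_import("onnxruntime")
    if ort is not None:
        report["onnx_providers"] = ort.get_available_providers()

    return report


if __name__ == "__main__":
    for key, value in gpu_report().items():
        print(f"{key}: {value}")
```

If `onnx_providers` does not include `CUDAExecutionProvider`, ONNX Runtime has no CUDA backend available in that environment and would run on the CPU.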

Refacer initially starts with an aggressive speed and estimate (6 minutes at 54it/s), but seconds later drops to a far slower speed (1 hour at 5.2it/s). By comparison, the CPU-only refacer configuration almost triples the execution time.

By contrast, the similar roop codebase - when configured for GPU use - currently requires less than 15 minutes from start to finish.

I noticed that GPU monitors from ASUS and the one built into Windows 10 show a maximum of ~35% utilization with low VRAM allocation while refacer is processing a video. Those same monitors show >90% GPU utilization and far higher memory usage while roop is processing the same video.
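
For anyone who wants numbers rather than a monitor graph, the same readings can be sampled from `nvidia-smi` while a job runs. This is a rough sketch, assuming `nvidia-smi` is on PATH; `parse_smi_line` and `sample_gpus` are hypothetical helper names.

```python
import shutil
import subprocess

# Fields requested from nvidia-smi, in this order
QUERY = "utilization.gpu,memory.used,memory.total"


def parse_smi_line(line):
    """Parse one CSV line such as '35, 1200, 12282' into a dict.

    Raises ValueError if a field is not numeric (some GPUs report '[N/A]').
    """
    util, used, total = (int(field.strip()) for field in line.split(","))
    return {"util_pct": util, "mem_used_mib": used, "mem_total_mib": total}


def sample_gpus():
    """Return one reading per GPU, or an empty list if nvidia-smi is not found."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return []
    out = subprocess.run(
        [exe, f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [parse_smi_line(line) for line in out.splitlines() if line.strip()]


if __name__ == "__main__":
    for index, reading in enumerate(sample_gpus()):
        print(index, reading)
```

Calling `sample_gpus()` once a second during a refacer run and again during a roop run would give comparable figures for the two tools.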

I'm not sure whether the refacer code is intended to be less taxing on the GPU because of its unique processing logic, whether this is a symptom of using a venv, or whether it's a bug, but I wanted to report it in case it offers a helpful user perspective.

xaviviro commented 1 year ago

First off, I want to thank you for the detailed information you've provided. I've initially focused on the functionality, and now I'm moving towards performance-related issues, like enhancing GPU utilization and parallel processing. I'm also aiming to address GPU usage on OSX CoreML. I apologize for any inconvenience. As of now, I'm the sole contributor and working on this in my spare time. Thanks for your understanding and patience!

In addition, I should point out that it's unlikely Refacer will be able to match Roop's speed. Roop doesn't do face comparisons, while Refacer does, which is why Refacer allows for the selection of which face to replace, one or many. Furthermore, please keep in mind that the processing time increases as the number of faces that need to be compared increases.
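
To illustrate why comparison cost grows with the number of faces, here is a small sketch of matching detected faces against the faces selected for replacement. It is not refacer's actual code: it assumes faces are compared as embedding vectors by cosine similarity, and the 0.5 threshold and the function names are made up for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def match_faces(detected, targets, threshold=0.5):
    """For each detected embedding, return the index of the best target at or
    above the threshold (None if no target qualifies), plus the number of
    comparisons performed."""
    matches = []
    comparisons = 0
    for embedding in detected:
        best_index, best_score = None, threshold
        for index, target in enumerate(targets):
            comparisons += 1
            score = cosine_similarity(embedding, target)
            if score >= best_score:
                best_index, best_score = index, score
        matches.append(best_index)
    return matches, comparisons


if __name__ == "__main__":
    detected = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
    targets = [[1.0, 0.0], [0.0, 1.0]]
    print(match_faces(detected, targets))
```

Every frame pays (faces detected in the frame) x (faces to compare against) similarity checks, which a swap-every-face tool skips entirely.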

ooofest commented 1 year ago

Thanks for your helpful reply!

Yes, I figured that Refacer's unique logic to detect faces - which has been working very well for me, thus far - could add cycles to the processing.

With that in mind, perhaps GPU multithreading might be an avenue to consider?
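
As a sketch of what frame-level multithreading could look like (the traceback later in this thread shows refacer calling `executor.map` over frames), here is a generic thread-pool pattern. `process_frame` is a hypothetical stand-in for the real per-frame work, and the worker count is an arbitrary example value.

```python
from concurrent.futures import ThreadPoolExecutor


def process_frame(frame):
    """Hypothetical stand-in for per-frame work (detect, compare, swap)."""
    return frame * 2


def process_all(frames, max_workers=4):
    """Process frames on a thread pool and return results in input order."""
    # executor.map yields results in the order of the inputs,
    # so the output frames stay in sequence even if threads finish out of order.
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(process_frame, frames))


if __name__ == "__main__":
    print(process_all([1, 2, 3]))
```

Threads only help when the heavy calls release Python's GIL, which native inference libraries typically do. Each in-flight frame also holds its own GPU buffers, so raising `max_workers` trades VRAM for throughput.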

Thanks for this repo. The ability to specify a particular face for swapping, while keeping the resulting file size reasonable, makes this valuable and a good complement to Roop.

xaviviro commented 1 year ago

Wow, @ooofest I just tried it on Google Colab with the latest update and it gives me over 8it/s. If you want to try on Colab:

xaviviro commented 1 year ago

The only thing left for me to add is NVIDIA acceleration to the final ffmpeg process. Stay tuned for updates on that. Thank you for your patience and feedback!
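
For readers wondering what NVIDIA acceleration in an ffmpeg step can look like, here is a sketch that builds a merge command using the `h264_nvenc` encoder. How refacer's own final ffmpeg step is structured is not shown in this thread, so the stream mapping, file names, and the `nvenc_merge_command` helper are all assumptions, and it requires an ffmpeg build with NVENC support.

```python
import shlex


def nvenc_merge_command(video_in, audio_in, out_path, encoder="h264_nvenc"):
    """Build an ffmpeg command that encodes the video stream of video_in with
    NVENC and copies the audio stream of audio_in (if it has one)."""
    return [
        "ffmpeg", "-y",        # overwrite the output without asking
        "-i", video_in,        # input 0: processed (silent) video
        "-i", audio_in,        # input 1: original file carrying the audio
        "-map", "0:v:0",       # first video stream of input 0
        "-map", "1:a:0?",      # first audio stream of input 1, optional
        "-c:v", encoder,       # GPU encoder instead of a CPU one
        "-c:a", "copy",        # keep the audio as-is
        "-shortest",
        out_path,
    ]


if __name__ == "__main__":
    # File names below are placeholders
    cmd = nvenc_merge_command("refaced.mp4", "original.mp4", "output.mp4")
    print(shlex.join(cmd))
```

The list form can be passed straight to `subprocess.run(cmd, check=True)`.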

ooofest commented 1 year ago

> The only thing left for me to add is NVIDIA acceleration to the final ffmpeg process. Stay tuned for updates on that. Thank you for your patience and feedback!

It is much faster in overall processing now! The speed increase is tremendous . . . here is a quick example:

To create a public link, set share=True in launch().
Total frames: 13966
Extracting frames: 100%|██████████████████████████████████████████████████████▉| 13965/13966 [00:04<00:00, 2984.45it/s]
Processing frames: 100%|█████████████████████████████████████████████████████████| 13965/13965 [08:04<00:00, 28.85it/s]
Merging audio with the refaced video...
The process has finished.
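
The figures in that log are self-consistent, which a couple of lines of arithmetic can confirm: elapsed time is simply frames divided by rate. The helper names below are hypothetical, and the 5.2 it/s comparison assumes the same frame count, which the thread does not state.

```python
def eta_seconds(total_frames, frames_per_second):
    """Time needed to process total_frames at a steady rate."""
    return total_frames / frames_per_second


def fmt_mmss(seconds):
    """Format a duration as MM:SS, truncating fractional seconds."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


if __name__ == "__main__":
    # 13965 frames at 28.85 it/s, as reported in the log
    print(fmt_mmss(eta_seconds(13965, 28.85)))  # 08:04
    # The same frame count at the earlier 5.2 it/s rate
    print(fmt_mmss(eta_seconds(13965, 5.2)))    # 44:45
```

At the earlier rate the same clip would have taken roughly five and a half times longer.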

I did notice, though, that when the video file is rather large (e.g., > 700MB), there can be a timeout error in Gradio, usually after the frame extraction step.

Also, some videos trigger a memory allocation error, and I am still experimenting to see what type of input causes it:

To create a public link, set share=True in launch().
Total frames: 4096
Extracting frames: 100%|██████████████████████████████████████████████████████████| 4096/4096 [00:14<00:00, 280.75it/s]
Processing frames: 0%| | 8/4096 [00:00<04:38, 14.67it/s]
Traceback (most recent call last):
  File "D:\refacer-main\venv\lib\site-packages\gradio\routes.py", line 427, in run_predict
    output = await app.get_blocks().process_api(
  File "D:\refacer-main\venv\lib\site-packages\gradio\blocks.py", line 1323, in process_api
    result = await self.call_function(
  File "D:\refacer-main\venv\lib\site-packages\gradio\blocks.py", line 1051, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "D:\refacer-main\venv\lib\site-packages\anyio\to_thread.py", line 33, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "D:\refacer-main\venv\lib\site-packages\anyio\_backends\_asyncio.py", line 877, in run_sync_in_worker_thread
    return await future
  File "D:\refacer-main\venv\lib\site-packages\anyio\_backends\_asyncio.py", line 807, in run
    result = context.run(func, *args)
  File "D:\refacer-main\app.py", line 30, in run
    return refacer.reface(video_path,faces)
  File "D:\refacer-main\refacer.py", line 184, in reface
    results = list(tqdm(executor.map(self.__process_faces, frames), total=len(frames),desc="Processing frames"))
  File "D:\refacer-main\venv\lib\site-packages\tqdm\std.py", line 1178, in __iter__
    for obj in iterable:
  File "C:\Users\user\AppData\Local\Programs\Python\Python310\lib\concurrent\futures\_base.py", line 621, in result_iterator
    yield _result_or_cancel(fs.pop())
  File "C:\Users\user\AppData\Local\Programs\Python\Python310\lib\concurrent\futures\_base.py", line 319, in _result_or_cancel
    return fut.result(timeout)
  File "C:\Users\user\AppData\Local\Programs\Python\Python310\lib\concurrent\futures\_base.py", line 458, in result
    return self.__get_result()
  File "C:\Users\user\AppData\Local\Programs\Python\Python310\lib\concurrent\futures\_base.py", line 403, in __get_result
    raise self._exception
  File "C:\Users\user\AppData\Local\Programs\Python\Python310\lib\concurrent\futures\thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "D:\refacer-main\refacer.py", line 144, in __process_faces
    frame = self.face_swapper.get(frame, face, rep_face[1], paste_back=True)
  File "D:\refacer-main\venv\lib\site-packages\insightface\model_zoo\inswapper.py", line 53, in get
    pred = self.session.run(self.output_names, {self.input_names[0]: blob, self.input_names[1]: latent})[0]
  File "D:\refacer-main\venv\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 217, in run
    return self._sess.run(output_names, input_feed, run_options)
onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Non-zero status code returned while running Conv node. Name:'Conv_42' Status Message: D:\a\_work\1\s\onnxruntime\core\providers\cuda\cuda_call.cc:121 onnxruntime::CudaCall D:\a\_work\1\s\onnxruntime\core\providers\cuda\cuda_call.cc:114 onnxruntime::CudaCall CUDA failure 2: out of memory ; GPU=0 ; hostname=HOMEPC ; file=D:\a\_work\1\s\onnxruntime\core\providers\cuda\cuda_allocator.cc ; line=48 ; expr=cudaMalloc((void**)&p, size);
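
The failing call is a `cudaMalloc` inside ONNX Runtime's CUDA provider. One knob that may be worth experimenting with is the provider's memory arena settings. The sketch below only builds the `providers` argument; the option names are ONNX Runtime CUDA execution provider options, but the 4 GiB cap is an arbitrary example, `cuda_providers` is a hypothetical helper, and since insightface creates the session in the traceback, how this would be wired into refacer is an open assumption.

```python
def cuda_providers(device_id=0, gpu_mem_limit_gib=4):
    """Build a providers list for onnxruntime.InferenceSession that caps the
    CUDA memory arena and falls back to the CPU provider."""
    cuda_options = {
        "device_id": device_id,
        # Upper bound for the CUDA memory arena, in bytes
        "gpu_mem_limit": int(gpu_mem_limit_gib * 1024 ** 3),
        # Grow the arena by the requested size rather than by powers of two
        "arena_extend_strategy": "kSameAsRequested",
    }
    return [("CUDAExecutionProvider", cuda_options), "CPUExecutionProvider"]


if __name__ == "__main__":
    print(cuda_providers())
    # Usage sketch (needs onnxruntime-gpu; "model.onnx" is a placeholder path):
    # import onnxruntime
    # session = onnxruntime.InferenceSession("model.onnx", providers=cuda_providers())
```

Because the swap runs inside a thread pool, reducing the number of worker threads is another thing to try, since fewer frames in flight means fewer simultaneous GPU allocations.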

suphamster commented 1 year ago

I got about a 4x speedup (from 5-6 it/s to 20-24 it/s) with this tweak: https://github.com/xaviviro/refacer/compare/main...suphamster:refacer:patch-1 but I dunno why GPU usage is still low on the current version of refacer; the tweak only raises CPU load. I have an RTX 4070 GPU, Win10 22H2.

xaviviro / refacer

Refacing operation does not seem to fully use GPU #5