Open jingli-wtbox opened 8 months ago
Setting up the conversation usually takes around 60 seconds. After that, each chat turn usually takes about 6 seconds to get a response from ChatGPT.
In my test on a single GPU (RTX A5000), rendering takes around 8 seconds. But SadTalker's rendering could be parallelized.
You can try running it locally and see whether it is real-time. :)
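Since each rendered frame depends only on its own motion coefficients, frame-level parallelism across GPUs seems plausible. Below is a minimal sketch of how a frame range could be partitioned across workers; `split_frames`, `render_on_gpu`, and `parallel_render` are hypothetical helpers for illustration, not part of this repo:

```python
from concurrent.futures import ProcessPoolExecutor

def split_frames(num_frames: int, num_workers: int) -> list[range]:
    """Split frame indices into contiguous, near-equal chunks."""
    base, extra = divmod(num_frames, num_workers)
    chunks, start = [], 0
    for i in range(num_workers):
        size = base + (1 if i < extra else 0)
        chunks.append(range(start, start + size))
        start += size
    return chunks

def render_on_gpu(gpu_id: int, frames: range) -> list[int]:
    # Placeholder: a real worker would pin itself to one GPU
    # (e.g. via CUDA_VISIBLE_DEVICES), load the renderer once,
    # and render only its assigned frames.
    return list(frames)

def parallel_render(num_frames: int, num_gpus: int) -> list[int]:
    chunks = split_frames(num_frames, num_gpus)
    with ProcessPoolExecutor(max_workers=num_gpus) as pool:
        results = pool.map(render_on_gpu, range(num_gpus), chunks)
    # Chunks are contiguous, so concatenation preserves frame order.
    return [f for chunk in results for f in chunk]
```

The main costs this sketch ignores are loading the model once per worker and stitching the per-chunk videos back together, so the speedup would be sublinear in practice.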
Thank you. I will try it on other types of GPU.
Could you theoretically just run this on 8xH100 and have it work in real time? Maybe a real-time conversation version of this software is worth looking into.
Hi, thanks for your interest in the work! We do not have an H100 at hand right now. However, based on our observations on A100 GPUs, the total time cost excluding GPT API calls is within 10 s, and the face rendering process takes 1-2 s. We will try to replace the ChatGPT API for real-time chat in the coming month.
Great project! I replaced ChatGPT with my own small model and tested it on my own 3080 Ti graphics card, and the timing details are as follows:
===================================
Face Renderer:: 100%|80/80 [00:22<00:00, 3.49it/s]
fps:25.0
OpenCV: FFMPEG: tag 0x44495658/'XVID' is not supported with codec id 12 and format 'mp4 / MP4 (MPEG-4 Part 14)'
OpenCV: FFMPEG: fallback to use tag 0x7634706d/'mp4v'
seamlessClone:: 100%|318/318 [00:15<00:00, 20.66it/s]
===================================
I wonder if anyone has an efficient implementation or ideas for accelerating the video generation process; I have been interested in this lately. What I want to do now is output the facial images in sync with the voice as soon as TTS completes. However, because face generation is relatively slow, the streaming output ends up very jerky.
(My goal is to be as smooth as D-ID: input any image and voice, and quickly generate a video or a smooth streaming output.)
By the way, this is the message behind the face rendering log above: "Thank you for the kind words. It is a pleasure to meet you as well. I am here to share the magic and beauty of the world around us. If you have any questions or need any guidance, I am always here to help."
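A side note on the OpenCV warning in that log: it is harmless. The `XVID` codec tag is not valid inside an MP4 container, so OpenCV falls back to `mp4v`; requesting `cv2.VideoWriter_fourcc(*"mp4v")` where the `VideoWriter` is created would silence it. The two hex tags in the warning are just the four characters packed little-endian, which can be checked without OpenCV:

```python
def fourcc(code: str) -> int:
    """Pack a four-character codec tag little-endian, as OpenCV/FFmpeg do."""
    return sum(ord(ch) << (8 * i) for i, ch in enumerate(code))

print(hex(fourcc("XVID")))  # 0x44495658 -- the rejected tag from the log
print(hex(fourcc("mp4v")))  # 0x7634706d -- the fallback tag OpenCV chose
```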
How did you replace ChatGPT: with another OpenAI model, or with a locally hosted OpenAI-API-compatible program?
I simply wrapped my local model as a service (with an input/output format similar to OpenAI's), deployed it locally, and then made some modifications to /chat_anything/chatbot/chat.py.
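For anyone attempting the same swap, here is a minimal sketch of the two pieces such a drop-in needs, assuming the local service mimics OpenAI's chat-completions JSON shape. The model name and helper names are illustrative, not from the repo:

```python
def build_request(messages: list[dict], model: str = "local-model") -> dict:
    """Payload in the OpenAI chat-completions shape a local server can accept."""
    return {"model": model, "messages": messages}

def extract_reply(response: dict) -> str:
    """Pull the assistant text out of an OpenAI-style completion response."""
    return response["choices"][0]["message"]["content"]

# The minimal response shape the local service would need to return:
fake_response = {
    "choices": [{"message": {"role": "assistant", "content": "Hello there!"}}]
}
print(extract_reply(fake_response))  # Hello there!
```

The modification to chat.py then amounts to POSTing `build_request(...)` to the local endpoint and feeding `extract_reply(...)` back into the pipeline, instead of calling the OpenAI client.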
The facial image generation only executes once, at the first round of conversation ("..., Bot: how are you doing..."), so I think its latency should be acceptable.
And by the way, the step "seamlessClone:: 100%|318/318 [00:15<00:00, 20.66it/s]" comes from a SadTalker option that renders the cropped face and pastes it back onto the full image instead of outputting only the crop. You can disable it by unchecking "Use full body instead of a face." on the settings tab. It seems unoptimized and takes up a lot of time O.o
Very excited to see more progress in this area!
Yep. When running on a 4090 (considering only the face renderer), the time required for video generation is not significantly longer than the video itself, so theoretically a streaming output at 25 fps could feel relatively smooth.
Currently I am trying to integrate Live2D, and beyond that I hope to drive a custom full-body image from a single input picture, much like a prepared Live2D model (that is my next plan), but I don't have much experience in this area of CV. Any suggestions?
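One way to reason about when streaming stays smooth: if the renderer produces frames more slowly than the 25 fps playback rate, some frames must be pre-buffered, and the required buffer follows from comparing when frame i finishes rendering against when it must be shown. A back-of-envelope sketch (the fps figures in the test values are illustrative, not measurements from this thread):

```python
import math

def min_prebuffer_frames(total_frames: int, render_fps: float,
                         playback_fps: float = 25.0) -> int:
    """Fewest frames to pre-render so sequential playback never stalls.

    Frame i finishes rendering at (i + 1) / render_fps. Playback starts
    once B frames exist (t = B / render_fps) and shows frame i at
    B / render_fps + i / playback_fps. When render_fps < playback_fps the
    binding constraint is the last frame, which gives
    B >= N - render_fps * (N - 1) / playback_fps.
    """
    if render_fps >= playback_fps:
        return 1  # the renderer keeps up; one frame of lead is enough
    need = total_frames - render_fps * (total_frames - 1) / playback_fps
    return max(1, math.ceil(need))
```

For example, a 318-frame clip rendered at an effective 14 fps needs about 141 frames buffered, i.e. roughly 10 s of startup latency, before 25 fps playback can run without stalls; at render rates at or above 25 fps the buffer collapses to a single frame, which matches the 4090 observation above.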
Thank you for sharing such great work. It's awesome.
It felt like real-time chat when I went through some of the examples, such as "Examples on Image-based Chat Persona" on the project page.
May I know whether ChatAnything supports real-time chat?
Thanks.