yl4579 / StyleTTS2

StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models
MIT License

Very high GPU memory usage in voice cloning after 10-15 runs. #222

Open amssss0 opened 8 months ago

amssss0 commented 8 months ago

PDF to Audiobook Caper: Memory Leak Mystery! Yo! I just tried to turn a PDF into an audiobook with some sweet text-to-speech (TTS) code (https://github.com/yl4579/StyleTTS2), but things went south faster than a rogue audiobook narrator.

The Plot: At first, everything was smooth sailing. GPU memory usage stayed low (chillin' around 6GB) for the first 10-15 runs, each on a chunk of roughly 150-250 words. But then, memory mayhem! Processing more text caused a massive jump in usage that maxed out my trusty T4's 15GB of VRAM and crashed the mission.

Suspect Identified: After some code sleuthing, it seems the model.decode() call is the memory-hoarding culprit. Bummer!
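One way to pin down whether memory really grows run over run is to record a memory reading after every synthesis call and flag the runs where it rises. The sketch below is a minimal, framework-agnostic version of that idea: `profile_runs` and `find_growth` are hypothetical helper names, and the measurement function is passed in so the same code works with a fake counter (as in the demo) or with a real GPU reading.

```python
from typing import Callable, Iterable, List, Sequence


def profile_runs(
    run_fn: Callable[[str], object],
    chunks: Iterable[str],
    measure_fn: Callable[[], int],
) -> List[int]:
    """Call run_fn on each text chunk and record measure_fn() after each call."""
    readings: List[int] = []
    for chunk in chunks:
        run_fn(chunk)                   # one synthesis run per chunk
        readings.append(measure_fn())   # memory reading right after the run
    return readings


def find_growth(readings: Sequence[int], tolerance: int = 0) -> List[int]:
    """Return indices of runs whose reading rose by more than `tolerance`
    compared with the previous run."""
    return [
        i
        for i in range(1, len(readings))
        if readings[i] - readings[i - 1] > tolerance
    ]


if __name__ == "__main__":
    # Fake workload (hypothetical): "memory" grows by 100 units per call
    # after the third call, mimicking the pattern described in the issue.
    state = {"mem": 6000, "calls": 0}

    def fake_run(text: str) -> None:
        state["calls"] += 1
        if state["calls"] > 3:
            state["mem"] += 100

    readings = profile_runs(fake_run, ["a", "b", "c", "d", "e"], lambda: state["mem"])
    print(readings)               # [6000, 6000, 6000, 6100, 6200]
    print(find_growth(readings))  # [3, 4]
```

With PyTorch, `torch.cuda.memory_allocated` (bytes currently held by tensors) could be passed as `measure_fn`, with `run_fn` wrapping the notebook's synthesis call. A steadily rising `memory_allocated` would suggest tensors are being kept alive between runs. If that value stays flat while `torch.cuda.memory_reserved` or `nvidia-smi` keeps climbing, the caching allocator is the more likely suspect.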

Calling All Code Commandos: Has anyone out there apprehended this memory leak and patched model.decode()? If so, spill the beans! Any intel is appreciated.

Stay tuned, fellow audio adventurers. We'll crack this case and get those audiobooks flowing in no time!

amssss0 commented 8 months ago

I used this notebook: https://github.com/yl4579/StyleTTS2/blob/main/Colab/StyleTTS2_Demo_LibriTTS.ipynb

niranjanakella commented 3 months ago

@yl4579 I agree; I saw similar memory spikes when trying other endpoints too (LJSpeech with noise). Very strange behaviour. @amssss0, any luck so far? I'll give it a try too.