yl4579 / StyleTTS2

StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models

Same voice with different emotion #203

Closed enla51 closed 7 months ago

enla51 commented 7 months ago

Is it possible to use one voice and the style of another voice for zero-shot TTS? Is there a way to differentiate between the style and the voice characteristics?

For example, I have samples of a voice in a natural style and I want to make it sound happy using samples of another, happy voice.

Thank you.

rlenain commented 7 months ago

Yes, you can do this. There are two style vectors in the model: a "prosody" vector and an acoustic/speaker vector. You do not have to feed the same reference to both encoders, so you can feed the happy reference to get the "prosody" vector and feed a neutral-style reference of the target speaker to get the acoustic/speaker vector.
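
A rough sketch of that splicing (assuming the model, preprocess, and device objects from the repo's inference notebook are already loaded; neutral_audio and happy_audio are placeholders for your two reference waveforms):

import torch

# Mel-spectrograms of the two reference clips (preprocess() as in the inference notebook)
mel_neutral = preprocess(neutral_audio).to(device)  # target speaker, neutral style
mel_happy = preprocess(happy_audio).to(device)      # any speaker, happy style

with torch.no_grad():
    ref_acoustic = model.style_encoder(mel_neutral.unsqueeze(1))   # acoustic/speaker embedding
    ref_prosody = model.predictor_encoder(mel_happy.unsqueeze(1))  # prosody embedding

# Concatenate in the same [acoustic, prosody] order the notebook's compute_style uses
ref = torch.cat([ref_acoustic, ref_prosody], dim=1)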

Let me know if that works!

enla51 commented 7 months ago

It worked, thank you :)

eschmidbauer commented 7 months ago

This is very interesting. @enla51 Can you share a snippet of code on how you accomplished it?

ZYJGO commented 5 months ago

I tried using two pieces of reference audio to extract ref_s and ref_p separately with the compute_style function below, e.g. ref_s from a neutral-emotion audio file and ref_p from an angry audio file (and also vice versa), then combined them and fed the result into the inference function, but the output preserved neither the speaker's style nor the emotion.

import librosa
import torch

# preprocess(), model, and device are set up earlier in the inference notebook
def compute_style(path):
    # librosa.load already resamples to 24 kHz, so the branch below is only a safeguard
    wave, sr = librosa.load(path, sr=24000)
    audio, index = librosa.effects.trim(wave, top_db=30)
    if sr != 24000:
        audio = librosa.resample(audio, orig_sr=sr, target_sr=24000)
    mel_tensor = preprocess(audio).to(device)

    with torch.no_grad():
        ref_s = model.style_encoder(mel_tensor.unsqueeze(1))      # acoustic/speaker embedding
        ref_p = model.predictor_encoder(mel_tensor.unsqueeze(1))  # prosody embedding

    return torch.cat([ref_s, ref_p], dim=1)  # [acoustic | prosody]
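
For reference, the combining step described above would look roughly like this (the file names are placeholders; the split assumes the [acoustic | prosody] layout returned by compute_style, i.e. the first half is the speaker embedding and the second half the prosody embedding):

neutral_ref = compute_style("neutral_speaker.wav")  # target speaker, neutral emotion
angry_ref = compute_style("angry_reference.wav")    # other speaker, angry emotion

# Keep speaker identity from the neutral clip, take prosody from the angry clip
half = neutral_ref.shape[1] // 2
mixed_ref = torch.cat([neutral_ref[:, :half], angry_ref[:, half:]], dim=1)

# mixed_ref is then passed to the inference function in place of a single-clip reference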

Has anyone tried other methods with success?

RoversCode commented 3 months ago

I think that if we keep the current training approach unchanged and only make modifications at inference time, then at the very least the model should be able to replicate the emotion of the reference audio, just like the example posted by the author. Of course, this is only my speculation, and I hope someone with more experience can share what has worked for them.