xcmyz / FastSpeech

The Implementation of FastSpeech based on pytorch.
MIT License

integration of fastspeech with Squeezewave vocoder #59

Closed alokprasad closed 4 years ago

alokprasad commented 4 years ago

Placeholder for issues related to integrating FastSpeech with SqueezeWave. https://github.com/tianrengao/SqueezeWave seems to be considerably faster than waveflow.

alokprasad commented 4 years ago

I tried saving mel_postnet_torch (the mel-spectrogram) to a .pt file and then using it to generate a wav with SqueezeWave, but I get the following error.

    Traceback (most recent call last):
      File "inference.py", line 87, in <module>
        args.sampling_rate, args.is_fp16, args.denoiser_strength)
      File "inference.py", line 57, in main
        audio = squeezewave.infer(mel, sigma=sigma).float()
      File "/mount/data/SqueezeWave/glow.py", line 261, in infer
        output = self.WN[k]((audio_0, spect))
      File "/home/alok/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
        result = self.forward(*input, **kwargs)
      File "/mount/data/SqueezeWave/glow.py", line 165, in forward
        spect = self.cond_layer(spect)
      File "/home/alok/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
        result = self.forward(*input, **kwargs)
      File "/home/alok/.local/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 187, in forward
        self.padding, self.dilation, self.groups)
    RuntimeError: Expected 3-dimensional input for 3-dimensional weight [2048, 80, 1], but got 4-dimensional input of size [1, 1, 80, 133] instead

Any idea what could be the issue?

alokprasad commented 4 years ago

Squeezing the leading dimension out of mel_postnet_torch before saving produces the tensor that is the input to SqueezeWave:

    melspec = torch.squeeze(mel_postnet_torch, 0)
    torch.save(melspec, "/tmp/test.pt")

test.pt is then the mel-spectrogram input to SqueezeWave.
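For anyone hitting the same error, the shape mismatch and the effect of the squeeze can be reproduced in isolation. This is only a rough sketch: `cond_layer` here is a stand-in built from the weight shape [2048, 80, 1] reported in the traceback, not SqueezeWave's actual layer, and the mel tensor is random data with the shape from the error message.

```python
import torch
import torch.nn as nn

# Assumption: a stand-in for the conditioning layer. A weight of shape
# [2048, 80, 1] corresponds to Conv1d(80 -> 2048, kernel_size=1).
cond_layer = nn.Conv1d(80, 2048, kernel_size=1)

# Assumption: random data in place of mel_postnet_torch, using the
# 4-dimensional shape from the error message.
mel_4d = torch.randn(1, 1, 80, 133)

# Conv1d rejects the 4-dimensional tensor.
try:
    cond_layer(mel_4d)
    failed = False
except RuntimeError as err:
    failed = True
    print("Conv1d rejected the input:", err)

# Dropping the leading dimension leaves (batch, n_mels, frames).
mel_3d = torch.squeeze(mel_4d, 0)
out = cond_layer(mel_3d)

print(tuple(mel_3d.shape))  # (1, 80, 133)
print(tuple(out.shape))     # (1, 2048, 133)
```

Checking `melspec.shape` right before `torch.save` is a cheap way to confirm the file holds a 3-dimensional tensor.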

alokprasad commented 4 years ago

@xcmyz

The following text: "Printing, in the only sense with which we are at present concerned, differs from most if not from all the arts and crafts represented in the Exhibition in being comparatively modern" was generated astonishingly fast on a single-core CPU (no GPU), with model loading time included.

11.5 seconds of audio was generated in around 3.83 seconds.

Mel calculation: 2.827802896499634 seconds

SqueezeWave vocoder time: 1.0016820430755615 seconds
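To put these timings in perspective, the real-time factor (processing time divided by audio duration) can be worked out from the numbers reported above. A small sketch; the variable names are only illustrative, and the figures come from a single run, so this is an indication rather than a benchmark.

```python
# Figures as reported in this thread (single run, single-core CPU).
audio_seconds = 11.5
mel_seconds = 2.827802896499634
vocoder_seconds = 1.0016820430755615

# Total synthesis time: mel calculation plus vocoder.
total_seconds = mel_seconds + vocoder_seconds

# Real-time factor: below 1.0 means faster than real time.
rtf = total_seconds / audio_seconds
speedup = audio_seconds / total_seconds

print(f"total: {total_seconds:.2f} s")
print(f"real-time factor: {rtf:.2f}")
print(f"speed-up over real time: {speedup:.1f}x")
```

On these figures the whole pipeline runs roughly three times faster than real time, with the mel calculation step accounting for most of the total.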