ylacombe / finetune-hf-vits

Finetune VITS and MMS using HuggingFace's tools
MIT License

Getting error while fine tuning for Hindi #13

Open sanjitk2014 opened 5 months ago

sanjitk2014 commented 5 months ago

Thanks. I am getting the error below: RuntimeError: Expected tensor for argument #1 'indices' to have one of the following scalar types: Long, Int; but got torch.cuda.FloatTensor instead (while checking arguments for embedding)

Please help. I am using Google Colab and followed the instructions exactly.

    2024-02-19 11:48:42.153900: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
    2024-02-19 11:48:42.153955: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
    2024-02-19 11:48:42.155392: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
    2024-02-19 11:48:43.496722: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
    /usr/local/lib/python3.10/dist-packages/torch/functional.py:650: UserWarning: stft with return_complex=False is deprecated. In a future pytorch release, stft will return complex tensors for all inputs, and return_complex=False will raise an error. Note: you can still call torch.view_as_real on the complex output to recover the old return format. (Triggered internally at ../aten/src/ATen/native/SpectralOps.cpp:863.)
      return _VF.stft(input, n_fft, hop_length, win_length, window,  # type: ignore[attr-defined]
    Steps: 0%| | 50/175200 [00:36<26:49:06, 1.81it/s, lr=2e-5, step_loss=29.5, step_loss_disc=2.78, step_loss_duration=1.5
    02/19/2024 11:49:16 - INFO - __main__ - Running validation...
    VALIDATION - batch 0, process0, waveform torch.Size([4, 134400, 1]), tokens torch.Size([4, 169])...
    VALIDATION - batch 0, process0, PADDING AND GATHER...
    Traceback (most recent call last):
      File "/content/drive/MyDrive/MMSTTS1/finetune-hf-vits/run_vits_finetuning.py", line 1494, in <module>
        main()
      File "/content/drive/MyDrive/MMSTTS1/finetune-hf-vits/run_vits_finetuning.py", line 1327, in main
        full_generation = model(full_generation_sample.to(model.device), speaker_id=speaker_id)
      File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
        return self._call_impl(*args, **kwargs)
      File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
        return forward_call(*args, **kwargs)
      File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py", line 817, in forward
        return model_forward(*args, **kwargs)
      File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py", line 805, in __call__
        return convert_to_fp32(self.model_forward(*args, **kwargs))
      File "/usr/local/lib/python3.10/dist-packages/torch/amp/autocast_mode.py", line 16, in decorate_autocast
        return func(*args, **kwargs)
      File "/content/drive/MyDrive/MMSTTS1/finetune-hf-vits/utils/modeling_vits_training.py", line 2151, in forward
        return self._inference_forward(
      File "/content/drive/MyDrive/MMSTTS1/finetune-hf-vits/utils/modeling_vits_training.py", line 2000, in _inference_forward
        text_encoder_output = self.text_encoder(
      File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
        return self._call_impl(*args, **kwargs)
      File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
        return forward_call(*args, **kwargs)
      File "/content/drive/MyDrive/MMSTTS1/finetune-hf-vits/utils/modeling_vits_training.py", line 1563, in forward
        hidden_states = self.embed_tokens(input_ids) * math.sqrt(self.config.hidden_size)
      File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
        return self._call_impl(*args, **kwargs)
      File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
        return forward_call(*args, **kwargs)
      File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/sparse.py", line 162, in forward
        return F.embedding(
      File "/usr/local/lib/python3.10/dist-packages/torch/nn/functional.py", line 2233, in embedding
        return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
    RuntimeError: Expected tensor for argument #1 'indices' to have one of the following scalar types: Long, Int; but got torch.cuda.FloatTensor instead (while checking arguments for embedding)

ylacombe commented 5 months ago

Hey @sanjitk2014, you should probably check your samples. This might very well be caused by empty text or empty audio. Let me know how it goes.

sanjitk2014 commented 5 months ago

I have checked the dataset: there is no empty audio and no empty text. I used the following code to verify the dataset:

    from datasets import load_dataset

    dataset = load_dataset("/content/drive/MyDrive/MMSTTS1/finetune-hf-vits/ttsdata")


    def prepare_dataset(batch):
        # load the audio and compute its duration in seconds
        audio = batch["audio"]
        batch["input_length"] = len(audio["array"]) / audio["sampling_rate"]

        # report files whose audio is empty
        if batch["input_length"] <= 0:
            print(batch["file_name"])

        # process targets: report files whose transcription is empty
        input_str = batch["transcription"]
        if len(input_str) <= 0:
            print(batch["file_name"])

        return batch


    train_data1 = dataset.map(prepare_dataset, num_proc=1)

sanjitk2014 commented 5 months ago

I generated the checkpoint from facebook/mms-tts-hin and am using it as the pretrained model.

ylacombe commented 5 months ago

I think you should check for empty samples after the dataset has been prepared, not before.
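
A minimal sketch of such a check, assuming each prepared example is a dictionary that holds its token IDs under `input_ids`. The helper name `find_empty_samples` and the sample data are hypothetical and not part of the training script:

```python
def find_empty_samples(examples, key="input_ids"):
    """Return the indices of prepared examples whose token list is empty."""
    return [i for i, example in enumerate(examples) if len(example[key]) == 0]


# Hypothetical prepared examples: every raw text is non-empty, but the
# second one lost all of its characters during tokenization.
prepared = [
    {"text": "first sentence", "input_ids": [12, 4, 7, 4, 9, 3, 5]},
    {"text": "second sentence", "input_ids": []},
    {"text": "third sentence", "input_ids": [8, 2, 4]},
]

empty_indices = find_empty_samples(prepared)
print(empty_indices)
```

The point of running it on the prepared data is that a check on the raw text alone cannot see samples that only become empty once the tokenizer has processed them.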

sanjitk2014 commented 5 months ago

I have checked the dataset and found no empty values or empty strings. I am still getting the same error.

sanjitk2014 commented 5 months ago

Hi Ylacombe, after converting the input_ids to int before passing them to nn.Embedding, the first error went away, but I then ran into the following exception:

    tensor([], device='cuda:0', size=(1, 0, 192))
    Traceback (most recent call last):
      File "/content/drive/MyDrive/MMSTTS1/finetune-hf-vits/run_vits_finetuning.py", line 1494, in <module>
        main()
      File "/content/drive/MyDrive/MMSTTS1/finetune-hf-vits/run_vits_finetuning.py", line 1327, in main
        full_generation = model(full_generation_sample.to(model.device), speaker_id=speaker_id)
      File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
        return self._call_impl(*args, **kwargs)
      File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
        return forward_call(*args, **kwargs)
      File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py", line 817, in forward
        return model_forward(*args, **kwargs)
      File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py", line 805, in __call__
        return convert_to_fp32(self.model_forward(*args, **kwargs))
      File "/usr/local/lib/python3.10/dist-packages/torch/amp/autocast_mode.py", line 16, in decorate_autocast
        return func(*args, **kwargs)
      File "/content/drive/MyDrive/MMSTTS1/finetune-hf-vits/utils/modeling_vits_training.py", line 2159, in forward
        return self._inference_forward(
      File "/content/drive/MyDrive/MMSTTS1/finetune-hf-vits/utils/modeling_vits_training.py", line 2008, in _inference_forward
        text_encoder_output = self.text_encoder(
      File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
        return self._call_impl(*args, **kwargs)
      File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
        return forward_call(*args, **kwargs)
      File "/content/drive/MyDrive/MMSTTS1/finetune-hf-vits/utils/modeling_vits_training.py", line 1573, in forward
        encoder_outputs = self.encoder(
      File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
        return self._call_impl(*args, **kwargs)
      File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
        return forward_call(*args, **kwargs)
      File "/content/drive/MyDrive/MMSTTS1/finetune-hf-vits/utils/modeling_vits_training.py", line 1507, in forward
        layer_outputs = encoder_layer(
      File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
        return self._call_impl(*args, **kwargs)
      File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
        return forward_call(*args, **kwargs)
      File "/content/drive/MyDrive/MMSTTS1/finetune-hf-vits/utils/modeling_vits_training.py", line 1437, in forward
        hidden_states, attn_weights = self.attention(
      File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
        return self._call_impl(*args, **kwargs)
      File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
        return forward_call(*args, **kwargs)
      File "/content/drive/MyDrive/MMSTTS1/finetune-hf-vits/utils/modeling_vits_training.py", line 1282, in forward
        rel_pos_bias = self._relative_position_to_absolute_position(relative_logits)
      File "/content/drive/MyDrive/MMSTTS1/finetune-hf-vits/utils/modeling_vits_training.py", line 1354, in _relative_position_to_absolute_position
        x = nn.functional.pad(x, [0, 1, 0, 0, 0, 0])
    RuntimeError: The input size 0, plus negative padding 0 and 0 resulted in a negative output size, which is invalid. Check dimension 1 of your input.

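Both errors are consistent with an empty token sequence reaching the model. The sketch below is an illustration under that assumption, not a trace of the actual training script: it shows that PyTorch gives a tensor built from an empty list a floating-point dtype, and that casting it to an integer type only moves the failure further down. The vocabulary size of 10 is arbitrary; the embedding width of 192 is taken from the tensor printed above.

```python
import torch

# An empty Python list carries no dtype information, so PyTorch falls back
# to its default floating-point dtype instead of an integer one.
empty_ids = torch.tensor([[]])
print(empty_ids.dtype, empty_ids.shape)

# Casting makes nn.Embedding accept the tensor, but the sequence is still
# empty, so the embedded output has a sequence length of zero.
embedding = torch.nn.Embedding(num_embeddings=10, embedding_dim=192)
hidden = embedding(empty_ids.long())
print(hidden.shape)
```

The zero-length dimension in `hidden` matches the `size=(1, 0, 192)` tensor in the log, which suggests the cast hides the symptom while the underlying problem (a sample with no tokens) remains.
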
VafaKnm commented 4 months ago

Hi! I get the same error while fine-tuning the mms-tts-fas model for Persian (Farsi). I printed the waveforms and token_ids for debugging; you can see them in the screenshots below. As you can see, some samples have empty tokens even though there is no empty text in my dataset. Did you find a solution for this?

[Screenshots (152), (153), (155)]

ylacombe commented 4 months ago

Hi, screenshots like this are really not helpful!

Both of your issues seem related to some samples being empty, i.e. not tokenized properly. Could you give a link to the datasets you're using?

Thanks

VafaKnm commented 4 months ago

Hi! This is the "prepare_dataset" function from "run_vits_finetuning". I added a few lines of code that write some information to text files for debugging. One file contains "input_str", which is the output of the "uromanize" function, and the other contains "string_inputs", which is the output of the tokenizer.

    def prepare_dataset(batch):
        # process target audio
        sample = batch[audio_column_name]
        audio_inputs = feature_extractor(
            sample["array"],
            sampling_rate=sample["sampling_rate"],
            return_attention_mask=False,
            do_normalize=do_normalize,
        )

        batch["labels"] = audio_inputs.get("input_features")[0]

        # process text inputs
        input_str = batch[text_column_name].lower() if do_lower_case else batch[text_column_name]

        if is_uroman:
            input_str = uromanize(input_str, uroman_path=uroman_path)
        string_inputs = tokenizer(input_str, return_attention_mask=False)

        # Writing input_str to a text file
        with open("/home/user1/vits_input_str.txt", "a") as file:
            file.write(input_str + "\n")

        # Writing string_inputs to a text file
        with open("/home/user1/vits_string_inputs.txt", "a") as file:
            file.write(str(string_inputs) + "\n")

        batch[model_input_name] = string_inputs.get("input_ids")[: max_tokens_length + 1]
        batch["waveform_input_length"] = len(sample["array"])
        batch["tokens_input_length"] = len(batch[model_input_name])
        batch["waveform"] = batch[audio_column_name]["array"]

        batch["mel_scaled_input_features"] = audio_inputs.get("mel_scaled_input_features")[0]

        if speaker_id_column_name is not None:
            if new_num_speakers > 1:
                # align speaker_id to [0, num_speaker_id-1].
                batch["speaker_id"] = speaker_id_dict.get(batch[speaker_id_column_name], 0)
        return batch

After inspecting these text files, I found that the file with the "uromanize" output is correct, but the file with the tokenizer output has problems: some of the token sequences are empty and most of the others are tokenized incorrectly. I noticed that, although the documentation lists Persian as one of the uroman languages, the "is_uroman" parameter in the "tokenizer_config" file of the main model is set to "False": https://huggingface.co/facebook/mms-tts-fas/blob/main/tokenizer_config.json
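
A rough sketch of how such a mismatch could produce empty token sequences, assuming the tokenizer silently drops characters that are missing from its vocabulary (an assumption about its behaviour, not something confirmed in this thread). The helper `keep_known_chars` and the tiny vocabulary are made up for illustration and are not the real mms-tts-fas vocabulary:

```python
def keep_known_chars(text, vocab):
    """Drop every character that is not in the vocabulary."""
    return "".join(ch for ch in text if ch in vocab)


original = "سلام دنیا"
romanized = "salam donya"  # roughly what a romanizer might output

# Made-up vocabulary: just the characters of the Persian-script sentence.
persian_vocab = set(original)

kept_original = keep_known_chars(original, persian_vocab)
kept_romanized = keep_known_chars(romanized, persian_vocab)
print(repr(kept_original), repr(kept_romanized))
```

Under that assumption, text in the native script passes through unchanged, while romanized text loses every letter and leaves at most whitespace to tokenize.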

So I changed my previous config and set "is_uroman" to False. As a result, the error was fixed for me.
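
A check like this can be done before training starts. The sketch below assumes a local copy of the checkpoint's `tokenizer_config.json` containing the `is_uroman` key mentioned above; the helper names `read_is_uroman` and `check_uroman_setting`, and the temporary stand-in file, are hypothetical:

```python
import json
import tempfile
from pathlib import Path


def read_is_uroman(tokenizer_config_path):
    """Return the `is_uroman` flag stored in a tokenizer_config.json file."""
    with open(tokenizer_config_path, encoding="utf-8") as f:
        return bool(json.load(f).get("is_uroman", False))


def check_uroman_setting(tokenizer_config_path, training_uses_uroman):
    """Raise if the training setup and the tokenizer config disagree."""
    expected = read_is_uroman(tokenizer_config_path)
    if expected != training_uses_uroman:
        raise ValueError(
            f"tokenizer_config.json has is_uroman={expected}, "
            f"but the training setup uses is_uroman={training_uses_uroman}"
        )
    return expected


# Demo with a stand-in config file instead of a real checkpoint.
with tempfile.TemporaryDirectory() as tmp:
    config_path = Path(tmp) / "tokenizer_config.json"
    config_path.write_text(json.dumps({"is_uroman": False}), encoding="utf-8")

    flag = read_is_uroman(config_path)
    try:
        check_uroman_setting(config_path, training_uses_uroman=True)
        mismatch_detected = False
    except ValueError:
        mismatch_detected = True

print(flag, mismatch_detected)
```

Failing early on a disagreement is a design choice here: it surfaces the configuration problem directly instead of letting it appear later as an embedding or padding error.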