xiaozhah / Aligner

Aligner for text-to-speech

Missing files and usage information #1

Open kingfener opened 2 months ago

kingfener commented 2 months ago

First of all, thanks for contributing the code to the open source community. I encountered the following problems when using it:

  1. The requirements.txt file is missing.
  2. The setup.py file needed for python setup.py build_ext --inplace is missing.
  3. Before running the alignment, how do I obtain text_mask and mel_embeddings for aligner(args)?
  4. What happens if a mismatched wav and text pair is used for alignment?

Thanks.

xiaozhah commented 1 month ago

Thank you for your interest in the project and for bringing these issues to our attention. I apologize for any inconvenience caused. Let me address each of your points:

  1. Missing requirements.txt: You're right, and I apologize for this oversight. I'll create and add a requirements.txt file to the repository. For now, the main dependencies are:

    Cython==3.0.10
    numpy==1.23.5
    torch==2.1.0

    Please install these using pip install -r requirements.txt once the file is added.
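
Until requirements.txt is added, one way to confirm that an environment matches the versions listed above is a small check built on the standard library's importlib.metadata. This is only a sketch: the helper name check_pins is hypothetical and not part of the repository, and it assumes that a local build suffix such as +cu118 on the torch version should be ignored when comparing.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions quoted in the reply above.
PINNED = {"Cython": "3.0.10", "numpy": "1.23.5", "torch": "2.1.0"}


def check_pins(pins, get_version=version):
    """Return {package: installed_version_or_None} for every package that
    is missing or whose installed version differs from the pinned one."""
    problems = {}
    for name, wanted in pins.items():
        try:
            found = get_version(name)
        except PackageNotFoundError:
            found = None  # package is not installed at all
        # Assumption: ignore local build tags such as "2.1.0+cu118".
        base = found.split("+")[0] if found is not None else None
        if base != wanted:
            problems[name] = found
    return problems


if __name__ == "__main__":
    for name, found in check_pins(PINNED).items():
        print(f"{name}: expected {PINNED[name]}, found {found}")
```

An empty result means all three packages match the listed versions.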

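  2. Missing setup.py: The file is not missing; it is located at monotonic_align/setup.py, so the build command python setup.py build_ext --inplace should be run from inside the monotonic_align directory.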

  3. Obtaining text_mask and mel_embeddings:

    • text_mask is a boolean tensor indicating which elements in the text sequence are valid (not padding). You can create it based on your input text length.
    • mel_embeddings are typically extracted from your mel spectrogram using a pre-processing step or a neural network. The exact method depends on your TTS pipeline.

    Here's a simple example:

    import torch
    
    # Shapes used here: batch_size = 1, text length = 10, mel length = 100,
    # text embedding dim = 256, mel embedding dim = 80
    text_embeddings = torch.randn(1, 10, 256)  # Replace with your actual text embeddings
    mel_embeddings = torch.randn(1, 100, 80)   # Replace with your actual mel embeddings
    
    text_mask = torch.ones(1, 10).bool()       # True = valid token; adjust to your actual text length
    mel_mask = torch.ones(1, 100).bool()       # True = valid frame; adjust to your actual mel length
    
    # `aligner` is an already-constructed instance of the aligner module from this repository
    alignment = aligner(text_embeddings, mel_embeddings, text_mask, mel_mask)
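
For batches that contain padded sequences, the all-ones masks above are no longer correct. A minimal sketch of building the masks from per-sequence lengths follows. The helper name lengths_to_mask is hypothetical (it is not part of the repository), and the example lengths are made up for illustration. True marks a valid position, matching the description of text_mask above.

```python
import torch


def lengths_to_mask(lengths, max_len=None):
    """Build a boolean mask of shape (batch, max_len) where True marks a
    valid (non-padding) position."""
    lengths = torch.as_tensor(lengths)
    if max_len is None:
        max_len = int(lengths.max())
    positions = torch.arange(max_len).unsqueeze(0)  # shape (1, max_len)
    return positions < lengths.unsqueeze(1)         # shape (batch, max_len)


# Two utterances padded to the longest one in the batch (illustrative lengths).
text_mask = lengths_to_mask([10, 7])    # shape (2, 10)
mel_mask = lengths_to_mask([100, 64])   # shape (2, 100)
```

The resulting tensors can be passed in place of the torch.ones(...) masks, provided the embeddings are padded to the same maximum lengths.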
  4. Mismatched wav and text: Using mismatched wav and text for alignment is not recommended as it will produce incorrect alignments. The aligner assumes that the input text and audio correspond to each other. If they don't match:

    • The alignment process might still complete, but the results will be meaningless.
    • You might encounter errors if the lengths are significantly different.
    • The quality of any TTS system using these alignments will be severely compromised.

    Always ensure that your wav files and text inputs correspond correctly to each other.
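
A cheap sanity check before alignment can catch some grossly mismatched pairs by comparing the number of text tokens with the number of mel frames. The sketch below is an assumption-laden heuristic, not something the aligner itself does: the function name check_pair_lengths and both thresholds are hypothetical, and the lower bound assumes that each token should receive at least one frame. Passing the check does not prove that the text and audio actually match.

```python
def check_pair_lengths(num_tokens, num_frames,
                       min_frames_per_token=1.0,
                       max_frames_per_token=30.0):
    """Return a list of warning strings; an empty list means the pair's
    lengths look plausible. Thresholds are illustrative, not tuned."""
    if num_tokens <= 0 or num_frames <= 0:
        return ["empty text or audio"]

    warnings = []
    ratio = num_frames / num_tokens
    if ratio < min_frames_per_token:
        # Fewer frames than tokens: some tokens could not get a frame.
        warnings.append(f"audio too short for text ({ratio:.2f} frames/token)")
    if ratio > max_frames_per_token:
        # Far more audio than the text would normally account for.
        warnings.append(f"audio too long for text ({ratio:.2f} frames/token)")
    return warnings


if __name__ == "__main__":
    for pair in [(10, 100), (50, 20), (2, 1000)]:
        print(pair, check_pair_lengths(*pair))
```

Sensible thresholds depend on the hop size of the mel spectrogram and on whether tokens are characters or phonemes, so they should be chosen per dataset.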

I hope this helps clarify things. I'll update the repository with the requirements.txt file and improve the documentation to make the usage clearer. If you have any more questions, please don't hesitate to ask!