yl4579 / StyleTTS

Official Implementation of StyleTTS
MIT License

Mandarin support? #10

Open lucasjinreal opened 1 year ago

lucasjinreal commented 1 year ago

Mandarin support?

yl4579 commented 1 year ago

I did try training for other languages including Mandarin, Japanese, Hindi etc., though it requires a few changes:

  1. You need to phonemize Chinese into IPA. You can use either https://github.com/bootphon/phonemizer or a look-up table that replaces Chinese characters with IPA symbols. The pre-trained text aligner already includes AiShell (a Mandarin dataset), which uses the following Pinyin-to-IPA conversion table. It may be slightly different from phonemizer's output; phonemizer didn't work for me for Chinese.
    
    ba pˈa
    bo pˈwɔ
    bai pˈaɪ
    bei pˈeɪ
    bao pˈaʊ
    ban pˈan
    ben pˈən
    bang pˈɑŋ
    beng pˈəŋ
    bi pˈi
    biao pˈjaʊ
    bie pˈjɛ
    bian pˈjɛn
    bin pˈin
    bing pˈiŋ
    bu pˈu
    pa pʰˈa
    po pʰˈwɔ
    pai pʰˈaɪ
    pei pʰˈeɪ
    pao pʰˈaʊ
    pou pʰˈoʊ
    pan pʰˈan
    pen pʰˈən
    pang pʰˈɑŋ
    peng pʰˈəŋ
    pi pʰˈi
    piao pʰˈjaʊ
    pie pʰˈjɛ
    pian pʰˈjɛn
    pin pʰˈin
    ping pʰˈiŋ
    pu pʰˈu
    ma mˈa
    me mˈɤ
    mo mˈwɔ
    mai mˈaɪ
    mei mˈeɪ
    mao mˈaʊ
    mou mˈoʊ
    man mˈan
    men mˈən
    mang mˈɑŋ
    meng mˈəŋ
    mi mˈi
    miao mˈjaʊ
    mie mˈjɛ
    miu mˈju
    mian mˈjɛn
    min mˈin
    ming mˈiŋ
    mu mˈu
    fa fˈa
    fo fˈwɔ
    fei fˈeɪ
    fou fˈoʊ
    fan fˈan
    fen fˈən
    fang fˈɑŋ
    feng fˈəŋ
    fu fˈu
    da tˈa
    de tˈɤ
    dai tˈaɪ
    dei tˈeɪ
    dao tˈaʊ
    dou tˈoʊ
    dan tˈan
    dang tˈɑŋ
    deng tˈəŋ
    dong tˈʊŋ
    di tˈi
    diao tˈjaʊ
    die tˈjɛ
    diu tˈjoʊ
    dian tˈjɛn
    ding tˈiŋ
    du tˈu
    duo tˈwɔ
    dui tˈweɪ
    duan tˈwan
    dun tˈwən
    ta tʰˈa
    te tʰˈɤ
    tai tʰˈaɪ
    tao tʰˈaʊ
    tou tʰˈoʊ
    tan tʰˈan
    tang tʰˈɑŋ
    teng tʰˈəŋ
    tong tʰˈʊŋ
    ti tʰˈi
    tiao tʰˈjaʊ
    tie tʰˈjɛ
    tian tʰˈjɛn
    ting tʰˈiŋ
    tu tʰˈu
    tuo tʰˈwɔ
    tui tʰˈweɪ
    tuan tʰˈwan
    tun tʰˈwən
    na nˈa
    ne nˈɤ
    nai nˈaɪ
    nei nˈeɪ
    nao nˈaʊ
    nou nˈoʊ
    nan nˈan
    nen nˈən
    nang nˈɑŋ
    neng nˈəŋ
    nong nˈʊŋ
    ni nˈi
    niao nˈjaʊ
    nie nˈjɛ
    niu nˈjoʊ
    nian nˈjɛn
    nin nˈin
    niang nˈiɑŋ
    ning nˈiŋ
    nu nˈu
    nuo nˈwɔ
    nuan nˈwan
    nü nˈy
    nüe nˈyɛ
    la lˈa
    le lˈɤ
    lai lˈaɪ
    lei lˈeɪ
    lao lˈaʊ
    lou lˈoʊ
    lan lˈan
    lang lˈɑŋ
    leng lˈəŋ
    long lˈʊŋ
    li lˈi
    lia lˈja
    liao lˈjaʊ
    lie lˈjɛ
    liu lˈjoʊ
    lian lˈjɛn
    lin lˈin
    liang lˈiɑŋ
    ling lˈiŋ
    lu lˈu
    luo lˈwɔ
    luan lˈwan
    lun lˈwən
    lü lˈy
    lüe lˈyɛ
    za tsˈa
    ze tsˈɤ
    zi tsˈɹ
    zai tsˈaɪ
    zei tsˈeɪ
    zao tsˈaʊ
    zou tsˈoʊ
    zan tsˈan
    zen tsˈən
    zang tsˈɑŋ
    zeng tsˈəŋ
    zong tsˈʊŋ
    zu tsˈu
    zuo tsˈwɔ
    zui tsˈweɪ
    zuan tsˈwan
    zun tsˈwən
    ca tsʰˈa
    ce tsʰˈɤ
    ci tsʰˈɹ
    cai tsʰˈaɪ
    cao tsʰˈaʊ
    cou tsʰˈoʊ
    can tsʰˈan
    cen tsʰˈən
    cang tsʰˈɑŋ
    ceng tsʰˈəŋ
    cong tsʰˈʊŋ
    cu tsʰˈu
    cuo tsʰˈwɔ
    cui tsʰˈweɪ
    cuan tsʰˈwan
    cun tsʰˈwən
    sa sˈa
    se sˈɤ
    si sˈɹ
    sai sˈaɪ
    sao sˈaʊ
    sou sˈoʊ
    san sˈan
    sen sˈən
    sang sˈɑŋ
    seng sˈeŋ
    song sˈʊŋ
    su sˈu
    suo sˈwɔ
    sui sˈweɪ
    suan sˈwan
    sun sˈwən
    zha ʈʂˈa
    zhe ʈʂˈɤ
    zhi ʈʂˈʐ
    zhai ʈʂˈaɪ
    zhei ʈʂˈeɪ
    zhao ʈʂˈaʊ
    zhou ʈʂˈoʊ
    zhan ʈʂˈan
    zhen ʈʂˈən
    zhang ʈʂˈɑŋ
    zheng ʈʂˈəŋ
    zhong ʈʂˈʊŋ
    zhu ʈʂˈu
    zhua ʈʂˈwa
    zhuo ʈʂˈwɔ
    zhuai ʈʂˈwaɪ
    zhui ʈʂˈweɪ
    zhuan ʈʂˈwan
    zhun ʈʂˈwən
    zhuang ʈʂˈwɑŋ
    cha ʈʂʰˈa
    che ʈʂʰˈɤ
    chi ʈʂʰˈʐ
    chai ʈʂʰˈaɪ
    chao ʈʂʰˈaʊ
    chou ʈʂʰˈoʊ
    chan ʈʂʰˈan
    chen ʈʂʰˈən
    chang ʈʂʰˈɑŋ
    cheng ʈʂʰˈəŋ
    chong ʈʂʰˈʊŋ
    chu ʈʂʰˈu
    chua ʈʂʰˈwa
    chuo ʈʂʰˈwɔ
    chuai ʈʂʰˈwaɪ
    chui ʈʂʰˈweɪ
    chuan ʈʂʰˈwan
    chun ʈʂʰˈwən
    chuang ʈʂʰˈwɑŋ
    sha ʂˈa
    she ʂˈɤ
    shi ʂˈʐ
    shai ʂˈaɪ
    shei ʂˈeɪ
    shao ʂˈaʊ
    shou ʂˈoʊ
    shan ʂˈan
    shen ʂˈən
    shang ʂˈɑŋ
    sheng ʂˈəŋ
    shu ʂˈu
    shua ʂˈwa
    shuo ʂˈwɔ
    shuai ʂˈwaɪ
    shui ʂˈweɪ
    shuan ʂˈwan
    shun ʂˈwən
    shuang ʂˈwɑŋ
    re ɹˈɤ
    ri ɹˈʐ
    rao ɹˈaʊ
    rou ɹˈoʊ
    ran ɹˈan
    ren ɹˈən
    rang ɹˈɑŋ
    reng ɹˈəŋ
    rong ɹˈʊŋ
    ru ɹˈu
    ruo ɹˈwɔ
    rui ɹˈweɪ
    ruan ɹˈwan
    run ɹˈwən
    ji tɕˈi
    jia tɕˈja
    jiao tɕˈjaʊ
    jie tɕˈjɛ
    jiu tɕˈjoʊ
    jian tɕˈjɛn
    jin tɕˈin
    jiang tɕˈiɑŋ
    jing tɕˈiŋ
    jiong tɕˈjʊŋ
    ju tɕˈy
    jue tɕˈyɛ
    juan tɕˈyɛn
    jun tɕˈyn
    qi tɕʰˈi
    qia tɕʰˈja
    qiao tɕʰˈjaʊ
    qie tɕʰˈjɛ
    qiu tɕʰˈjoʊ
    qian tɕʰˈjɛn
    qin tɕʰˈin
    qiang tɕʰˈjɑŋ
    qing tɕʰˈiŋ
    qiong tɕʰˈjʊŋ
    qu tɕʰˈy
    que tɕʰˈyɛ
    quan tɕʰˈyɛn
    qun tɕʰˈyn
    xi ɕˈi
    xia ɕˈja
    xiao ɕˈjaʊ
    xie ɕˈjɛ
    xiu ɕˈjoʊ
    xian ɕˈjɛn
    xin ɕˈin
    xiang ɕˈiɑŋ
    xing ɕˈiŋ
    xiong ɕˈjʊŋ
    xu ɕˈy
    xue ɕˈyɛ
    xuan ɕˈyɛn
    xun ɕˈyn
    ga kˈa
    ge kˈɤ
    gai kˈaɪ
    gei kˈeɪ
    gao kˈaʊ
    gou kˈoʊ
    gan kˈan
    gen kˈən
    gang kˈɑŋ
    geng kˈəŋ
    gong kˈʊŋ
    gu kˈu
    gua kˈwa
    guo kˈwɔ
    guai kˈwaɪ
    gui kˈweɪ
    guan kˈwan
    gun kˈwən
    guang kˈwɑŋ
    ka kʰˈa
    ke kʰˈɤ
    kai kʰˈaɪ
    kei kʰˈeɪ
    kao kʰˈaʊ
    kou kʰˈoʊ
    kan kʰˈan
    ken kʰˈən
    kang kʰˈɑŋ
    keng kʰˈəŋ
    kong kʰˈʊŋ
    ku kʰˈu
    kua kʰˈwa
    kuo kʰˈwɔ
    kuai kʰˈwaɪ
    kui kʰˈweɪ
    kuan kʰˈwan
    kun kʰˈwən
    kuang kʰˈwɑŋ
    ha xˈa
    he xˈɤ
    hai xˈaɪ
    hei xˈeɪ
    hao xˈaʊ
    hou xˈoʊ
    han xˈan
    hen xˈən
    hang xˈɑŋ
    heng xˈəŋ
    hong xˈʊŋ
    hu xˈu
    hua xˈwa
    huo xˈwɔ
    huai xˈwaɪ
    hui xˈweɪ
    huan xˈwan
    hun xˈwən
    huang xˈwɑŋ
    a ˈa
    o ˈo
    e ˈɤ
    er ˈɚ
    ai ˈaɪ
    ei ˈeɪ
    ao ˈaʊ
    ou ˈoʊ
    an ˈan
    en ˈən
    ang ˈɑŋ
    eng ˈəŋ
    yi ˈi
    ya jˈa
    yao jˈaʊ
    ye jˈɛ
    you jˈoʊ
    yan jˈɛn
    yin ˈin
    yang jˈɑŋ
    ying ˈiŋ
    yong ˈjʊŋ
    wu ˈu
    wa wˈa
    wo wˈɔ
    wai wˈaɪ
    wei wˈeɪ
    wan wˈan
    wen wˈən
    wang wˈɑŋ
    weng wˈəŋ
    yu ˈy
    yue ɥˈɛ
    yuan ɥˈɛn
    yun ɥˈn
    hair xˈaɹ
    dianr tˈjaɹ
    wanr wˈaɹ
    nar nˈaɹ
    yanr jˈaɹ
    huor xˈwɔɹ
    duanr tˈwaɹ
    lir lˈjɚ
    huir xˈwjɚ
    zher ʈʂˈɚ
    dour xˈɔɹ
    weir wˈɚ
    kuair kʰˈwaɹ
    guanr gˈwɐʴ
    shir ʂˈɚ
    yuanr ɥˈɚ
    jianr tɕˈjɚ
    her xˈɚ
    jiar tɕˈjaɹ

bor pˈwɔɹ xir ɕˈɚ bianr pˈjɚ fenr fˈɚ wenr wˈɚ der tˈɚ por pʰˈwɔɹ yuer ɥˈɚ mingr mˈjɚ char ʈʂʰˈaɹ xingr ɕˈjɚ zhour ʈʂˈoʊɹ shour ʂˈoʊɹ ter tʰˈɚ yingr ˈjɚ paor pʰˈaɹ fangr fˈɑɹ jingr tɕˈjɚ shur ʂˈuɹ qunr tɕʰˈyɹ hur xˈuɹ miaor mˈjaʊɹ biaor pˈjaʊɹ zhengr ʈʂˈɚ gour kˈoʊɹ pair pʰˈaɹ renr ɹˈɚ gaor kˈaʊɹ lo lˈoʊ tuir tʰˈwɚ huanr xˈwaɹ genr kˈɚ nvr nˈyɹ qianr tɕʰˈjɚ hangr xˈɑɹ chenr ʈʂʰˈɚ den tˈɚ lar lˈaɹ niur nˈjoʊɹ liur lˈjoʊɹ tunr tʰˈwɚ lunr lˈwɚ tour tʰˈoʊɹ hour xˈoʊɹ tianr tʰˈjɚ mianr mˈjɚ mar mˈaɹ pianr pʰˈjɚ maor mˈaʊɹ cair tsʰˈɚ far fˈaɹ shuor ʂˈwɔɹ kanr kʰˈaɹ banr pˈaɹ ger kˈɚ sher ʂˈɚ gunr kˈwɚ beir pˈɚ chuanr ʈʂʰˈwɚ bar pˈaɹ cunr tsʰˈwɚ tiaor tʰˈjaʊɹ shuar ʂˈwaɹ tur tʰˈuɹ zhaor ʈʂˈaʊɹ cher ʈʂʰˈɚ menr mˈɚ qingr tɕʰˈjɚ shanr ʂˈaɹ mor mˈwɔɹ zhur ʈʂˈuɹ wangr wˈɑɹ zhunr ʈʂˈwɚ zhir ʈʂˈɚ haor xˈaʊɹ shuir ʂˈwɚ guor kˈwɔɹ zaor tsˈaʊɹ juanr tɕˈyɚ jiar tɕˈjaɹ xiaor ɕˈjaʊɹ suor sˈwɔɹ shaor ʂˈaʊɹ yir ˈɚ dir tˈɚ ganr kˈaɹ duir tˈwɚ taor tʰˈaʊɹ lianr lˈjɚ benr pˈɚ fanr fˈaɹ xuer ɕˈyɚ pur pʰˈuɹ jinr tɕˈɚ kour kʰˈoʊɹ ker kʰˈɚ mur mˈuɹ liaor lˈjaʊɹ juer tɕˈyɚ your jˈoʊɹ xianr ɕˈjɚ quanr tɕʰˈyɚ yo jˈoʊ sanr sˈaɹ zhuor ʈʂˈwɔɹ tuor tʰˈwɔɹ naor nˈaʊɹ dar tˈaɹ fur fˈuɹ dunr tˈwɚ langr lˈɑɹ dair tˈaɹ huar xˈwaɹ yangr jˈɑɹ

2. You need to add a tone embedding for languages like Chinese and Japanese, for example by replacing the [ProsodyPredictor](https://github.com/yl4579/StyleTTS/blob/main/models.py#L503) with the following code (i.e., concatenating the prosody embedding with the text embedding):
```python
class ProsodyPredictor(nn.Module):

    def __init__(self, n_prods, prod_embd, style_dim, d_hid, nlayers, dropout=0.1):
        super().__init__() 
        self.embedding = nn.Embedding(n_prods, prod_embd * 2)
        self.text_encoder = DurationEncoder(sty_dim=style_dim, 
                                            d_model=d_hid,
                                            nlayers=nlayers, 
                                            dropout=dropout)

        # Original layers (without the tone embedding):
        # self.lstm = nn.LSTM(d_hid + style_dim, d_hid // 2, 1, batch_first=True, bidirectional=True)
        # self.duration_proj = LinearNorm(d_hid, 1)

        # Modified layers: the LSTM input now also carries the prod_embd * 2 tone embedding channels
        self.lstm = nn.LSTM(d_hid + prod_embd * 2 + style_dim, d_hid // 2, 1, batch_first=True, bidirectional=True)
        self.duration_proj = LinearNorm(d_hid, 1)

        self.shared = nn.LSTM(d_hid + prod_embd * 2 + style_dim, d_hid // 2, 1, batch_first=True, bidirectional=True)
        self.F0 = nn.ModuleList()
        self.F0.append(AdainResBlk1d(d_hid, d_hid, style_dim, dropout_p=dropout))
        self.F0.append(AdainResBlk1d(d_hid, d_hid // 2, style_dim, upsample=True, dropout_p=dropout))
        self.F0.append(AdainResBlk1d(d_hid // 2, d_hid // 2, style_dim, dropout_p=dropout))

        self.N = nn.ModuleList()
        self.N.append(AdainResBlk1d(d_hid, d_hid, style_dim, dropout_p=dropout))
        self.N.append(AdainResBlk1d(d_hid, d_hid // 2, style_dim, upsample=True, dropout_p=dropout))
        self.N.append(AdainResBlk1d(d_hid // 2, d_hid // 2, style_dim, dropout_p=dropout))

        self.F0_proj = nn.Conv1d(d_hid // 2, 1, 1, 1, 0)
        self.N_proj = nn.Conv1d(d_hid // 2, 1, 1, 1, 0)

    def forward(self, texts, prosody, style, text_lengths, alignment, m):
        prosody = self.embedding(prosody)
        texts = torch.cat([texts, prosody], axis=1)
        d = self.text_encoder(texts, style, text_lengths, m)

        batch_size = d.shape[0]
        text_size = d.shape[1]

        # predict duration
        input_lengths = text_lengths.cpu().numpy()
        x = nn.utils.rnn.pack_padded_sequence(
            d, input_lengths, batch_first=True, enforce_sorted=False)

        m = m.to(text_lengths.device).unsqueeze(1)

        self.lstm.flatten_parameters()
        x, _ = self.lstm(x)
        x, _ = nn.utils.rnn.pad_packed_sequence(
            x, batch_first=True)

        x_pad = torch.zeros([x.shape[0], m.shape[-1], x.shape[-1]])

        x_pad[:, :x.shape[1], :] = x
        x = x_pad.to(x.device)

        duration = self.duration_proj(nn.functional.dropout(x, 0.5, training=self.training))

        en = (d.transpose(-1, -2) @ alignment)

        return duration.squeeze(-1), en

    def F0Ntrain(self, x, s):
        x, _ = self.shared(x.transpose(-1, -2))

        F0 = x.transpose(-1, -2)
        for block in self.F0:
            F0 = block(F0, s)
        F0 = self.F0_proj(F0)

        N = x.transpose(-1, -2)
        for block in self.N:
            N = block(N, s)
        N = self.N_proj(N)

        return F0.squeeze(1), N.squeeze(1)

    def length_to_mask(self, lengths):
        mask = torch.arange(lengths.max()).unsqueeze(0).expand(lengths.shape[0], -1).type_as(lengths)
        mask = torch.gt(mask+1, lengths.unsqueeze(1))
        return mask
```

  3. Modify meldataset.py to return the tones for each IPA symbol and change your train_list.txt to the following format:
    data/aishell/train/wav/SSB1100/SSB11000297.wav|$ʈʂˈɑŋxˈweɪˈi ʈʂʰˈujˈɛntˈɤ tˈjɛnˈin jˈoʊʂˈənmˈɤ$|X111114444422 111113333555 44444333 33332222555X|382
    data/aishell/train/wav/SSB1567/SSB15670392.wav|$ʂˈʐ fˈuʂˈʐ ʂˈʐlˈɤ ʈʂˈəŋtʰˈi jˈɛˈu$fˈaʈʂˈantˈɤ xˈɤɕˈin tɕʰˈytˈʊŋlˈi$|X444 444444 111444 222223333 44444 11133333555 2221111 111114444444X|274
    data/aishell/train/wav/SSB0603/SSB06030228.wav|$xˈwɔtˈjɛn tˈəŋ tˈwɔxˈɑŋjˈɛ tɕˈiɑŋʂˈoʊ pˈwɔtɕˈi$|X333344444 3333 11112222444 1111114444 11112222X|223
    data/aishell/train/wav/SSB0588/SSB05880296.wav|$ˈinɥˈɛ ˈiʂˈəŋ sˈwɔˈaɪ$|X111444 441111 3333444X|378
    data/aishell/train/wav/SSB0315/SSB03150316.wav|$ʈʂʰˈuɕˈyɛʈʂˈɤ kʰˈɤ ʂˈʐˈjʊŋ tɕˈjaʊʈʂʰˈɑŋtˈɤ ˈiɕˈjɛ tʰˈjaʊʂˈəŋ$|X1111122223333 2222 3334444 444444222222555 441111 4444442222X|241
    data/aishell/train/wav/SSB0631/SSB06310452.wav|$ɕˈjɛntsˈaɪ tɕˈitɕʰˈi ɕˈyɛxˈweɪ kˈənɹˈən kˈoʊtʰˈʊŋ$|X4444444444 111144444 222244444 11112222 111111111X|229
    data/aishell/train/wav/SSB1935/SSB19350402.wav|$xˈwansˈwɔtˈɤ ʂˈʐ tɕʰˈyʂˈʐ lˈaʊpˈaɹtˈɤsˈwən tsˈɹ$|X444443333555 444 44444444 3333444455511111 5555X|345
    data/aishell/train/wav/SSB1203/SSB12030292.wav|$pˈiɹˈu tsˈweɪtɕˈin sˈannˈjɛn tɕˈiŋˈiŋ ʈʂˈwɑŋkʰˈwɑŋ lˈiɑŋxˈaʊtˈəŋ$|X333222 44444444444 111122222 11111222 444444444444 2222222223333X|377
    data/aishell/train/wav/SSB1024/SSB10240312.wav|$xˈaˈɚ pˈinʂˈʐ tˈiˈu sˈɹʈʂˈʊŋɕˈyɛtˈɤ ʈʂˈaʊpʰˈaɪ ˈy pʰˈɑŋpˈjɛn ʂˈɑŋxˈu ɕˈiɑŋpˈi$|X11133 1111444 44433 444111112222555 1111155555 33 2222211111 1111444 11111333X|231
    data/jvs_ver1/jvs088/parallel100/wav24kHz16bit/VOICEACTRESS100_037.wav|$kˈomˈʲɯːɴ ɯˈa $ sˈeːnˈɯ gˈaɯˈa tˈo $ esˈo ɴ nˈɯ kˈaɯˈa nˈo $ gˈoːɽˈʲɯː tɕˈitˈeɴ tˈo nˈaʔ tˈe iɽˈɯ$|XLLLHHHHLL LLL X LLLHHHH LLLLLL LLL X LHHH H HHH LLLLLL LLL X LLLHHHHHH HHHHLLLL LLL HHHL LLL LHHHX|88
    data/aishell/train/wav/SSB0671/SSB06710188.wav|$tɕˈiɑŋɕˈjɛn nˈanfˈan ˈjʊŋxˈʊŋ fˈɑŋɕˈin lˈiɑŋjˈoʊ ʈʂˈʊŋɕˈintˈjɛn$|X44444444444 22222222 33332222 44441111 222222222 11111111144444X|363
    data/aishell/train/wav/SSB0380/SSB03800184.wav|$kʰˈɤ ɥˈɛxˈan tɕˈjoʊʂˈʐ tʰˈiŋpˈu tɕˈintɕʰˈy$|X3333 1114444 444444444 11111222 4444444444X|323
    data/aishell/train/wav/SSB0760/SSB07600247.wav|$tˈɑŋɹˈan wˈɔ ɕˈjɛntsˈaɪ ˈitɕˈiŋ mˈeɪjˈoʊ ʈʂˈɤkˈɤ tsˈɹkˈɤ tsˈaɪkˈən nˈiʂˈwɔ ʈʂˈɤkˈɤ xˈwaɹ$|X11112222 333 4444444444 3311111 22223333 4444444 1111222 444441111 3331111 4444444 44444X|237
    data/aishell/train/wav/SSB0016/SSB00160083.wav|$pˈaʂˈʐˈutˈjɛn lˈjoʊlˈiŋtɕʰˈi$|X1112222233333 44444222211111X|245

    where X and $ represent the SOS and EOS tokens.
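
As a quick sanity check, here is a minimal sketch of how one such line splits into its fields (the file path and variable names are illustrative, not from the repo):

```python
# Each entry is "wav_path|phoneme_string|tone_string|speaker_id".
with open("Data/train_list.txt", encoding="utf-8") as f:  # path is an assumption
    for line in f:
        path, phonemes, tones, speaker_id = line.strip().split("|")
        # one tone label per IPA character, with X aligned to the $ SOS/EOS markers
        assert len(phonemes) == len(tones)
```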

I'll leave this issue open for someone to fork the repo and modify it for Mandarin and Japanese support. I'm unfortunately too busy to work on it now.

yl4579 commented 1 year ago

For Japanese, you can do the same thing:

The conversion table from kana to IPA is the following (again, phonemizer didn't work for me).

```python
from collections import OrderedDict

# Longer kana sequences are listed first so they are replaced before single kana.
kana_mapper = OrderedDict([
    ("ゔぁ","bˈa"),
    ("ゔぃ","bˈi"),
    ("ゔぇ","bˈe"),
    ("ゔぉ","bˈo"),
    ("ゔゃ","bˈʲa"),
    ("ゔゅ","bˈʲɯ"),
    ("ゔゃ","bˈʲa"),
    ("ゔょ","bˈʲo"),

    ("ゔ","bˈɯ"),

    ("あぁ","aː"),
    ("いぃ","iː"),
    ("いぇ","je"),
    ("いゃ","ja"),
    ("うぅ","ɯː"),
    ("えぇ","eː"),
    ("おぉ","oː"),
    ("かぁ","kˈaː"),
    ("きぃ","kˈiː"),
    ("くぅ","kˈɯː"),
    ("くゃ","kˈa"),
    ("くゅ","kˈʲɯ"),
    ("くょ","kˈʲo"),
    ("けぇ","kˈeː"),
    ("こぉ","kˈoː"),
    ("がぁ","gˈaː"),
    ("ぎぃ","gˈiː"),
    ("ぐぅ","gˈɯː"),
    ("ぐゃ","gˈʲa"),
    ("ぐゅ","gˈʲɯ"),
    ("ぐょ","gˈʲo"),
    ("げぇ","gˈeː"),
    ("ごぉ","gˈoː"),
    ("さぁ","sˈaː"),
    ("しぃ","ɕˈiː"),
    ("すぅ","sˈɯː"),
    ("すゃ","sˈʲa"),
    ("すゅ","sˈʲɯ"),
    ("すょ","sˈʲo"),
    ("せぇ","sˈeː"),
    ("そぉ","sˈoː"),
    ("ざぁ","zˈaː"),
    ("じぃ","dʑˈiː"),
    ("ずぅ","zˈɯː"),
    ("ずゃ","zˈʲa"),
    ("ずゅ","zˈʲɯ"),
    ("ずょ","zˈʲo"),
    ("ぜぇ","zˈeː"),
    ("ぞぉ","zˈeː"),
    ("たぁ","tˈaː"),
    ("ちぃ","tɕˈiː"),
    ("つぁ","tsˈa"),
    ("つぃ","tsˈi"),
    ("つぅ","tsˈɯː"),
    ("つゃ","tɕˈa"),
    ("つゅ","tɕˈɯ"),
    ("つょ","tɕˈo"),
    ("つぇ","tsˈe"),
    ("つぉ","tsˈo"),
    ("てぇ","tˈeː"),
    ("とぉ","tˈoː"),
    ("だぁ","dˈaː"),
    ("ぢぃ","dʑˈiː"),
    ("づぅ","dˈɯː"),
    ("づゃ","zˈʲa"),
    ("づゅ","zˈʲɯ"),
    ("づょ","zˈʲo"),
    ("でぇ","dˈeː"),
    ("どぉ","dˈoː"),
    ("なぁ","nˈaː"),
    ("にぃ","nˈiː"),
    ("ぬぅ","nˈɯː"),
    ("ぬゃ","nˈʲa"),
    ("ぬゅ","nˈʲɯ"),
    ("ぬょ","nˈʲo"),
    ("ねぇ","nˈeː"),
    ("のぉ","nˈoː"),
    ("はぁ","hˈaː"),
    ("ひぃ","çˈiː"),
    ("ふぅ","ɸˈɯː"),
    ("ふゃ","ɸˈʲa"),
    ("ふゅ","ɸˈʲɯ"),
    ("ふょ","ɸˈʲo"),
    ("へぇ","hˈeː"),
    ("ほぉ","hˈoː"),
    ("ばぁ","bˈaː"),
    ("びぃ","bˈiː"),
    ("ぶぅ","bˈɯː"),
    ("ふゃ","ɸˈʲa"),
    ("ぶゅ","bˈʲɯ"),
    ("ふょ","ɸˈʲo"),
    ("べぇ","bˈeː"),
    ("ぼぉ","bˈoː"),
    ("ぱぁ","pˈaː"),
    ("ぴぃ","pˈiː"),
    ("ぷぅ","pˈɯː"),
    ("ぷゃ","pˈʲa"),
    ("ぷゅ","pˈʲɯ"),
    ("ぷょ","pˈʲo"),
    ("ぺぇ","pˈeː"),
    ("ぽぉ","pˈoː"),
    ("まぁ","mˈaː"),
    ("みぃ","mˈiː"),
    ("むぅ","mˈɯː"),
    ("むゃ","mˈʲa"),
    ("むゅ","mˈʲɯ"),
    ("むょ","mˈʲo"),
    ("めぇ","mˈeː"),
    ("もぉ","mˈoː"),
    ("やぁ","jˈaː"),
    ("ゆぅ","jˈɯː"),
    ("ゆゃ","jˈaː"),
    ("ゆゅ","jˈɯː"),
    ("ゆょ","jˈoː"),
    ("よぉ","jˈoː"),
    ("らぁ","ɽˈaː"),
    ("りぃ","ɽˈiː"),
    ("るぅ","ɽˈɯː"),
    ("るゃ","ɽˈʲa"),
    ("るゅ","ɽˈʲɯ"),
    ("るょ","ɽˈʲo"),
    ("れぇ","ɽˈeː"),
    ("ろぉ","ɽˈoː"),
    ("わぁ","ɯˈaː"),
    ("をぉ","oː"),

    ("う゛","bˈɯ"),
    ("でぃ","dˈi"),
    ("でぇ","dˈeː"),
    ("でゃ","dˈʲa"),
    ("でゅ","dˈʲɯ"),
    ("でょ","dˈʲo"),
    ("てぃ","tˈi"),
    ("てぇ","tˈeː"),
    ("てゃ","tˈʲa"),
    ("てゅ","tˈʲɯ"),
    ("てょ","tˈʲo"),
    ("すぃ","sˈi"),
    ("ずぁ","zˈɯa"),
    ("ずぃ","zˈi"),
    ("ずぅ","zˈɯ"),
    ("ずゃ","zˈʲa"),
    ("ずゅ","zˈʲɯ"),
    ("ずょ","zˈʲo"),
    ("ずぇ","zˈe"),
    ("ずぉ","zˈo"),
    ("きゃ","kˈʲa"),
    ("きゅ","kˈʲɯ"),
    ("きょ","kˈʲo"),
    ("しゃ","ɕˈʲa"),
    ("しゅ","ɕˈʲɯ"),
    ("しぇ","ɕˈʲe"),
    ("しょ","ɕˈʲo"),
    ("ちゃ","tɕˈa"),
    ("ちゅ","tɕˈɯ"),
    ("ちぇ","tɕˈe"),
    ("ちょ","tɕˈo"),
    ("とぅ","tˈɯ"),
    ("とゃ","tˈʲa"),
    ("とゅ","tˈʲɯ"),
    ("とょ","tˈʲo"),
    ("どぁ","dˈoa"),
    ("どぅ","dˈɯ"),
    ("どゃ","dˈʲa"),
    ("どゅ","dˈʲɯ"),
    ("どょ","dˈʲo"),
    ("どぉ","dˈoː"),
    ("にゃ","nˈʲa"),
    ("にゅ","nˈʲɯ"),
    ("にょ","nˈʲo"),
    ("ひゃ","çˈʲa"),
    ("ひゅ","çˈʲɯ"),
    ("ひょ","çˈʲo"),
    ("みゃ","mˈʲa"),
    ("みゅ","mˈʲɯ"),
    ("みょ","mˈʲo"),
    ("りゃ","ɽˈʲa"),
    ("りぇ","ɽˈʲe"),
    ("りゅ","ɽˈʲɯ"),
    ("りょ","ɽˈʲo"),
    ("ぎゃ","gˈʲa"),
    ("ぎゅ","gˈʲɯ"),
    ("ぎょ","gˈʲo"),
    ("ぢぇ","dʑˈe"),
    ("ぢゃ","dʑˈa"),
    ("ぢゅ","dʑˈɯ"),
    ("ぢょ","dʑˈo"),
    ("じぇ","dʑˈe"),
    ("じゃ","dʑˈa"),
    ("じゅ","dʑˈɯ"),
    ("じょ","dʑˈo"),
    ("びゃ","bˈʲa"),
    ("びゅ","bˈʲɯ"),
    ("びょ","bˈʲo"),
    ("ぴゃ","pˈʲa"),
    ("ぴゅ","pˈʲɯ"),
    ("ぴょ","pˈʲo"),
    ("うぁ","ɯˈa"),
    ("うぃ","ɯˈi"),
    ("うぇ","ɯˈe"),
    ("うぉ","ɯˈo"),
    ("うゃ","ɯˈʲa"),
    ("うゅ","ɯˈʲɯ"),
    ("うょ","ɯˈʲo"),
    ("ふぁ","ɸˈa"),
    ("ふぃ","ɸˈi"),
    ("ふぅ","ɸˈɯ"),
    ("ふゃ","ɸˈʲa"),
    ("ふゅ","ɸˈʲɯ"),
    ("ふょ","ɸˈʲo"),
    ("ふぇ","ɸˈe"),
    ("ふぉ","ɸˈo"),

    ("あ","a"),
    ("い","i"),
    ("う","ɯ"),
    ("え","e"),
    ("お","o"),
    ("か","kˈa"),
    ("き","kˈi"),
    ("く","kˈɯ"),
    ("け","kˈe"),
    ("こ","kˈo"),
    ("さ","sˈa"),
    ("し","ɕˈi"),
    ("す","sˈɯ"),
    ("せ","sˈe"),
    ("そ","sˈo"),
    ("た","tˈa"),
    ("ち","tɕˈi"),
    ("つ","tsˈɯ"),
    ("て","tˈe"),
    ("と","tˈo"),
    ("な","nˈa"),
    ("に","nˈi"),
    ("ぬ","nˈɯ"),
    ("ね","nˈe"),
    ("の","nˈo"),
    ("は","hˈa"),
    ("ひ","çˈi"),
    ("ふ","ɸˈɯ"),
    ("へ","hˈe"),
    ("ほ","hˈo"),
    ("ま","mˈa"),
    ("み","mˈi"),
    ("む","mˈɯ"),
    ("め","mˈe"),
    ("も","mˈo"),
    ("ら","ɽˈa"),
    ("り","ɽˈi"),
    ("る","ɽˈɯ"),
    ("れ","ɽˈe"),
    ("ろ","ɽˈo"),
    ("が","gˈa"),
    ("ぎ","gˈi"),
    ("ぐ","gˈɯ"),
    ("げ","gˈe"),
    ("ご","gˈo"),
    ("ざ","zˈa"),
    ("じ","dʑˈi"),
    ("ず","zˈɯ"),
    ("ぜ","zˈe"),
    ("ぞ","zˈo"),
    ("だ","dˈa"),
    ("ぢ","dʑˈi"),
    ("づ","zˈɯ"),
    ("で","dˈe"),
    ("ど","dˈo"),
    ("ば","bˈa"),
    ("び","bˈi"),
    ("ぶ","bˈɯ"),
    ("べ","bˈe"),
    ("ぼ","bˈo"),
    ("ぱ","pˈa"),
    ("ぴ","pˈi"),
    ("ぷ","pˈɯ"),
    ("ぺ","pˈe"),
    ("ぽ","pˈo"),
    ("や","jˈa"),
    ("ゆ","jˈɯ"),
    ("よ","jˈo"),
    ("わ","ɯˈa"),
    ("ゐ","i"),
    ("ゑ","e"),
    ("ん","ɴ"),
    ("っ","ʔ"),
    ("ー","ː"),

    ("ぁ","a"),
    ("ぃ","i"),
    ("ぅ","ɯ"),
    ("ぇ","e"),
    ("ぉ","o"),
    ("ゎ","ɯˈa"),
    ("ぉ","o"),

    ("を","o")
])

nasal_sound = OrderedDict([
    # before m, p, b
    ("ɴm","mm"),
    ("ɴb", "mb"),
    ("ɴp", "mp"),

    # before k, g
    ("ɴk","ŋk"),
    ("ɴg", "ŋg"),

    # before t, d, n, s, z, ɽ
    ("ɴt","nt"),
    ("ɴd", "nd"),
    ("ɴn","nn"),
    ("ɴs", "ns"),
    ("ɴz","nz"),
    ("ɴɽ", "nɽ"),

    ("ɴɲ", "ɲɲ"),

])

def hiragana2IPA(text):
    # replace kana with IPA (longer sequences first, following kana_mapper order),
    # then apply the nasal assimilation rules

    for k, v in kana_mapper.items():
        text = text.replace(k, v)

    for k, v in nasal_sound.items():
        text = text.replace(k, v)

    return text
```
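
For example (the input string below is just an illustration, not from the original comment):

```python
print(hiragana2IPA("こんにちは"))
# -> kˈonnˈitɕˈihˈa  (ん -> ɴ, which then assimilates to n before n via nasal_sound)
```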

You also need to add the intonation (pitch accent) labels for each word with Open JTalk.

data/jvs_ver1/jvs020/falset10/wav24kHz16bit/VOICEACTRESS100_005.wav|$ɕˈiɽˈɯbˈaː sˈaː ɸˈaː ɕˈʲɯːgˈekˈi dʑˈikˈeɴ mˈadˈe nˈi $ ɽˈitɕˈaːzˈɯ ɯˈa $ tɕˈiːmˈɯ mˈeː tˈo tˈomˈonˈi $ kˈokˈɯsˈai tˈekˈi nˈi sˈɯːpˈaː çˈiːɽˈoː$ ojˈobˈi $ jˈɯːmˈeːdʑˈiɴ tˈo ɕˈi tˈe $ nˈiɴtɕˈi sˈa ɽˈe tˈe iɽˈɯ$|XHHHLLLLLLL LLLH HHHH HHHHHHHHHHH HHHHLLLL LLLLLL LLL X HHHLLLLLLLL LLL X LLLLHHHH LLLL LLL LLLHHHHHH X LLLHHHHHHH LLLLLL LLL LLLHHHHH HHHLLLLLX HLLLLLL X LLLHHHHLLLLLL LLL LLL HHH X HHHLLLLL LLL HHH HHH LHHHX
data/jvs_ver1/jvs081/parallel100/wav24kHz16bit/VOICEACTRESS100_078.wav|$ɸˈʲoːgˈeɴ gˈʲoːɽˈetsˈɯ nˈo ɕˈiɸˈʲoː ɸˈʲoː o$ bˈɯɴɕˈi nˈo tˈaiɕˈʲoː sˈeː o aɽˈaɯˈasˈɯ $ tˈeɴ gˈɯɴ nˈo ɕˈiɸˈʲoː ɸˈʲoː o mˈotɕˈiː tˈe $ sˈɯɴdˈe jˈakˈɯ ɸˈʲoːgˈeɴ e bˈɯɴkˈai sˈɯɽˈɯ$|XLLLLHHHHH HHHHLLLLLLLL LLL LLLHHHHH HHHHH HX HHHLLLL LLL LLLHHHHHH HHHH H LHHHHHHLLL X LLLH LLLL LLL LLLHHHHH HHHHH H LLLHHHHH LLL X LLLHHHH LLLHHH HHHHHHHHL L LLLHHHHH LLLHHHX

where L and H represent low tone and high tone, respectively.

c9412600 commented 1 year ago

data/VCTK-Corpus/VCTK-Corpus/wav24/p275/p275_380.wav|$ɪts ɐ ɹˈiːəl pɹˈɑːbləm$$|XXXX X XXXXXX XXXXXXXXXXX|155

Hello, I want to know what "XXXX X XXXXXX XXXXXXXXXXX" and 155 mean. Thanks!

yl4579 commented 1 year ago

@c9412600 That was a typo that should not have been included; I have fixed it. 155 is the speaker id (never used during training, just for clarification), and X means no intonation (in contrast to 1, 2, 3, 4, 5, which represent the actual tones in Mandarin).

liuhuang31 commented 1 year ago

@yl4579 Thank you for sharing so many ideas! Using the AiShell-3 dataset, I can synthesize normal audio, and it sounds good.

But when generating speech for an unseen speaker, the timbre doesn't sound like the original. Is there any way to improve timbre similarity for unseen speakers?

CONGLUONG12 commented 1 year ago

@yl4579 I would like to ask whether any changes are needed for Vietnamese.

yl4579 commented 1 year ago

@CONGLUONG12 I don't think there is any change needed for Vietnamese. You only need to find a conversion table between chu quoc ngu and IPA (maybe phonemizer works for this case?) and label the tones (there should be six of them, so n_prods = 6) as in Mandarin.

MMMMichaelzhang commented 1 year ago

I have some questions about how to run inference in Mandarin. First, I am not sure if this symbol set is right for Mandarin:

_pad = "$"
_punctuation = ';:,.!?¡¿—…"«»“” '
_letters = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'
_letters_ipa = "ɹʂʴʰɛɤɔʈɚˈɕɥɐɑɪŋʐʊə"

Second, for ps = global_phonemizer.phonemize([text]), do I need to add tones to ps, like '$pˈu ʈʂˈʐ tˈaʊ nˈi ʂˈwɔ tˈɤ ʂˈʐ pˈu ʂˈʐ wˈɔ ɕˈiɑŋ tˈɤ$|X444 1111 4444 333 1111 555 444 444 444 333 33333 555X'?

If my ASR was trained with pinyin (like 'wo3 shi4 shui2'), not IPA, is it OK for inference? Thank you for the great work! @yl4579

liuhuang31 commented 1 year ago

I use pinyin for both the ASR model and StyleTTS, and can generate normal, good-sounding results.

MMMMichaelzhang commented 1 year ago

I use pinyin for asr and styletts, can generate a normal and good results.

Could you share some details, like how to set up the inference file?

_pad = "$"
_punctuation = ';:,.!?¡¿—…"«»“” '
_letters = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'
_letters_ipa = "1234"

Is this right? And do you need to replace class ProsodyPredictor like the author said? @liuhuang31

liuhuang31 commented 1 year ago

For Mandarin, I didn't use IPA phonemes; I used pinyin's initials and finals as phonemes.

  1. You can use pypinyin to generate pinyin.
  2. Using the _initials and _finals from pypinyin, the symbols are as below:

_pause = ["sil", "eos", "sp", ...]
_initials = ["b", "c", "ch", ...]
_finals = ["a", "ai", ...]
_tones = ["1", "2", "3", "4", "5"]
symbols = _pause + _initials + [i + j for i in _finals for j in _tones]

MMMMichaelzhang commented 1 year ago

For mandarin, i didn't use ipa_phonemes, use pinyin's initials and finals phonemes.

  1. You can use pypinyin to generate pinyin.
  2. The _initials and _finals used in pypinyin, then the symbols is below:

_pause = ["sil", "eos", "sp", ...] _initials = ["b", "c","ch", ...] _finals = ["a", "ai", ...] _tones = ["1", "2", "3", "4", "5"] symbols = _pause + _initials + [i + j for i in _finals for j in _tones]

Thank you very much! Did you change class ProsodyPredictor(nn.Module) in models.py? @liuhuang31

liuhuang31 commented 1 year ago

Sorry for forgetting to reply; I didn't change class ProsodyPredictor(nn.Module) in models.py.

JohnHerry commented 1 year ago

sorry to forget to reply, i didn't change <class ProsodyPredictor(nn.Module) > in models.

Hi liuhuang31, how did you train the Chinese pinyin PL-BERT model? Did you treat the ShengMu, YunMu, and YinDiao as separate phonemes, or the whole pinyin syllable as a single phoneme? Also, how did you get so much annotated Chinese text corpus? As far as I know, pypinyin-generated pinyin is error-prone, so I don't think it is a good way to build the PL-BERT corpus.

liuhuang31 commented 1 year ago

sorry to forget to reply, i didn't change <class ProsodyPredictor(nn.Module) > in models.

Hi, liuhuang31 How did you train the Chinese pinyin PL-BERT model? to treat the ShengMu, YunMu, YinDiao as separate phonemes? or see the whole pinyin as a single phoneme? and, how did you get so much annotated Chinese text corpus? As I know, the Pypinyin generated pinyin are error-prone, I do not think it a good way to get the PL-BERT corpus.

Hi JohnHerry, (1) I didn't train a phoneme-level BERT model. In the code below, ShengMu corresponds to _initials, YunMu to _finals, and YinDiao to _tones. The text features are phoneme, prosody, and tone; the phoneme features treat the ShengMu and YunMu as separate phonemes.

For example, given the text “去上学校”:
First, generate its prosody labels: “去上学校” -> “去#1上#1学校#4.”
Second, use pypinyin to generate the Chinese pinyin: “去#1上#1学校#4.” -> “去#1上#1学校#4.|qu5 shang5 xue3 xiao3”
Third, generate its text features (phoneme, prosody, tone): “去#1上#1学校#4.|qu5 shang5 xue3 xiao3” -> "q u sh ang x ue x iao|#1 #1 #1 #1 #0 #0 #4 #4|5 5 5 5 3 3 3 3".
Of course, you should convert the phoneme, prosody, and tone sequences to ids.

_pause = ["sil", "eos", "sp", ...]
_initials = ["b", "c","ch", ...]
_finals = ["a", "ai", ...]
_tones = ["1", "2", "3", "4", "5"]
symbols = _pause + _initials + [i + j for i in _finals for j in _tones]

(2) As for data, I just use the open AiShell-3 dataset (the zhvoice dataset can also be used, but its quality is very poor).

(3) Yes, pypinyin-generated pinyin is error-prone, but in my view, if the dataset is big enough, the errors will average out and be largely eliminated. Also, in my experiments with the AiShell-3 dataset, I can generate normal audio that doesn't sound bad.
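
Not from the original comment, but here is a minimal sketch of the pypinyin step under this symbol scheme, using lazy_pinyin with the INITIALS and FINALS_TONE3 styles (the #-prosody labels from the first step come from a separate front-end and are not shown):

```python
from pypinyin import lazy_pinyin, Style

text = "去上学校"
# ShengMu (initials) and tone-numbered YunMu (finals); strict=False keeps the
# written finals (e.g. "u" rather than "v" for ü after q/x).
initials = lazy_pinyin(text, style=Style.INITIALS, strict=False)    # ['q', 'sh', 'x', 'x']
finals = lazy_pinyin(text, style=Style.FINALS_TONE3, strict=False)  # ['u4', 'ang4', 'ue2', 'iao4']

phonemes, tones = [], []
for ini, fin in zip(initials, finals):
    tone = fin[-1] if fin[-1].isdigit() else "5"  # 5 = neutral tone
    fin = fin.rstrip("12345")
    if ini:  # some syllables (e.g. vowel-initial ones) have no ShengMu
        phonemes.append(ini)
        tones.append(tone)
    phonemes.append(fin)
    tones.append(tone)

print(" ".join(phonemes))  # q u sh ang x ue x iao
print(" ".join(tones))     # 4 4 4 4 2 2 4 4
```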

JohnHerry commented 1 year ago

sorry to forget to reply, i didn't change <class ProsodyPredictor(nn.Module) > in models.

Hi, liuhuang31 How did you train the Chinese pinyin PL-BERT model? to treat the ShengMu, YunMu, YinDiao as separate phonemes? or see the whole pinyin as a single phoneme? and, how did you get so much annotated Chinese text corpus? As I know, the Pypinyin generated pinyin are error-prone, I do not think it a good way to get the PL-BERT corpus.

HI, JohnHerry (1) I didn't train phoneme level bert model. In the below, ShengMu is _initials, YunMu is _finals, YinDiao is _tones. For text features: phoneme, prosody, tone. phoneme features treat the ShengMu, YunMu as separate phonemes.

For example, give a text “去上学校”: First we generate its prosody: “去上学校” -> “去#1上#1学校#4.” Second use pypinyin to generate chinese pinyin: “去#1上#1学校#4.” -> “去#1上#1学校#4.|qu5 shang5 xue3 xiao3” Third generate its text features(phoneme, prosody, tone): “去#1上#1学校#4.|qu5 shang5 xue3 xiao3” -> "q u sh ang x ue x iao|#1 #1 #1 #1 #0 #0 #4 #4|5 5 5 5 3 3 3 3". Certainly, you should convert phoneme, prosody and tone to id.

_pause = ["sil", "eos", "sp", ...]
_initials = ["b", "c","ch", ...]
_finals = ["a", "ai", ...]
_tones = ["1", "2", "3", "4", "5"]
symbols = _pause + _initials + [i + j for i in _finals for j in _tones]

(2) As for me, just use the open dataset: aishell3 dataset(zhvoice dataset also can use, but its quality is very poor).

(3) Yes, "Pypinyin generated pinyin are error-prone", but in my view, if the dataset is big enough, the error will be average and "eliminate". Also in my experiment use aishell3 dataset, i can generate a normal audio, which sound not bad.

Thanks for the detailed information. It helps me a lot.

zdj97 commented 1 year ago

Hello, pypinyin does not perform well in some cases, so I use another phoneme set that is not like pypinyin's. In this case, how should I prepare the file lists, and how do I train or fine-tune?

liuhuang31 commented 1 year ago

hello, the pypinyin does not perform well someways. So i use another phoneme set, not like pypinyin. In this way, how can i prepare the filelists and how to train or finetune?

Hi zdj97,

Whether it is pypinyin or any other phoneme set, its role is just to convert text into phonemes, so simply use the new phoneme set. And remember to use the new phoneme set to re-train the ASR model.

Zhongxu-Wang commented 1 year ago

I did try training for other languages including Mandarin, Japanese, Hindi etc., though it requires a few changes:

Hello, what tools did you use to convert the LJSpeech and LibriTTS databases to IPA?

zdj97 commented 1 year ago

Hi, I did not convert LJSpeech or VCTK to IPA, so I did not use the pretrained models in these scripts. I trained the ASR and pitch models from scratch using my own phoneme set, and the results are not done yet. When the models are done, I will comment here.

yihuitang commented 1 year ago

Hi @yl4579 ,

A stupid question: how can I convert

SSB11000297|zhang1 hui4 yi2 % chu1 yan3 de5 % dian4 yin3 % you3 shen2 me5 $|

to

SSB11000297.wav|$ʈʂˈɑŋxˈweɪˈi ʈʂʰˈujˈɛntˈɤ tˈjɛnˈin jˈoʊʂˈənmˈɤ$|X111114444422 111113333555 44444333 33332222555X|

Is the conversion done by meldataset.py during training or do I need to write a preprocessor to convert it before training?

Thanks

yl4579 commented 1 year ago

@yihuitang You need to code it yourself because meldataset.py was written for English only. I have provided the conversion table, so it should not be difficult for you to convert it to the desired format. Unfortunately, I couldn't find the exact code I used to generate the dataset, but all you need to do is split the text by space, get the number (tone), convert the pinyin to IPA using the table I provided, and repeat the number (tone) N times, where N is the length of the IPA string.
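
Not the original preprocessing script, but a minimal sketch of that recipe, assuming a PINYIN_TO_IPA dict built from the conversion table posted above (all names are illustrative, and word-boundary/"%" handling from the AiShell transcripts is omitted):

```python
# PINYIN_TO_IPA is assumed to be built from the table above,
# e.g. {"zhang": "ʈʂˈɑŋ", "hui": "xˈweɪ", "yi": "ˈi", ...}

def to_ipa_and_tones(syllables, pinyin_to_ipa):
    """['zhang1', 'hui4', 'yi2'] -> ('$ʈʂˈɑŋxˈweɪˈi$', 'X111114444422X')."""
    phonemes, tones = "", ""
    for syl in syllables:
        pinyin, tone = syl[:-1], syl[-1]  # 'zhang1' -> 'zhang', '1'
        ipa = pinyin_to_ipa[pinyin]       # look up the IPA string in the table
        phonemes += ipa
        tones += tone * len(ipa)          # repeat the tone once per IPA character
    # $ marks SOS/EOS in the phoneme string; X is the matching "no tone" label
    return "$" + phonemes + "$", "X" + tones + "X"
```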

yihuitang commented 1 year ago

@yl4579 Thanks for your prompt reply. I'll start with the code for converting the format. Should there be a space between IPA symbols? Taking zhang1 hui4 as an example, which of the following IPA representations is correct or best?

1. ʈʂˈɑŋxˈweɪˈ (no space)
2. ʈʂˈɑŋ xˈweɪˈ (space between words)
3. ʈʂˈ ɑŋ xˈ weɪˈ (space between words and space between ShengMu and YunMu)
4. ʈ ʂˈ ɑ ŋ xˈ w e ɪˈ (space between each IPA)
yl4579 commented 1 year ago

@yihuitang In my case, I separated words because I used a PL-BERT trained jointly on Chinese, Japanese, and English, and word boundaries were used when pre-training the PL-BERT, but you may not need to do that. If you do not plan to use any language model, or if your language model is at the character level (for example, your grapheme in PL-BERT is the character instead of the word), I don't think there is any difference.

Note that words were separated by "%" in the AiShell dataset, so "zhang1 hui4 yi2" is one word, and "chu1 yan3 de5" is another word. This is why they were converted to "ʈʂˈɑŋxˈweɪˈi ʈʂʰˈujˈɛntˈɤ" in my case, where the only space is between these two words, not syllables.

yihuitang commented 1 year ago

@yl4579 , Thanks for your guidance. I do plan to use your PL-BERT later if I can successfully implement Mandarin in StyleTTS.

I would also like to train StyleTTS with customized data, which has no "%" in the dataset. So for the customized dataset with PL-BERT, I should use option 1 (no space). Am I right?

1. ʈʂˈɑŋxˈweɪˈ (no space)

yihuitang commented 1 year ago

Hi @yl4579 , a quick update:

I've created a script to convert pinyins to IPAs and get filelists in the desired format for Mandarin. Here are train and val lists for aishell3 dataset: train_list_aishell3.txt val_list_aishell3.txt

Class ProsodyPredictor is also updated with your code above. And then I tried to update meldataset.py but got stuck.

  1. What should n_prods and prod_embd be? Should they be stored in config.yml?
  2. I can get the tone for each IPA symbol, but where and how should I use it?
```python
class FilePathDataset(torch.utils.data.Dataset):
    def __init__(self,
                 data_list,
                 sr=24000,
                 data_augmentation=False,
                 validation=False,
                 ):

        spect_params = SPECT_PARAMS
        mel_params = MEL_PARAMS

        #_data_list = [l[:-1].split('|') for l in data_list]
        _data_list = [l.split('|') for l in data_list]
        self.data_list = [data if len(data) == 4 else (*data, 0) for data in _data_list]
        self.text_cleaner = TextCleaner()
        self.sr = sr

        self.to_melspec = torchaudio.transforms.MelSpectrogram(**MEL_PARAMS)

        self.mean, self.std = -4, 4
        self.data_augmentation = data_augmentation and (not validation)
        self.max_mel_length = 192

#         self.global_phonemizer = phonemizer.backend.EspeakBackend(language='en-us', preserve_punctuation=True,  with_stress=True)

    def __len__(self):
        return len(self.data_list)

    def __getitem__(self, idx):
        data = self.data_list[idx]
        path = data[0]

        wave, text_tensor, tone_tensor, speaker_id = self._load_tensor(data)

        mel_tensor = preprocess(wave).squeeze()

        acoustic_feature = mel_tensor.squeeze()
        length_feature = acoustic_feature.size(1)
        acoustic_feature = acoustic_feature[:, :(length_feature - length_feature % 2)]

        # NOTE: tone_tensor is loaded above but not yet returned here
        return speaker_id, acoustic_feature, text_tensor, path

    def _load_tensor(self, data):
        wave_path, text, tone, speaker_id = data
        speaker_id = int(speaker_id)
        wave, sr = sf.read(wave_path)
        if wave.shape[-1] == 2:
            wave = wave[:, 0].squeeze()
        if sr != 24000:
            wave = librosa.resample(wave, sr, 24000)
            print(wave_path, sr)

        wave = np.concatenate([np.zeros([5000]), wave, np.zeros([5000])], axis=0)

        text = self.text_cleaner(text)
        tone = self.text_cleaner(tone)

        text.insert(0, 0)
        text.append(0)

        tone.insert(0, 0)
        tone.append(0)

        text = torch.LongTensor(text)
        tone = torch.LongTensor(tone)

        return wave, text, tone, speaker_id

    def _load_data(self, data):
        wave, text_tensor, tone, speaker_id = self._load_tensor(data)
        mel_tensor = preprocess(wave).squeeze()

        mel_length = mel_tensor.size(1)
        if mel_length > self.max_mel_length:
            random_start = np.random.randint(0, mel_length - self.max_mel_length)
            mel_tensor = mel_tensor[:, random_start:random_start + self.max_mel_length]

        return mel_tensor, speaker_id
```

leminhnguyen commented 1 year ago

@yl4579 I have the same question as @yihuitang. What is a reasonable prod_embd? And how should it be used in training?

yl4579 commented 1 year ago

@yihuitang n_prods should be the number of tones (e.g., for Mandarin Chinese it should be 5, for Japanese 2, and for Cantonese 6). The tones are represented as indices, one-hot encoded, and converted to the prosody embedding self.embedding in the modified ProsodyPredictor. @yihuitang For prod_embd I used 128; it shouldn't matter that much, to be honest.
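
For illustration only, here is how those numbers could plug into the modified module above; style_dim, d_hid, and nlayers are placeholders rather than values from the repo's config, and depending on how the "X" no-tone label is indexed you may need one extra embedding slot:

```python
# 5 Mandarin tones, prod_embd = 128 as suggested above (other values are placeholders).
predictor = ProsodyPredictor(n_prods=5, prod_embd=128,
                             style_dim=128, d_hid=512, nlayers=3)
```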

JohnHerry commented 1 year ago

This may be somewhat outside the scope of this project, but is there any good text-to-prosody solution for Mandarin?

Most text prosody models are BERT-based, e.g. BERT+Linear or BERT+BiLSTM+CRF. But in our experiments, those models are not very good for Mandarin; in particular #2, the prosodic phrase boundary, is very hard to predict. We also tried other methods, but none of them is good enough. I think it is because the #2 prosody label is sparse in the training data, so convergence is slower. We tried weighting its loss function but got no improvement. We also tried a cascade model structure as in "A Mandarin Prosodic Boundary Prediction Model Based on Multi-Task Learning", but it is also not very effective.

yl4579 commented 1 year ago

@JohnHerry I'm a little confused about what the #2 "prosody segment" is. I guess issue #2 only involves multilingual support for phonemization, so I'm not sure why it is related to prosody prediction. Do you mean the 2nd tone in Mandarin? From what I found online, "With regard to lexical tone, the falling Tone 4 is the most frequent (34.9%), followed by the stable high Tone 1 (24.8%) and rising Tone 2 (23.9%). The low-dipping Tone 3 is the least frequent tone in our corpus (16.4%)," so I guess they are quite evenly distributed.

yihuitang commented 1 year ago

@yl4579 Where should the one-hot encoding happen? In the modified ProsodyPredictor or in the modified meldataset.py?

JohnHerry commented 1 year ago

@JohnHerry I'm a little confused what #2 "prosody segment" is? I guess the issue (#2 ) only involves multilingual support for phonemization, not sure why it is related to prosody prediction. Do you mean the 2nd tone in Mandarin? From what I found online, it says "With regard to lexical tone, the falling Tone 4 is the most frequent (34.9%), followed by the stable high Tone 1 (24.8%) and rising Tone 2 (23.9%). The low-dipping Tone 3 is least frequent tone in our corpus (16.4%)." so I guess they are quite equally distributed.

No, they are not tones in pinyin phonemes; they are text prosody labels. The #1, #2, and #3 are prosody tags on the text, where:

#1 is a Prosodic Word (PW),

#2 is a Prosodic Phrase (PPH),

and #3 is an Intonational Phrase (IPH), e.g. 玄奘#1为保存#2由#1天竺#1经#1丝绸之路#2带回#1长安的#1经卷#1佛像#3主持#1修建了#1大雁塔#4。 They can somewhat be seen as speech pause levels, where #3 gets a longer pause than #2. Text prosody labels can help generate better acoustic prosody in the synthesized speech. I read your first answer in this issue; in the code of the ProsodyPredictor class, I think the third parameter of the forward function is somewhat like that text prosody.

GuangChen2016 commented 1 year ago

Are there any samples from the Mandarin corpus?

hdmjdp commented 1 year ago

sorry to forget to reply, i didn't change <class ProsodyPredictor(nn.Module) > in models.

Hi, liuhuang31 How did you train the Chinese pinyin PL-BERT model? to treat the ShengMu, YunMu, YinDiao as separate phonemes? or see the whole pinyin as a single phoneme? and, how did you get so much annotated Chinese text corpus? As I know, the Pypinyin generated pinyin are error-prone, I do not think it a good way to get the PL-BERT corpus.

HI, JohnHerry (1) I didn't train phoneme level bert model. In the below, ShengMu is _initials, YunMu is _finals, YinDiao is _tones. For text features: phoneme, prosody, tone. phoneme features treat the ShengMu, YunMu as separate phonemes.

For example, give a text “去上学校”: First we generate its prosody: “去上学校” -> “去#1上#1学校#4.” Second use pypinyin to generate chinese pinyin: “去#1上#1学校#4.” -> “去#1上#1学校#4.|qu5 shang5 xue3 xiao3” Third generate its text features(phoneme, prosody, tone): “去#1上#1学校#4.|qu5 shang5 xue3 xiao3” -> "q u sh ang x ue x iao|#1 #1 #1 #1 #0 #0 #4 #4|5 5 5 5 3 3 3 3". Certainly, you should convert phoneme, prosody and tone to id.

_pause = ["sil", "eos", "sp", ...]
_initials = ["b", "c","ch", ...]
_finals = ["a", "ai", ...]
_tones = ["1", "2", "3", "4", "5"]
symbols = _pause + _initials + [i + j for i in _finals for j in _tones]

(2) As for me, just use the open dataset: aishell3 dataset(zhvoice dataset also can use, but its quality is very poor).

(3) Yes, "Pypinyin generated pinyin are error-prone", but in my view, if the dataset is big enough, the error will be average and "eliminate". Also in my experiment use aishell3 dataset, i can generate a normal audio, which sound not bad.

Did you insert a blank index between the shengmu and yunmu, such as "n i h ao" --> [0 x 0 x 0 x 0 x 0], when you trained the ASR model? @liuhuang31

liuhuang31 commented 1 year ago

@hdmjdp Hi, I didn't insert blanks between the shengmu and yunmu.

hdmjdp commented 1 year ago

Thanks. So when you trained the ASR model, you did not insert any blanks?

hdmjdp commented 1 year ago

@liuhuang31 As you said, how do you set the CTC loss config?

blank_index = train_dataloader.dataset.text_cleaner.word_index_dictionary[" "]  # get blank index
criterion = build_criterion(critic_params={'ctc': {'blank': blank_index}})

liuhuang31 commented 1 year ago

@hdmjdp The training data is as below: "aishell3/train/wav/SSB0018/audio/00180007.wav|25|sil r an4 #1 d a4 #0 j ia1 #1 k uai4 #0 d ian3 #1 x ia4 #0 l ai2 #4 。 eos"

After processing, the data is "blank r an blank d a blank j ia blank k uai blank d ian blank x ia blank l ai blank 。 blank_"

hdmjdp commented 1 year ago

@yihuitang I see, all prosody labels are changed to blanks. When you train StyleTTS, do you also insert blanks into the phoneme sequence?

liuhuang31 commented 1 year ago

@hdmjdp Yes, StyleTTS uses the same sequence as the ASR.

GuangChen2016 commented 1 year ago

@hdmjdp The train data is as below: "aishell3/train/wav/SSB0018/audio/00180007.wav|25|sil r an4 #1 d a4 #0 j ia1 #1 k uai4 #0 d ian3 #1 x ia4 #0 l ai2 #4 。 eos"

after process, its data is "blank r an blank d a blank j ia blank k uai blank d ian blank x ia blank l ai blank 。 blank_"

@liuhuang31, if you convert all prosody labels into blanks, then how do you control the pauses in TTS?

liuhuang31 commented 1 year ago

@GuangChen2016 Hello,
(1) For #1, add a blank, e.g. "i#1love" -> "i blank love".
(2) For #2, keep #2 as a phoneme and add blanks around it, e.g. "i#2love" -> "i blank #2 blank love".
(3) #3 must be followed by punctuation, e.g. "i#1love#3,you" -> "i blank love blank , blank you".
(4) For #4, same as #3, it must be followed by punctuation, e.g. "i#1love#3,you#4." -> "i blank love blank , blank you blank . blank_".
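
A rough sketch of those four rules (the blank and blank_ tokens come from the comment above; the regex handling and punctuation sets are my own assumptions):

```python
import re

def insert_pause_blanks(text):
    text = text.replace("#1", " blank ")                           # (1) prosodic word boundary
    text = text.replace("#2", " blank #2 blank ")                  # (2) keep #2 as a phoneme
    text = re.sub(r"#3\s*([,，;；])", r" blank \1 blank ", text)    # (3) #3 followed by a pause mark
    text = re.sub(r"#4\s*([.。!！?？])", r" blank \1 blank_", text)  # (4) #4 ends the sentence
    return re.sub(r"\s+", " ", text).strip()

print(insert_pause_blanks("i#1love#3,you#4."))
# -> i blank love blank , blank you blank . blank_
```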

GuangChen2016 commented 1 year ago

@liuhuang31 I see, thank you. Could you share a synthesized sample?

liuhuang31 commented 1 year ago

@liuhuang31 I see, thanks you. Could you share one synthesized samples?

@GuangChen2016 You can give me several audio files as speaker references (for which you have the copyright, or from an open dataset) and some Chinese text to generate.

GuangChen2016 commented 1 year ago

@liuhuang31
Synthesized Text: 杭州亚运会即将在9月开幕,这是继北京冬奥会之后,我国再次承办的一项国际大型体育赛事。然而,在这场盛会上,我们将看不到来自俄罗斯和白俄罗斯的运动员的身影。他们被国际奥委会以“技术原因”为由拒之门外,无缘参加杭州亚运会。 这一决定引起了我国的不满和反对。我国一直主张欢迎符合条件的俄罗斯和白俄罗斯运动员参加杭州亚运会,而不是对他们进行歧视和限制。我国认为,运动员是否参赛应该由他们自己的体育表现决定,而不是其他因素,包括战争等。我国还表示,愿意为他们搭建一个良好的参赛平台,让他们以中立身份参赛,并且不会影响奖牌的分配。 ref audios as belows: ref.zip

liuhuang31 commented 1 year ago

@liuhuang31 Synthesized Text: 杭州亚运会即将在9月开幕,这是继北京冬奥会之后,我国再次承办的一项国际大型体育赛事。然而,在这场盛会上,我们将看不到来自俄罗斯和白俄罗斯的运动员的身影。他们被国际奥委会以“技术原因”为由拒之门外,无缘参加杭州亚运会。 这一决定引起了我国的不满和反对。我国一直主张欢迎符合条件的俄罗斯和白俄罗斯运动员参加杭州亚运会,而不是对他们进行歧视和限制。我国认为,运动员是否参赛应该由他们自己的体育表现决定,而不是其他因素,包括战争等。我国还表示,愿意为他们搭建一个良好的参赛平台,让他们以中立身份参赛,并且不会影响奖牌的分配。 ref audios as belows: ref.zip

@GuangChen2016 For some reasons, the reference audio needs to be longer than 9 seconds, so the provided audio was copied/looped to reach 9 seconds. The generated waves are below: ref_gen.zip

liuhuang31 commented 1 year ago

@GuangChen2016 In addition, the reference audio can in principle be any length. But when I convert the StyleTTS model to an ONNX model, the reference length is fixed to about 9 seconds, so the reference audio needs to be longer than 9 seconds.
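
A tiny sketch of that workaround (the function name, sample rate, and 9-second threshold handling are assumptions based on the comment above):

```python
import numpy as np

def loop_to_min_length(wave, sr=24000, min_sec=9.0):
    # repeat the reference audio until it is at least min_sec long
    min_len = int(sr * min_sec)
    if len(wave) < min_len:
        wave = np.tile(wave, int(np.ceil(min_len / len(wave))))[:min_len]
    return wave
```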

sunnnnnnnny commented 11 months ago

@liuhuang31 It sounds good. Is this the result of a model trained on open-source datasets?

liuhuang31 commented 11 months ago

@liuhuang31 it is well.Is this open source dataset train's model result ?

@sunnnnnnnny Yes, the datasets are AiShell-3, zhvoice, and VCTK.

SaltedSlark commented 11 months ago

@liuhuang31 it is well.Is this open source dataset train's model result ?

@sunnnnnnnny yes, the dataset is: aishell3, zhvoice and vctk.

Hi Liu, as far as I know VCTK is an English dataset; why do you use it?