york135 / zero_shot_svs_ASRU2023

The official implementation of the paper "Zero-shot singing voice synthesis from musical score"

How to obtain the MPop600 dataset? #1

Open MuHyeonSon opened 2 months ago

MuHyeonSon commented 2 months ago

Hello. Thank you for open-sourcing such an excellent project.

I would like to run inference with your model. As far as I know, the MPop600 dataset is supposed to serve as the inference input. Since the model takes musical scores as input, I need to download the MPop600 dataset to see exactly what a score input looks like and then run inference, right? Also, I heard that one has to contact the authors to download the dataset, but I searched everywhere and could not find a way to contact them, so I would like to ask whether you know a feasible way to obtain the data.

Thank you 😃

york135 commented 2 months ago

Hi MuHyeonSon,

Well, I actually obtained the dataset in 2021 by contacting Prof. Yi-Wen Liu (the fourth author of the MPOP600 dataset's paper) via e-mail (ywliu@ee.nthu.edu.tw). Maybe you can also send an e-mail to him. Hopefully he will agree to share the dataset with you.

By the way, I noticed you wrote both a Chinese and an English version of the issue (?). It seems that in the English version you wanted to know what the musical score looks like so you can feed your own data for inference?

If this is true, I can give you a sample of that dataset's musical score annotation so that you know what it looks like. Here is a sample of the first phrase of m1_001 (one of the test songs in my train/test split): https://drive.google.com/file/d/1MiJK6u8O_ls3bM-qVOjmD7NFfro8whEQ/view?usp=sharing

It's basically just a simple MusicXML file with lyric labels.
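For readers unfamiliar with the format, here is a minimal sketch of what a MusicXML score with lyric labels generally looks like, and how to extract the (pitch, lyric) pairs with Python's standard library. The XML below is an illustrative toy example, not the actual MPop600 annotation, whose details may differ:

```python
import xml.etree.ElementTree as ET

# Toy MusicXML fragment: two quarter notes, each carrying a lyric syllable.
MUSICXML = """<?xml version="1.0" encoding="UTF-8"?>
<score-partwise version="3.1">
  <part-list>
    <score-part id="P1"><part-name>Voice</part-name></score-part>
  </part-list>
  <part id="P1">
    <measure number="1">
      <note>
        <pitch><step>C</step><octave>4</octave></pitch>
        <duration>4</duration>
        <lyric><syllabic>single</syllabic><text>la</text></lyric>
      </note>
      <note>
        <pitch><step>D</step><octave>4</octave></pitch>
        <duration>4</duration>
        <lyric><syllabic>single</syllabic><text>mi</text></lyric>
      </note>
    </measure>
  </part>
</score-partwise>
"""

def note_lyric_pairs(xml_text):
    """Return (pitch, lyric) tuples for every note that carries a lyric."""
    root = ET.fromstring(xml_text)
    pairs = []
    for note in root.iter("note"):
        pitch = note.find("pitch")
        lyric_text = note.find("lyric/text")
        if pitch is not None and lyric_text is not None:
            name = pitch.findtext("step") + pitch.findtext("octave")
            pairs.append((name, lyric_text.text))
    return pairs

print(note_lyric_pairs(MUSICXML))  # -> [('C4', 'la'), ('D4', 'mi')]
```

A full-featured parser (e.g. music21) would also handle ties, melismas (`syllabic` values of `begin`/`middle`/`end`), and tempo marks, but the basic structure is just notes with attached `<lyric>` elements.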

Feel free to ask more questions, but I've been pretty busy recently, so I still haven't had time to modify this repo to support the M4Singer dataset's data format (a much larger dataset that is publicly available)...

MuHyeonSon commented 2 months ago

Thank you so much for responding so kindly and quickly. 😺

Thanks to you, I was able to get the data. Also, since you shared a simple MusicXML sample, I was able to run inference with the model you provided.

I'd like to ask you a few more questions.

  1. Is the currently released model, "model_0311_propose_300000.pth.tar", a fully trained model?

  2. Could the generation quality be improved in terms of sound quality by training further?

  3. Was "model_0311_propose_300000.pth.tar" trained using both weakly labeled and strongly labeled data?

  4. Also, if suitable data exists, is it possible to train the model on other languages?

Thank you for reading my question even though you must be busy. Have a peaceful day. ☀️

york135 commented 2 months ago

Hi,

Sorry for the late reply. Yes, it is a fully trained model that uses both weakly labeled and strongly labeled data. It is also true that the audio it generates is not of high quality (which aligns with my experimental results, where the audio naturalness is not very satisfactory). I suspect the scale of the data is still not large enough (just 10 hours of labeled data and 50 hours of weakly labeled data, while a zero-shot speech synthesis model typically uses hundreds of hours of training data). That's also why I think one should probably use M4Singer, which has 30 hours of labeled data.

As for question 4, yes, you should be able to train a similar model in other languages. I've actually tried the Kiritan database (Japanese). The model does synthesize Japanese, though the audio quality is likewise not very good.

MuHyeonSon commented 2 months ago

Thank you for your detailed answer!! Everything I was curious about has been resolved.

I look forward to and cheer you on for such wonderful research in the future! 😃