yl4579 / StyleTTS2

StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models
MIT License

Fine-Tune Exception `Kernel size can't be greater than actual input size` #80

Closed · devidw closed this issue 11 months ago

devidw commented 11 months ago

Trying to fine-tune on a custom dataset.

Everything starts normally, but after some steps the script dies with:

RuntimeError: Calculated padded input size per channel: (5 x 4). Kernel size: (5 x 5). Kernel size can't be greater than actual input size
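
For context, the failure mode is simply that the feature map reaching a 5x5 convolution has a spatial size smaller than the kernel. A minimal, hypothetical sketch (not StyleTTS2 code) that reproduces the same RuntimeError with plain PyTorch:

```python
import torch
import torch.nn as nn

# Hypothetical repro: a 5x5 conv with no padding applied to an input
# whose spatial size is smaller than the kernel.
conv = nn.Conv2d(in_channels=1, out_channels=8, kernel_size=5)  # padding=0

ok = torch.randn(1, 1, 80, 80)   # a normal-length mel segment: works
conv(ok)

tiny = torch.randn(1, 1, 5, 4)   # roughly what a very short mel can shrink to after downsampling
conv(tiny)                       # RuntimeError: Calculated padded input size per channel: (5 x 4).
                                 # Kernel size: (5 x 5). Kernel size can't be greater than actual input size
```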

full log:

Some weights of the model checkpoint at microsoft/wavlm-base-plus were not used when initializing WavLMModel: ['encoder.pos_conv_embed.conv.weight_g', 'encoder.pos_conv_embed.conv.weight_v']
- This IS expected if you are initializing WavLMModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing WavLMModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of WavLMModel were not initialized from the model checkpoint at microsoft/wavlm-base-plus and are newly initialized: ['encoder.pos_conv_embed.conv.parametrizations.weight.original1', 'encoder.pos_conv_embed.conv.parametrizations.weight.original0']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
bert loaded
bert_encoder loaded
predictor loaded
decoder loaded
text_encoder loaded
predictor_encoder loaded
style_encoder loaded
diffusion loaded
text_aligner loaded
pitch_extractor loaded
mpd loaded
msd loaded
wd loaded
BERT AdamW (
Parameter Group 0
    amsgrad: False
    base_momentum: 0.85
    betas: (0.9, 0.99)
    capturable: False
    differentiable: False
    eps: 1e-09
    foreach: None
    fused: None
    initial_lr: 1e-05
    lr: 1e-05
    max_lr: 2e-05
    max_momentum: 0.95
    maximize: False
    min_lr: 0
    weight_decay: 0.01
)
decoder AdamW (
Parameter Group 0
    amsgrad: False
    base_momentum: 0.85
    betas: (0.0, 0.99)
    capturable: False
    differentiable: False
    eps: 1e-09
    foreach: None
    fused: None
    initial_lr: 0.0001
    lr: 0.0001
    max_lr: 0.0002
    max_momentum: 0.95
    maximize: False
    min_lr: 0
    weight_decay: 0.0001
)
d46cd4f7-3a0f-4f32-b526-bc16ad7e06e2.wav 22050
7e02e715-bcdb-49ec-be36-b3a68f38a5e0.wav 22050
937bb873-614b-4ccb-af4b-bd68fb75406b.wav 22050
403f6c79-a96f-45f9-b594-eddb6ec94b94.wav 22050
b68e1492-32c3-4eb1-b867-2eeb73f7462c.wav 22050
e3b03590-983b-4a99-b5af-56a87c349e33.wav 22050
2f3dbcca-fdff-41aa-b6c3-4a0da57c53a2.wav 22050
7f5587b6-bb14-470f-a986-5d032ff30749.wav 22050
202e81f8-f079-4802-9717-bcae18092404.wav 22050
17bbc0fd-28b6-4dce-8ac7-dafe24475fd0.wav 22050
e8a372e6-23ed-4987-a20e-16f0ce00c42e.wav 22050
39cef89b-b9cb-4541-9cec-6ec7d06c0350.wav 22050
a5b3bb7f-4910-4038-bd98-dc4dda8be39e.wav 22050
457d9c66-8fc6-4952-9dfb-c18f992df48f.wav 22050
8eb827b4-fca0-4a73-97cd-eaba2eca739f.wav 22050
2ce3ae41-9328-46e8-b857-9a7c5f8adc95.wav 22050
cfb78b7d-a618-4b6f-9dac-e5b7491cb521.wav 22050
fe6f77f1-3f96-4ad5-8b91-bed1c3e9c80d.wav 22050
86190c57-442c-4f1a-a2e8-4c9476d8c9e7.wav 22050
d551b815-e632-440b-98da-4b9b9cf56bd0.wav 22050
7a309082-0c4f-4528-a695-33d68c95f3ae.wav 22050
d984df73-0a49-4c42-b377-5efc91c434ca.wav 22050
009e44fa-a050-4406-a4ff-3794e381125a.wav 22050
ce2a78ea-4065-4aad-94a0-27a46da859ce.wav 22050
fc135efc-fae5-467a-a31a-9451df3f7b0a.wav 22050
afdd06e3-813a-4139-a058-a6d90ab3cce1.wav 22050
f1293283-1c2b-45f3-86f0-7fbf0bb91116.wav 22050
bd9ab5d7-fe0b-4e97-943f-692adc47eec5.wav 22050
71720fae-2666-498e-8ef4-481acf7faba5.wav 22050
784a725f-c9b2-4dbe-b2f3-0713bbe3e3aa.wav 22050
32ba1ebf-0d48-4a7a-abac-ccd60ecfd10c.wav 22050
3c325b72-3740-4474-81a0-86f30f4d576e.wav 22050
99139618-03c0-451a-842b-e1b960ea7849.wav 22050
bafefb22-403d-4c43-a777-4775fdb46ef4.wav 22050
ba8f9ace-4735-412c-bd11-9da416471617.wav 22050
5355566d-e987-4a53-b68a-e1c8646c285d.wav 22050
378991d7-9bb7-4571-821a-adab8860e55f.wav 22050
fb78aea6-11e9-4a63-ae98-6ce5f7858ede.wav 22050
0e65b59b-a6fc-4e4e-93a3-fc0d2f571857.wav 22050
c452c873-f2ed-4628-babc-89248fa74084.wav 22050
ab7cdb3e-650d-485e-a61a-a65162be4560.wav 22050
f45d2d30-20a3-44b8-9e72-4f89db471f4a.wav 22050
2822ecc0-4a4f-4483-9554-77a59cd9b84b.wav 22050
861e540d-7525-482b-943a-fc4a036476d8.wav 22050
877a7b3a-7ea0-4f9e-95ce-b11dfbbbae33.wav 22050
7c682043-9126-4ed4-87e2-61a858affcc2.wav 22050
f1387461-bb0c-49c8-a43f-559686f03d17.wav 22050
44d0e811-efc1-46e1-8691-b5c18588437c.wav 22050
7cc58d00-8846-4286-96a0-6c639d912c51.wav 22050
3b3fa58f-eb67-4d97-b74c-b3fa0d10b6ff.wav 22050
4b76f88e-d603-458c-bba5-90ee812ac803.wav 22050
dc74fb03-afdb-4397-a070-c7550c70466f.wav 22050
8fa7afff-1f0c-4e2d-aa1e-dd53c8093d45.wav 22050
37f7a00f-a684-4836-8053-d4f5b3cff69d.wav 22050
1ac99f28-436e-4bb5-bc81-92cf6d1456c9.wav 22050
549aae30-3c7f-4186-ab74-24a010da1117.wav 22050
ed4f67ac-da7b-4074-95e6-5c3ef5eda109.wav 22050
47644ed6-008a-4195-89f8-3d8c07ee20bc.wav 22050
0ac2dff9-3110-40e0-83a1-625fe65e666e.wav 22050
36934ee2-419e-4840-b995-a40a33c2e282.wav 22050
6061e4ab-09c8-4486-b01f-c1b36f6305de.wav 22050
6ef78268-e953-41c6-b9bc-694d1add311a.wav 22050
16a353bf-a18a-489e-95b6-f0445b9a93c7.wav 22050
726b22f1-ab6b-4f3b-bf15-eaf2a2fc5996.wav 22050
2d37820a-d559-420e-a636-97ee7ab99028.wav 22050
f62b271d-d2eb-4ac0-9965-745e83c38c80.wav 22050
4039a3da-8e9f-46a3-bdd3-b4468c7da36b.wav 22050
f5992cb3-cad7-4a0b-807a-7965f213dfb9.wav 22050
ffaad1b3-51c0-468a-a591-731b699ea095.wav 22050
fe6f77f1-3f96-4ad5-8b91-bed1c3e9c80d.wav 22050
a8f5acd3-c437-4cb1-a21a-6f45eb9bd930.wav 22050
6061e4ab-09c8-4486-b01f-c1b36f6305de.wav 22050
f1296797-fb17-4c72-8769-cd36c068ac3b.wav 22050
ff411147-43f1-406d-aaf4-be7531005a23.wav 22050
5a27432b-6989-4e2d-a314-6e5fca947aa9.wav 22050
72873cce-57d3-492d-8a0d-e596356c4e96.wav 22050
de091988-d313-4edd-8400-0d9ce596b6b7.wav 22050
888c4d20-9576-48a1-9b87-24de58cd5470.wav 22050
193a0645-080a-4376-bbad-e217162c31bc.wav 22050
f065df9a-81e5-407e-9638-38f62744f49e.wav 22050
98d14d56-6caa-46ab-b800-ccbaae98dd88.wav 22050
fa3ad0c4-fa10-41bf-a0db-1153ed874909.wav 22050
5ecb1617-dc0a-4bb2-8f3e-c01ad9d7ce53.wav 22050
7a309082-0c4f-4528-a695-33d68c95f3ae.wav 22050
7c583bb1-7048-48b9-af15-1df1bf9f9518.wav 22050
f8041487-3d22-4420-aaec-64534a620f35.wav 22050
4c0e5aab-465d-4f05-93eb-2b6f72b1233e.wav 22050
a5491187-ca92-4656-b82d-faf2365ee81e.wav 22050
75325909-b507-4d24-a9b1-ab2809967c01.wav 22050
f0183f8d-37ee-4a3a-a89e-317a62175279.wav 22050
b00ef500-ff9c-46f1-b667-00a9c523e846.wav 22050
2a7c97cc-5b75-422e-a164-cee24a047620.wav 22050
90cdd26f-4899-4c52-b3e2-dddc6b93721f.wav 22050
a06ae1f6-e69c-407d-8959-49c6e8094e10.wav 22050
34825224-95af-4de6-828b-26b1a9fd1d35.wav 22050
1ac99f28-436e-4bb5-bc81-92cf6d1456c9.wav 22050
4cc210c5-2f62-45b1-8125-f0d37e44af4f.wav 22050
2c520378-e237-4931-9cf3-ccdb4d2b9445.wav 22050
7e2b1d47-6fd3-4111-912b-f56b414ae82e.wav 22050
6c806871-dba8-4280-aad5-9b380718fa1e.wav 22050
8ccb5d49-88f9-4acf-8121-af1439cd96ee.wav 22050
b4049a18-36eb-46f7-a698-a15a44ac5fc1.wav 22050
2f687db9-7cc9-48fa-9621-33c31b2f7a90.wav 22050
8b2753d3-4da1-44be-9836-18f5abde45b0.wav 22050
8d579460-493a-4b6f-bc9c-4792a87b4dbe.wav 22050
b0982826-b643-4f2e-b685-85cf8ccd9a0a.wav 22050
fe6a0aad-e2d0-4915-9b52-0125eb9e3708.wav 22050
f0058326-f13b-4f22-99b1-b7dd766aa336.wav 22050
3a13b9d7-ea94-4654-bccb-2918836caefb.wav 22050
4b7b3435-6e1d-4ecd-9dd9-22cf6ef269d4.wav 22050
1a410f00-0725-47fc-b39b-7e55f22a3ebe.wav 22050
2f3dbcca-fdff-41aa-b6c3-4a0da57c53a2.wav 22050
d9c702a1-2b70-40d2-8267-4461c094a4ec.wav 22050
0e1199dd-6b99-46d3-835e-b211b8654512.wav 22050
d1778113-6a41-405c-affa-241d83160775.wav 22050
b03de603-3395-454d-918e-3117a7a8110a.wav 22050
14910d3b-7dd3-48c3-95f3-239b800c8357.wav 22050
fe6a0aad-e2d0-4915-9b52-0125eb9e3708.wav 22050
3cb33ae7-4719-4e53-9096-674aeb377a1f.wav 22050
ec003321-1315-4871-87c4-d26679498046.wav 22050
f1f4c69e-0f39-4149-b555-c7b989423d15.wav 22050
37af4eb1-fc94-47ec-ae55-548f9c295955.wav 22050
6cf5d652-b51d-4db4-9162-97a1b66648ad.wav 22050
6d450317-56d6-43e3-b050-36fc355ed2d0.wav 22050
16c28077-58c5-4819-99ae-d8d9e20fd819.wav 22050
8780bb33-6dc7-4728-8c57-b3971a2eb18c.wav 22050
df65f660-4cb1-4b21-bcf5-f64a2191e36b.wav 22050
877a7b3a-7ea0-4f9e-95ce-b11dfbbbae33.wav 22050
5c021313-f884-47f2-977b-ba4cdf719278.wav 22050
d551b815-e632-440b-98da-4b9b9cf56bd0.wav 22050
a06ae1f6-e69c-407d-8959-49c6e8094e10.wav 22050
cb7cefba-bf40-434b-acd9-165350b1939e.wav 22050
6a22795e-d7ec-4adb-b805-4cff6ed53c6f.wav 22050
5c021313-f884-47f2-977b-ba4cdf719278.wav 22050
b447bbb0-02df-4c33-bcc3-829efd7d3f6c.wav 22050
d7afaae2-154f-448f-84e2-ae84b1c84f54.wav 22050
fcc016be-ab8d-41f6-8bab-4a66f064a048.wav 22050
a61fb583-cfd3-4da8-abd2-2dd3b4ea3f16.wav 22050
d7afaae2-154f-448f-84e2-ae84b1c84f54.wav 22050
9998e8ab-4ab8-4770-beae-d0e6daac047a.wav 22050
7f5587b6-bb14-470f-a986-5d032ff30749.wav 22050
733637fc-6dc5-43fd-a5d0-fe371124d4bd.wav 22050
53a5b846-f199-4e45-9fc6-8360a67b0889.wav 22050
82095a2f-10fa-411b-9025-bfa4339b262b.wav 22050
41901912-4aaf-4b1a-bc95-8359087f4451.wav 22050
0d5c144c-a423-420c-96a4-5d9456c0b6e2.wav 22050
5753fa7c-2d15-478a-af35-cd0863f290da.wav 22050
009cd2c7-a5fb-4c51-8ff0-9e9d94a63359.wav 22050
cdc438c3-2dcc-40a6-909a-6efdd2858fdd.wav 22050
ad59525e-a562-474d-b0d4-220c186c8333.wav 22050
69191f7c-b66e-4e12-949b-4ef809101187.wav 22050
20274467-2ecd-465d-871a-3c7a99286433.wav 22050
1aed4a66-c342-4597-a1c8-2ab88c0b1613.wav 22050
457f84e7-9200-40a6-acda-f063e2ad29a6.wav 22050
b3181edb-2c87-473f-904d-dad173f96ab8.wav 22050
ca7e40da-1410-432c-899e-7219b37c20b2.wav 22050
ce33134c-4671-4e43-a801-df936e4aa230.wav 22050
4d1eee48-c37d-433a-89ca-875688f2014a.wav 22050
5396b980-59f4-4c40-becd-10d2754963bd.wav 22050
6a3c3239-6017-4d57-b29d-35b80701539c.wav 22050
06e5b215-b383-456e-abaa-d9b4875884ed.wav 22050
2ad54ff4-11d7-4aed-824a-4b49910517c2.wav 22050
8f104655-5a64-47fa-a1ee-d38a6c0aa760.wav 22050
745330b0-fe85-4bca-b9b5-d9d2bae6c258.wav 22050
7fbf0cd1-27c0-4c2e-80fe-b7d4f13bd6e5.wav 22050
74a335b3-f319-43d2-841b-68ce7c806786.wav 22050
059a7c84-d817-43bd-9097-2508254a54ce.wav 22050
07fffcfa-1463-4ec0-ae6d-4cb8c41a244a.wav 22050
Epoch [1/50], Step [10/138], Loss: 0.27303, Disc Loss: 3.71597, Dur Loss: 0.96925, CE Loss: 0.06391, Norm Loss: 0.57278, F0 Loss: 1.80698, LM Loss: 1.40955, Gen Loss: 6.72089, Sty Loss: 0.00000, Diff Loss: 0.00000, DiscLM Loss: 0.00000, GenLM Loss: 0.00000, SLoss: 0.00000, S2S Loss: 0.44512, Mono Loss: 0.04838
Time elasped: 20.510742664337158
cb4319a1-bb93-47d4-9e63-5a1056cdf11c.wav 22050
f8041487-3d22-4420-aaec-64534a620f35.wav 22050
f87c187b-a206-4959-9213-98399d6eedbb.wav 22050
d58c39dc-1f48-40ca-bb0f-83d48e69e17d.wav 22050
1ff8e643-4b28-4f28-b025-0a291efd2f88.wav 22050
009cd2c7-a5fb-4c51-8ff0-9e9d94a63359.wav 22050
9d4b576e-d854-465e-ade0-2f3b589fd332.wav 22050
ca5b7052-e268-49c2-8b09-b323dbffccc9.wav 22050
00571230-a5a1-4aaa-8901-90d26f35b166.wav 22050
2ad54ff4-11d7-4aed-824a-4b49910517c2.wav 22050
2822ecc0-4a4f-4483-9554-77a59cd9b84b.wav 22050
4c5ca7f7-00bd-4a94-b7a1-20e5b1f80349.wav 22050
0b027070-3a6d-462d-8e99-fe46c6412318.wav 22050
5a6e3080-66a7-4cbf-b179-d072bc67d239.wav 22050
73f23394-25e7-40a2-be04-1d45811dec35.wav 22050
af6970ff-07c7-4721-be9e-4c6af3814b72.wav 22050
a3eacc90-ebfc-45e1-ba8c-e3d5e1872526.wav 22050
e044bd8e-5f9b-45a1-b909-d8e38d79ffab.wav 22050
440288a1-30a8-41dc-ba16-6db81a5aae13.wav 22050
57a79b2e-52d6-490a-b76c-72426220c397.wav 22050
a8bb364b-30c6-42de-b5d7-90e52b16c4aa.wav 22050
73f23394-25e7-40a2-be04-1d45811dec35.wav 22050
55b6b5be-30ef-4e8a-bcb1-ae96b63586a5.wav 22050
b4049a18-36eb-46f7-a698-a15a44ac5fc1.wav 22050
f3d5fa2a-6dd3-4bf8-9dbd-51c8236e347e.wav 22050
3a3e9d0f-e328-483d-a8dd-9c1445f947e2.wav 22050
a68662ec-8c25-4756-8a81-669cdd28848c.wav 22050
4dea6284-4019-4776-98ce-3ca1f9297db9.wav 22050
d7ec7354-5163-4f0c-950a-e2927a971969.wav 22050
4e3319a2-aff0-4123-8d91-074e08e6042d.wav 22050
b0982826-b643-4f2e-b685-85cf8ccd9a0a.wav 22050
d5d48e65-b894-4ae2-baaa-82adf62a78fb.wav 22050
a94e4a78-154c-459d-a74c-7a43ee34363e.wav 22050
7fbf0cd1-27c0-4c2e-80fe-b7d4f13bd6e5.wav 22050
6a188fc5-3898-4bdf-b202-2017197c2d0c.wav 22050
490e70ed-c318-4100-8e95-e6a80f8f5310.wav 22050
2ec6e7ff-981f-42c5-af7d-7b6dc6299881.wav 22050
b69e0d16-afeb-4e18-8e23-e5221a7e7081.wav 22050
222f7f25-50f8-4ebe-b99e-15a3d641d4ec.wav 22050
ae7f0e22-b66f-40a8-b283-15b55625b37b.wav 22050
f6b5edf3-c24e-4b85-9553-46b7669c04c2.wav 22050
0f5472b4-5997-4314-aab7-3cfba6a8dd92.wav 22050
3587a361-303a-4250-a45a-61b4bf44ad03.wav 22050
6e57867c-456b-4d2f-9ff1-1b6ab8950b3a.wav 22050
7c3dab6f-cca0-4804-b27a-e0a3ef873031.wav 22050
222f7f25-50f8-4ebe-b99e-15a3d641d4ec.wav 22050
1342325e-b96d-41a8-83d4-8a0409167ee0.wav 22050
f3d5fa2a-6dd3-4bf8-9dbd-51c8236e347e.wav 22050
638c7874-906e-421c-a4a5-97e475729534.wav 22050
c6b10239-d54b-470b-a55d-f1d58d849a25.wav 22050
e5bd5b6d-e878-4c4d-9103-25cafb2e9c89.wav 22050
5f9f024f-eb04-4e1c-902a-582a0204464b.wav 22050
8480733b-4869-4150-abfa-6c19d5ce830c.wav 22050
5696bade-2f3c-40ed-ae93-f94c2a7e168f.wav 22050
3c325b72-3740-4474-81a0-86f30f4d576e.wav 22050
36bd13a3-a1f1-4d8a-8032-d6d5426aafdf.wav 22050
405fbbe4-8e35-47a6-adfd-168dc13a2df3.wav 22050
59ff72e3-0573-4a41-b029-57b7a0ddcd5e.wav 22050
544bbd34-2def-483c-990f-fc92a3b6c1f2.wav 22050
8cfe50a4-7a08-47f5-af0e-085cdfdabf8d.wav 22050
afa349bb-3c8d-4296-bbc4-bdacd8f694cb.wav 22050
50e59d22-94ca-4d69-9f42-30518ea91f0d.wav 22050
5840726c-e4a2-4760-88fb-c8b9bb514f9e.wav 22050
4ae8c2d2-7dd0-4b7a-907b-1b92830a69af.wav 22050
cb92fd2f-c75f-417b-a08f-21c462299a77.wav 22050
1a18e16c-45ad-4125-802a-a243ad5db785.wav 22050
ce90aefe-0d3e-4b61-9544-4193928a7cef.wav 22050
5f312b9b-f869-4009-a484-2205daf50e16.wav 22050
d5d48e65-b894-4ae2-baaa-82adf62a78fb.wav 22050
7e2b1d47-6fd3-4111-912b-f56b414ae82e.wav 22050
89199aa0-b8d4-4fd3-89cc-339e3edf024d.wav 22050
bb4475ba-28e1-45b9-8b8b-5cce5ecd188c.wav 22050
797a1978-c26f-4526-8633-59af3c52df76.wav 22050
26942d06-ba24-4e01-be44-64283a68679d.wav 22050
0fcdffec-800e-42fc-87e4-eafdf24b86f1.wav 22050
25392188-9fc3-47e5-884b-c8e76af689b8.wav 22050
b5883c5c-48a2-48a3-987c-a77f06ddc711.wav 22050
2d9bf320-3ea3-477a-acb3-900128e033a2.wav 22050
73b4e901-e383-415b-9d23-fa4a7d7647aa.wav 22050
457f84e7-9200-40a6-acda-f063e2ad29a6.wav 22050
36f328e4-5fe0-4a3e-a37f-d8681ce4dc1b.wav 22050
ab7cdb3e-650d-485e-a61a-a65162be4560.wav 22050
888ed655-0c81-47c8-bfaf-0ccd25dc6f5b.wav 22050
8ff518f1-4907-4f28-bd23-a0a55bd0f538.wav 22050
74369d60-31d7-45ec-ae46-2f224d9e85db.wav 22050
8409779d-8bd0-4af5-9b7f-0f2ac052a948.wav 22050
c23e4741-9194-4537-a169-205fdb36d3c8.wav 22050
5753fa7c-2d15-478a-af35-cd0863f290da.wav 22050
8f4df69a-0557-4b62-b824-0db83d80da0a.wav 22050
607359db-68cb-4901-9c4d-b6271dbf2a28.wav 22050
2a1fd5dc-d361-47dd-8eee-dbcaf5881782.wav 22050
62f747d4-f659-4b70-b7ff-fa8141473730.wav 22050
f1c6b415-95ae-457a-b181-0ab4c0022045.wav 22050
65bd34a5-dcc2-4895-aeb4-a66ba6b1082f.wav 22050
b69f1546-3e68-4764-844c-b157905fecb8.wav 22050
c6d4d1ba-ed41-4bf9-8379-c51a81cdd276.wav 22050
5a6e3080-66a7-4cbf-b179-d072bc67d239.wav 22050
6fbd1c50-6930-4067-b407-4e469c5d9efb.wav 22050
dae177ad-f85c-4fd9-8771-955040cb5211.wav 22050
5002e13d-4ca3-4fc7-82cc-7503e331f07c.wav 22050
3f995cfb-2669-414d-9d6a-1f85ca123379.wav 22050
1b3d21ea-ed4d-496b-9547-357fa5e88359.wav 22050
6ef78268-e953-41c6-b9bc-694d1add311a.wav 22050
82e0eda4-c433-416c-a469-93191e75a7b0.wav 22050
acfeced1-3140-4a40-8e35-c3f8e29d7391.wav 22050
b8fe1778-6246-40ce-958d-e2ccd4006e3f.wav 22050
b166d95f-8767-49c8-a12d-f7a7931c44c8.wav 22050
64501578-052a-498c-b672-6fb322a96928.wav 22050
62d79324-e910-40cf-9fa7-69526247e719.wav 22050
ba762c7f-3a35-444a-b775-d9083a4c3afc.wav 22050
aea26b7d-70a5-47db-bc95-0dd5646ceab4.wav 22050
00571230-a5a1-4aaa-8901-90d26f35b166.wav 22050
83fd8b50-0cd2-4dd7-b3eb-bf0f6d9438e2.wav 22050
9bf291ce-3c04-4930-ba89-ea540fd26c44.wav 22050
f4b75894-31cc-4b85-81c8-eadc333d95cd.wav 22050
48d8e567-b75a-4cb5-8609-0e9aca0dbea2.wav 22050
5e1d8ae0-3230-4028-bdae-8be5ed88e40d.wav 22050
3146b455-ab75-4e98-8c65-b4a17a6ee3a4.wav 22050
2992c0f6-f296-44ba-a9bf-3f4aefa7966a.wav 22050
e4c15884-d603-4548-ae26-3d0355b85e23.wav 22050
Epoch [1/50], Step [20/138], Loss: 0.30853, Disc Loss: 3.90680, Dur Loss: 0.86952, CE Loss: 0.04484, Norm Loss: 0.89577, F0 Loss: 2.21516, LM Loss: 1.15780, Gen Loss: 6.54385, Sty Loss: 0.00000, Diff Loss: 0.00000, DiscLM Loss: 0.00000, GenLM Loss: 0.00000, SLoss: 0.00000, S2S Loss: 0.76925, Mono Loss: 0.05646
Time elasped: 37.099867820739746
491dc1ad-69a9-444f-8471-86b0ca55c0c1.wav 22050
638c7874-906e-421c-a4a5-97e475729534.wav 22050
dba48bd0-b6f1-467d-962c-1968fb34c899.wav 22050
dd82bcd0-004f-47bb-83d5-383b88f86c72.wav 22050
4d856a6c-cd50-4f5a-9db7-996885044756.wav 22050
613f07ce-14ce-4d2d-bddf-6a970302113c.wav 22050
4945d808-39c7-4dda-b17a-6726324da3fd.wav 22050
dc74fb03-afdb-4397-a070-c7550c70466f.wav 22050
dba52661-d5ec-4ea9-9cbf-00f294703ecd.wav 22050
7a277473-7475-4dc3-8bce-f5dd3026850a.wav 22050
5f9f024f-eb04-4e1c-902a-582a0204464b.wav 22050
5f90d878-7ac1-4a6d-bba9-a5d8e6df3c9b.wav 22050
0bec935b-8569-4a85-a6e5-b75244ab74c4.wav 22050
58ed7c94-8a33-4fdf-9340-f6ae69e4c326.wav 22050
c783f330-4b4f-4c8d-a3ce-2a4660725a22.wav 22050
8ff518f1-4907-4f28-bd23-a0a55bd0f538.wav 22050
44eb475c-ba6b-4dcb-8082-27bce7cc83e4.wav 22050
20274467-2ecd-465d-871a-3c7a99286433.wav 22050
2fe365bd-10cb-4363-9854-d4c4dfbf9397.wav 22050
d9d3e55b-e620-45da-82df-898db80246f5.wav 22050
60717a2c-efd6-4de9-949d-b3f82e69715a.wav 22050
502fa5b9-167c-4ded-86ad-da0597d699bf.wav 22050
37af4eb1-fc94-47ec-ae55-548f9c295955.wav 22050
db2fd941-5734-48b5-a021-cbb6a30881cd.wav 22050
Traceback (most recent call last):
  File "train_finetune.py", line 707, in <module>
    main()
  File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "train_finetune.py", line 302, in main
    s = model.predictor_encoder(mel.unsqueeze(0).unsqueeze(1))
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 183, in forward
    return self.module(*inputs[0], **module_kwargs[0])
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/shared/StyleTTS2/models.py", line 160, in forward
    h = self.shared(x)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/nn/modules/container.py", line 215, in forward
    input = module(input)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1568, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 460, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 456, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: Calculated padded input size per channel: (5 x 4). Kernel size: (5 x 5). Kernel size can't be greater than actual input size

config:

log_dir: "Models/allison-1125"
save_freq: 5
log_interval: 10
device: "cuda"
epochs: 50 # number of finetuning epoch (1 hour of data)
batch_size: 6
max_len: 100 # maximum number of frames
pretrained_model: "Models/LibriTTS/epochs_2nd_00020.pth"
second_stage_load_pretrained: true # set to true if the pre-trained model is for 2nd stage
load_only_params: true # set to true if do not want to load epoch numbers and optimizer parameters

F0_path: "Utils/JDC/bst.t7"
ASR_config: "Utils/ASR/config.yml"
ASR_path: "Utils/ASR/epoch_00080.pth"
PLBERT_dir: 'Utils/PLBERT/'

data_params:
  train_data: "/home/ubuntu/shared/allison-1125/train_list.txt"
  val_data: "/home/ubuntu/shared/allison-1125/val_list.txt"
  root_path: "/home/ubuntu/shared/allison-1125/wavs"
  OOD_data: "Data/OOD_texts.txt"
  min_length: 50 # sample until texts with this size are obtained for OOD texts

preprocess_params:
  sr: 24000
  spect_params:
    n_fft: 2048
    win_length: 1200
    hop_length: 300

model_params:
  multispeaker: true

  dim_in: 64 
  hidden_dim: 512
  max_conv_dim: 512
  n_layer: 3
  n_mels: 80

  n_token: 178 # number of phoneme tokens
  max_dur: 50 # maximum duration of a single phoneme
  style_dim: 128 # style vector size

  dropout: 0.2

  # config for decoder
  decoder: 
      type: 'hifigan' # either hifigan or istftnet
      resblock_kernel_sizes: [3,7,11]
      upsample_rates :  [10,5,3,2]
      upsample_initial_channel: 512
      resblock_dilation_sizes: [[1,3,5], [1,3,5], [1,3,5]]
      upsample_kernel_sizes: [20,10,6,4]

  # speech language model config
  slm:
      model: 'microsoft/wavlm-base-plus'
      sr: 16000 # sampling rate of SLM
      hidden: 768 # hidden size of SLM
      nlayers: 13 # number of layers of SLM
      initial_channel: 64 # initial channels of SLM discriminator head

  # style diffusion model config
  diffusion:
    embedding_mask_proba: 0.1
    # transformer config
    transformer:
      num_layers: 3
      num_heads: 8
      head_features: 64
      multiplier: 2

    # diffusion distribution config
    dist:
      sigma_data: 0.2 # placeholder for estimate_sigma_data set to false
      estimate_sigma_data: true # estimate sigma_data from the current batch if set to true
      mean: -3.0
      std: 1.0

loss_params:
    lambda_mel: 5. # mel reconstruction loss
    lambda_gen: 1. # generator loss
    lambda_slm: 1. # slm feature matching loss

    lambda_mono: 1. # monotonic alignment loss (TMA)
    lambda_s2s: 1. # sequence-to-sequence loss (TMA)

    lambda_F0: 1. # F0 reconstruction loss
    lambda_norm: 1. # norm reconstruction loss
    lambda_dur: 1. # duration loss
    lambda_ce: 20. # duration predictor probability output CE loss
    lambda_sty: 1. # style reconstruction loss
    lambda_diff: 1. # score matching loss

    diff_epoch: 10 # style diffusion starting epoch
    joint_epoch: 110 # joint training starting epoch

optimizer_params:
  lr: 0.0001 # general learning rate
  bert_lr: 0.00001 # learning rate for PLBERT
  ft_lr: 0.0001 # learning rate for acoustic modules

slmadv_params:
  min_len: 400 # minimum length of samples
  max_len: 500 # maximum length of samples
  batch_percentage: 0.5 # to prevent out of memory, only use half of the original batch size
  iter: 10 # update the discriminator every this iterations of generator update
  thresh: 5 # gradient norm above which the gradient is scaled
  scale: 0.01 # gradient scaling factor for predictors from SLM discriminators
  sig: 1.5 # sigma for differentiable duration modeling

The training data looks like this:

1342325e-b96d-41a8-83d4-8a0409167ee0.wav|aɪ nˈoʊ nˈʌθɪŋ ɐbˈaʊt hˌɪm , aɪ mˈɜːmɚ , tɹˈaɪɪŋ ænd fˈeɪlɪŋ tə səpɹˈɛs maɪ ɹˈaɪzɪŋ pˈænɪk .|0
...
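
For reference, each metadata line is pipe-delimited: wav filename, phonemized text, and speaker ID. A small sketch of how such a list could be parsed (the function and field names are my own, not from the repo):

```python
from pathlib import Path

def parse_train_list(path):
    """Parse a StyleTTS2-style metadata file: filename|phonemes|speaker_id."""
    entries = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        wav_name, phonemes, speaker = line.split("|")
        entries.append({"wav": wav_name, "text": phonemes, "speaker": int(speaker)})
    return entries

# Example with the path from the posted config:
# entries = parse_train_list("/home/ubuntu/shared/allison-1125/train_list.txt")
```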
yl4579 commented 11 months ago

See #48

yl4579 commented 11 months ago

I think there's a bug in the code that can't handle inputs shorter than one second. I have edited the README: each data sample must be at least one second long.
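
Based on that answer, one practical workaround is to drop (or re-record) clips shorter than one second before fine-tuning. A hedged sketch, assuming the soundfile package is available; the one-second threshold comes from the comment above, the sr/hop_length values from the posted config, and the helper name is my own:

```python
import soundfile as sf
from pathlib import Path

SR = 24000          # preprocess_params.sr from the config
HOP_LENGTH = 300    # preprocess_params.spect_params.hop_length
MIN_SECONDS = 1.0   # per the maintainer's note: samples should be at least one second

def filter_short_clips(train_list, wav_root, out_list):
    """Write a copy of the metadata file keeping only clips >= MIN_SECONDS long."""
    kept, dropped = [], []
    for line in Path(train_list).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        wav_name = line.split("|")[0]
        info = sf.info(str(Path(wav_root) / wav_name))
        seconds = info.frames / info.samplerate
        mel_frames = int(seconds * SR / HOP_LENGTH)  # ~80 mel frames per second with these settings
        (kept if seconds >= MIN_SECONDS else dropped).append((line, seconds, mel_frames))
    Path(out_list).write_text("\n".join(l for l, _, _ in kept) + "\n", encoding="utf-8")
    return kept, dropped

# Hypothetical usage with the paths from the posted config:
# kept, dropped = filter_short_clips(
#     "/home/ubuntu/shared/allison-1125/train_list.txt",
#     "/home/ubuntu/shared/allison-1125/wavs",
#     "/home/ubuntu/shared/allison-1125/train_list_filtered.txt",
# )
```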