wenet-e2e / wespeaker

Research and Production Oriented Speaker Verification, Recognition and Diarization Toolkit
Apache License 2.0

Finetuning on other dataset #168

Closed: Spectra456 closed this issue 1 year ago

Spectra456 commented 1 year ago

Hi. Do I understand correctly that it's not possible to take a model trained on a larger dataset and then use its checkpoint to continue training on another, smaller dataset? When I try this, I get the following error:

```
  File "wespeaker/wespeaker/utils/checkpoint.py", line 21, in load_checkpoint
    model.load_state_dict(checkpoint, strict=False)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1483, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for ResNet:
        size mismatch for projection.weight: copying a param with shape torch.Size([26685, 256]) from checkpoint, the shape in current model is torch.Size([7548, 256]).
```

I understand that this is the output shape, but I don't understand where these numbers (26685 and 7548) come from. My first guess was that it's the number of speakers, but I don't have that many speakers in my datasets.
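
For reference, here is a minimal standalone reproduction of this failure mode (nothing wespeaker-specific; the layer and sizes are just illustrative): PyTorch's `load_state_dict(strict=False)` tolerates missing or unexpected keys, but a shape mismatch still raises.

```python
import torch
import torch.nn as nn

model = nn.Linear(256, 7548)                 # current model: 7548-way projection
ckpt = {"weight": torch.zeros(26685, 256),   # checkpoint trained with 26685 classes
        "bias": torch.zeros(26685)}

# strict=False only forgives missing/unexpected keys; a *shape* mismatch
# still raises "size mismatch for weight ..." exactly as in the trace above.
model.load_state_dict(ckpt, strict=False)
```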

JiJiJiang commented 1 year ago

We support two types of "checkpoints" in our code:

  1. Finetuning: the model was trained on a large dataset and you want to finetune it on a small dataset. Set model_init to /the/path/to/large/dataset/model in your conf/xxx.yaml (see the config sketch below).
  2. Resuming: training was aborted unexpectedly and you want to restart from a specific epoch.

The first strategy trains the projection layer from scratch because the training spk_num changes, while the second reuses it because the training set is the same. For your case, I think you should use the first strategy, and remember to set the model_init parameter.
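
As a concrete illustration of the first strategy (a sketch only; the exact keys around it depend on your recipe), the relevant line in conf/xxx.yaml would look like:

```yaml
# conf/xxx.yaml (other keys omitted)
model_init: /the/path/to/large/dataset/model
```

Conceptually, this amounts to copying only the tensors whose shapes match the new model and leaving the rest (the projection layer) at their fresh initialization. A minimal sketch of that idea, not wespeaker's actual load_checkpoint:

```python
import torch

def init_from_pretrained(model, ckpt_path):
    """Sketch of model_init-style loading (hypothetical helper, not
    wespeaker's API): copy tensors whose shapes match the current model;
    skip the rest (e.g. projection.*), which stay randomly initialized."""
    ckpt = torch.load(ckpt_path, map_location="cpu")
    own = model.state_dict()
    matched = {k: v for k, v in ckpt.items()
               if k in own and v.shape == own[k].shape}
    model.load_state_dict(matched, strict=False)
    return sorted(set(own) - set(matched))  # keys trained from scratch
```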

For your 2nd question, I guess you have used speed perturbation augmentation in your training (the default is true). Each perturbed speed is regarded as a new speaker because the pitch changes, so the training spk_num is 3 * the number of speakers in your dataset.
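
Consistent with this, both projection sizes in your error are divisible by three (a quick sanity check, assuming the usual 3-way speed perturbation, e.g. factors 0.9/1.0/1.1):

```python
# sizes taken from the size-mismatch error above
assert 26685 == 3 * 8895   # large training set: 8895 original speakers
assert 7548 == 3 * 2516    # small training set: 2516 original speakers
```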