Thank you for your question. We only save the model part before the speaker embedding layer in WeSpeaker; the classifier (from embedding to speaker label) is not saved during training. For your case, you should modify the code in save_checkpoint to also save the classifier, otherwise the classifier would have random parameters and the final results would therefore be random as well.
@didi1233 Have you solved this problem? My answer above was not correct. We do save the whole speaker model, including the classifier you need, but during extraction we do not load the classifier part.
For your case, if you train the model with the naive softmax loss, I think it should work well with our training pipeline. Did you find out why it did not work?
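To check this concretely, you can inspect the checkpoint yourself. A minimal sketch, assuming the checkpoint is a plain state_dict saved with `torch.save` and that the classifier head is registered under a name like `projection`:

```python
import torch

# load the raw state_dict of a training checkpoint
state_dict = torch.load('model_10.pt', map_location='cpu')

# the classifier (projection) weights are present in the file ...
print([k for k in state_dict if 'projection' in k])

# ... but the default extraction model has no 'projection' module, so a
# non-strict load would simply report them as unexpected keys and skip them:
# missing, unexpected = model.load_state_dict(state_dict, strict=False)
```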
Hello, I have resolved this issue. WeSpeaker does indeed save the classifier part. My problem was caused by incorrectly using a custom `CombinedModel` class, whose module naming did not match the checkpoint, so the parameters of the saved classification layer failed to load correctly. The correct example code is as follows:
```python
import fire
import torch
import torchaudio
import torchaudio.compliance.kaldi as kaldi

# local copies of the corresponding wespeaker modules
from speaker_model import get_speaker_model
from checkpoint import load_checkpoint
from utils import parse_config_or_kwargs
from projections import get_projection


def compute_fbank(wav_path,
                  num_mel_bins=40,
                  frame_length=25,
                  frame_shift=10,
                  dither=0.0):
    """Extract fbank, similar to the one in wespeaker.dataset.processor,
    while integrating the wave reading and CMN.
    """
    waveform, sample_rate = torchaudio.load(wav_path)
    waveform = waveform * (1 << 15)
    mat = kaldi.fbank(waveform,
                      num_mel_bins=num_mel_bins,
                      frame_length=frame_length,
                      frame_shift=frame_shift,
                      dither=dither,
                      sample_frequency=sample_rate,
                      window_type='hamming',
                      use_energy=False)
    # CMN, without CVN
    mat = mat - torch.mean(mat, dim=0)
    return mat


def extract(config='config.yaml', **kwargs):
    configs = parse_config_or_kwargs(config, **kwargs)
    torch.backends.cudnn.benchmark = False

    model_path = 'model_10.pt'
    model = get_speaker_model(configs['model'])(**configs['model_args'])

    # rebuild the projection (classifier) head with the same settings as training
    configs['projection_args']['embed_dim'] = configs['model_args']['embed_dim']
    configs['projection_args']['num_class'] = 2  # your class num!!!
    configs['projection_args']['do_lm'] = configs.get('do_lm', False)
    projection = get_projection(configs['projection_args'])

    # register the classifier under the same module name used during training,
    # so that its parameters are loaded correctly from the checkpoint
    model.add_module("projection", projection)

    device = torch.device("cuda")
    model.to(device).eval()
    load_checkpoint(model, model_path)
    print(model)

    with torch.no_grad():
        feats = compute_fbank('your.wav')
        feats = feats.unsqueeze(0)  # add batch dimension
        features = feats.float().to(device)
        outputs = model(features)
        embeds = outputs[-1] if isinstance(outputs, tuple) else outputs  # (B, F)
        outputs = projection(embeds).cpu().detach()
        print(outputs)


if __name__ == '__main__':
    fire.Fire(extract)
```
Additionally, it is necessary to set `label = None` in https://github.com/wenet-e2e/wespeaker/blob/master/wespeaker/models/projections.py#L483, so that the projection head can be called without a label at inference time.
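For context, the margin-based projection heads take the label during training in order to apply the margin, so at inference they need a label-free path. A simplified sketch of the idea (not the actual wespeaker implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MarginProjectionSketch(nn.Module):
    """Simplified margin-based head: cosine logits; the margin is applied only
    when a label is given (training); plain logits when label is None."""

    def __init__(self, embed_dim, num_class, margin=0.2):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_class, embed_dim))
        self.margin = margin

    def forward(self, embed, label=None):
        # cosine similarity between normalized embeddings and class weights
        logits = F.linear(F.normalize(embed), F.normalize(self.weight))
        if label is None:
            # inference: no label available, return margin-free logits
            return logits
        # training: penalize the target-class logit (AM-softmax style)
        one_hot = F.one_hot(label, logits.size(1)).float()
        return logits - self.margin * one_hot
```

With `label = None`, the head just returns the margin-free cosine logits, which is what the script above prints.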
Attached is the binary classification model for noise and speech, along with the config.yaml file. I hope this will be helpful to those who need it.
Best regards! example.zip
I see. Thanks for your answer.
Dear WeSpeaker Team,
I am trying to use WeSpeaker for a classification task. I have trained a three-class model using ResNet and a Linear classifier. However, I would like the exported ONNX model to output the final class probabilities instead of the embeddings. I attempted to use the following code, but the inference results are completely wrong. Could you please help me identify the problem? Thank you.
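For reference, one possible approach (a hypothetical sketch, not the poster's original code) is to wrap the backbone and the loaded classifier in a single module whose forward ends with a softmax, and export that wrapper; `model` and `projection` are assumed to be built and loaded as in the extraction script above:

```python
import torch
import torch.nn as nn

class ClassifierWrapper(nn.Module):
    """Hypothetical wrapper: fbank features in, class probabilities out."""

    def __init__(self, model, projection):
        super().__init__()
        self.model = model
        self.projection = projection

    def forward(self, feats):
        outputs = self.model(feats)
        embeds = outputs[-1] if isinstance(outputs, tuple) else outputs
        return torch.softmax(self.projection(embeds), dim=-1)

# `model` and `projection` as loaded in the extraction script above,
# moved to CPU here so the dummy input matches the model device
wrapper = ClassifierWrapper(model.cpu(), projection.cpu()).eval()
dummy = torch.randn(1, 200, 40)  # (batch, frames, num_mel_bins)
torch.onnx.export(wrapper, dummy, 'classifier.onnx',
                  input_names=['feats'], output_names=['probs'],
                  dynamic_axes={'feats': {0: 'B', 1: 'T'},
                                'probs': {0: 'B'}})
```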