How to make predictions for individual audio files?

tedfeng424 commented 3 years ago

Hi,

In predict.py, you calculate the pairwise similarity and EER using the test files. How can we generate a predicted label if we are given just one audio file instead of pairs of audio files?

Thanks

tedfeng424 commented 3 years ago

I know this model is for speaker verification. Is it possible to use it for speaker recognition if we have fewer speakers?

zabir-nabil commented 3 years ago

Hi, the repository is designed to make speaker verification easy. One thing you can do is keep a list of anchor audios, one per known speaker, and compare the query audio against each anchor. The anchor with the highest similarity score (equivalently, the lowest distance) should be the target speaker. If you have a closed set of speakers and you train on that set, you can take the softmax output directly to find the speaker.

Yes, it should work for a small set of speakers too.
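
The anchor comparison described above can be sketched roughly as follows. This is only an illustration, assuming the embeddings are already available as NumPy vectors (for example, from the eval-mode model). The names `cosine_similarity` and `identify_speaker` and the toy 3-D vectors are hypothetical and not part of the repository. The query is assigned to the anchor with the highest cosine similarity, which is the same as the smallest cosine distance.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two 1-D embedding vectors."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def identify_speaker(query, anchors):
    """Pick the anchor speaker most similar to the query embedding.

    anchors: dict mapping speaker name -> anchor embedding (hypothetical layout).
    Returns (best_speaker, scores), where scores maps each speaker to its similarity.
    """
    scores = {name: cosine_similarity(query, emb) for name, emb in anchors.items()}
    best = max(scores, key=scores.get)  # highest similarity wins
    return best, scores

# Toy vectors standing in for real speaker embeddings.
anchors = {
    "alice": np.array([1.0, 0.0, 0.0]),
    "bob": np.array([0.0, 1.0, 0.0]),
}
query = np.array([0.9, 0.1, 0.0])

best, scores = identify_speaker(query, anchors)
print(best, scores)
```

With real data, `query` and the anchor vectors would be replaced by the embeddings the model produces for each audio file.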

tedfeng424 commented 3 years ago

Thanks for the reply!

I do have a closed set of speakers and I'll try taking the softmax output. As for the speaker anchor audios you mentioned, is it better to have one audio per speaker, or to have multiple audios per speaker and calculate the mean similarity?

Thank you!

zabir-nabil commented 3 years ago

It's always better to take the mean from multiple utterances.
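
A minimal sketch of averaging over several utterances per speaker, assuming each enrolled speaker has a small array of embeddings. The helper names `l2_normalize` and `mean_similarity_per_speaker` and the toy 2-D vectors are made up for illustration and do not come from the repository. The query is scored against every enrolled utterance, and the scores are averaged per speaker.

```python
import numpy as np

def l2_normalize(x):
    """Scale vectors to unit length along the last axis."""
    x = np.asarray(x, dtype=np.float64)
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def mean_similarity_per_speaker(query, enrolled):
    """Average cosine similarity between the query and each speaker's utterances.

    enrolled: dict mapping speaker name -> array of shape (n_utterances, dim).
    """
    q = l2_normalize(query)
    # For unit vectors, the dot product equals the cosine similarity.
    return {name: float(np.mean(l2_normalize(embs) @ q))
            for name, embs in enrolled.items()}

# Toy vectors standing in for real speaker embeddings.
enrolled = {
    "alice": np.array([[1.0, 0.0], [0.8, 0.2]]),
    "bob": np.array([[0.0, 1.0], [0.1, 0.9]]),
}
query = np.array([0.9, 0.1])

scores = mean_similarity_per_speaker(query, enrolled)
predicted = max(scores, key=scores.get)
print(predicted, scores)
```

An alternative with the same intent is to average each speaker's embeddings into a single centroid first and compare the query against that. Which of the two works better for a given dataset is something to check empirically.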

tedfeng424 commented 3 years ago

if loss == 'softmax':
    y = keras.layers.Dense(num_class, activation='softmax',
                           kernel_initializer='orthogonal',
                           use_bias=False, trainable=True,
                           kernel_regularizer=keras.regularizers.l2(weight_decay),
                           bias_regularizer=keras.regularizers.l2(weight_decay),
                           name='prediction')(x)
    trnloss = 'categorical_crossentropy'

elif loss == 'amsoftmax':
    x_l2 = keras.layers.Lambda(lambda x: K.l2_normalize(x, 1))(x)
    y = keras.layers.Dense(num_class,
                           kernel_initializer='orthogonal',
                           use_bias=False, trainable=True,
                           kernel_constraint=keras.constraints.unit_norm(),
                           kernel_regularizer=keras.regularizers.l2(weight_decay),
                           bias_regularizer=keras.regularizers.l2(weight_decay),
                           name='prediction')(x_l2)
    trnloss = amsoftmax_loss

else:
    raise IOError('==> unknown loss.')

if mode == 'eval':
    y = keras.layers.Lambda(lambda x: keras.backend.l2_normalize(x, 1))(x)

model = keras.models.Model(inputs, y, name='vggvox_resnet2D_{}_{}'.format(loss, aggregation))

if mode == 'train':
    if mgpu > 1:
        model = ModelMGPU(model, gpus=mgpu)
    # set up optimizer.
    if args.optimizer == 'adam':  opt = keras.optimizers.Adam(lr=1e-3)
    elif args.optimizer =='sgd':  opt = keras.optimizers.SGD(lr=0.1, momentum=0.9, decay=0.0, nesterov=True)
    else: raise IOError('==> unknown optimizer type')
    model.compile(optimizer=opt, loss=trnloss, metrics=['acc'])
return model

I am trying to take the softmax output of the model, and I noticed that the model has a softmax output when the mode is "train". Can I just use this to find the speaker after training on a closed set of speakers? Or should I add another softmax layer after the normalization layer when the mode is "eval"?

Thanks!

(Sorry to bother you again. I'm new to neural networks and TensorFlow, so I may have a lot of questions.)

zabir-nabil commented 3 years ago

How many speakers do you have in your closed set? I don't think you will get reasonable accuracy if you have a very small set of speakers and treat the problem directly as classification.

Usually, the model needs to be trained on a large number of speakers (around 1000 or more). Then we measure the cosine distance between the embeddings of two audio clips to decide whether they come from the same speaker.
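
The verification step can be sketched like this, assuming two embeddings are already available as vectors. The function names are hypothetical, and the threshold of 0.5 is an arbitrary placeholder: in practice it would be tuned on held-out pairs, for example at the equal error rate operating point.

```python
import numpy as np

def cosine_distance(a, b):
    """Cosine distance (1 - cosine similarity) between two 1-D embeddings."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def same_speaker(emb_a, emb_b, threshold=0.5):
    """Accept the pair as one speaker if the distance is below the threshold.

    threshold=0.5 is a placeholder, not a value taken from the repository.
    """
    return cosine_distance(emb_a, emb_b) < threshold

# Toy vectors standing in for real speaker embeddings.
close_pair = same_speaker([1.0, 0.0], [0.9, 0.1])
far_pair = same_speaker([1.0, 0.0], [0.0, 1.0])
print(close_pair, far_pair)
```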

You can check out the predict.py file. You can also check out the issues in the original repository. https://github.com/WeidiXie/VGG-Speaker-Recognition

tedfeng424 commented 3 years ago

My dataset has 6 speakers and a total of 9000 audio files. Thanks for the reply, I’ll go check out the original repo!

zabir-nabil / tf2-speaker-recognition
