PyTorch implementation of the Densely Connected Time Delay Neural Network (D-TDNN) described in our paper "Densely Connected Time Delay Neural Network for Speaker Verification" (INTERSPEECH 2020).
[2023-05-04] 3D-Speaker supports training of the CAM++ model and can easily be extended to support training of the original D-TDNN and CAM models. The project also releases a Chinese speaker embedding model trained on 200k speakers and an English speaker embedding model trained on VoxCeleb.
[2023-03-04] CAM++ outperforms the popular ECAPA-TDNN and ResNet34 systems while having lower computational complexity and faster inference.
H. Wang, S. Zheng, Y. Chen, L. Cheng, and Q. Chen, "CAM++: A Fast and Efficient Network for Speaker Verification Using Context-Aware Masking"
EER (%) / minDCF:

| System | VoxCeleb1-E | VoxCeleb1-H | CN-Celeb |
|---|---|---|---|
| ECAPA-TDNN | 1.07/0.1185 | 1.98/0.1956 | 7.45/0.4127 |
| D-TDNN | 1.63/0.1748 | 2.86/0.2571 | 8.41/0.4683 |
| CAM | 1.18/0.1257* | 2.15/0.1966* | - |
| CAM++ | 0.89/0.0995 | 1.76/0.1729 | 6.78/0.3830 |
[2021-09-05] TimeDelay is replaced by Conv1d by default, since convolution is better optimized in most deep learning frameworks (note: the pretrained models were converted directly from the old ones, so the results may differ slightly from those in the paper).
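For illustration, a time-delay layer over a symmetric context can be written as a dilated 1-D convolution, and the dense connectivity in D-TDNN concatenates each layer's output with its input along the channel axis. A minimal PyTorch sketch of this idea (simplified layer structure and illustrative hyper-parameters, not the exact modules in this repository):

```python
import torch
import torch.nn as nn

class TDNNLayer(nn.Module):
    """A TDNN layer expressed as a dilated 1-D convolution over frames."""
    def __init__(self, in_channels, out_channels, context_size=3, dilation=1):
        super().__init__()
        self.conv = nn.Conv1d(in_channels, out_channels, kernel_size=context_size,
                              dilation=dilation,
                              padding=(context_size - 1) // 2 * dilation)
        self.bn = nn.BatchNorm1d(out_channels)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):  # x: (batch, channels, frames)
        return self.act(self.bn(self.conv(x)))

class DenseTDNNBlock(nn.Module):
    """Dense connectivity: each layer sees the concatenation of all earlier feature maps."""
    def __init__(self, in_channels, growth_rate=64, num_layers=6, **kwargs):
        super().__init__()
        self.layers = nn.ModuleList(
            TDNNLayer(in_channels + i * growth_rate, growth_rate, **kwargs)
            for i in range(num_layers)
        )

    def forward(self, x):
        for layer in self.layers:
            x = torch.cat([x, layer(x)], dim=1)  # concatenate along the channel axis
        return x

feats = torch.randn(8, 30, 200)         # (batch, 30-dim MFCC, frames)
print(DenseTDNNBlock(30)(feats).shape)  # torch.Size([8, 414, 200])
```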
[2021-08-28] D-TDNN and D-TDNN-SS outperform the state-of-the-art system on the AP20-OLR-dialect-task of the Oriental Language Recognition (OLR) Challenge 2020 (WeChat article / paper), showing their potential for other speech processing tasks.
[2021-02-01] CAM adopts the D-TDNN backbone and enhances it with context-aware masking (a minimal sketch of the masking idea follows the table below).
Y.-Q. Yu, S. Zheng, H. Suo, Y. Lei, and W.-J. Li, "CAM: Context-Aware Masking for Robust Speaker Verification" (ICASSP 2021)
| System | VoxCeleb1-E | VoxCeleb1-H |
|---|---|---|
| CAM | 1.18/0.1257 | 2.15/0.1966 |
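A minimal sketch of the masking idea, assuming a simple design in which a channel-wise mask predicted from a pooled context vector is applied to the frame-level features; the actual CAM/CAM++ modules differ in detail:

```python
import torch
import torch.nn as nn

class ContextAwareMask(nn.Module):
    """Predict a channel-wise mask from pooled context and apply it to frame features."""
    def __init__(self, channels, bottleneck=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(channels, bottleneck),
            nn.ReLU(inplace=True),
            nn.Linear(bottleneck, channels),
            nn.Sigmoid(),                      # mask values in (0, 1)
        )

    def forward(self, x):                      # x: (batch, channels, frames)
        context = x.mean(dim=2)                # average pooling over frames
        mask = self.net(context).unsqueeze(2)  # (batch, channels, 1)
        return x * mask                        # emphasize speaker cues, suppress the rest

x = torch.randn(4, 512, 200)
print(ContextAwareMask(512)(x).shape)          # torch.Size([4, 512, 200])
```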
We provide pretrained models that can be used for tasks such as speaker verification.
You can either use the Kaldi toolkit:

- Place `prepare_voxceleb1_test.sh` under `$kaldi_root/egs/voxceleb/v2` and change `$datadir` and `$voxceleb1_root` in it.
- Run `chmod +x prepare_voxceleb1_test.sh && ./prepare_voxceleb1_test.sh` to generate 30-dim MFCCs.
- Place the `trials` file under `$datadir/test_no_sil`.

Or check out the `kaldifeat` branch if you do not want to install Kaldi.
`python evaluate.py --root $datadir/test_no_sil --model D-TDNN --checkpoint dtdnn.pth --device cuda`
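For reference, a minimal sketch of how a cosine back-end (used for the D-TDNN results below) scores a trial and how EER can be derived from the scores; the embedding lookup and trial parsing here are hypothetical, not the logic of `evaluate.py`:

```python
import numpy as np
from sklearn.metrics import roc_curve

def cosine_score(emb1, emb2):
    """Cosine similarity between two speaker embeddings."""
    return float(np.dot(emb1, emb2) / (np.linalg.norm(emb1) * np.linalg.norm(emb2)))

def compute_eer(scores, labels):
    """Equal error rate: the point where false-acceptance and false-rejection rates meet."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))
    return (fpr[idx] + fnr[idx]) / 2

# Hypothetical usage, with `embs` mapping utterance ids to embeddings and
# `trials` holding (utt1, utt2, is_target) tuples parsed from the trials file:
# scores = [cosine_score(embs[u1], embs[u2]) for u1, u2, _ in trials]
# labels = [int(t) for _, _, t in trials]
# print(f"EER: {100 * compute_eer(np.array(scores), np.array(labels)):.2f}%")
```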
Results on VoxCeleb1-O:
| Model | Emb. | Params (M) | Loss | Backend | EER (%) | DCF_0.01 | DCF_0.001 |
|---|---|---|---|---|---|---|---|
| TDNN | 512 | 4.2 | Softmax | PLDA | 2.34 | 0.28 | 0.38 |
| E-TDNN | 512 | 6.1 | Softmax | PLDA | 2.08 | 0.26 | 0.41 |
| F-TDNN | 512 | 12.4 | Softmax | PLDA | 1.89 | 0.21 | 0.29 |
| D-TDNN | 512 | 2.8 | Softmax | Cosine | 1.81 | 0.20 | 0.28 |
| D-TDNN-SS (0) | 512 | 3.0 | Softmax | Cosine | 1.55 | 0.20 | 0.30 |
| D-TDNN-SS | 512 | 3.5 | Softmax | Cosine | 1.41 | 0.19 | 0.24 |
| D-TDNN-SS | 128 | 3.1 | AAM-Softmax | Cosine | 1.22 | 0.13 | 0.20 |
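The DCF columns above report minimum normalized detection costs at target priors of 0.01 and 0.001. A minimal sketch of how such a value can be computed from trial scores and labels, assuming unit miss and false-alarm costs (an unoptimized loop, for clarity):

```python
import numpy as np

def min_dcf(scores, labels, p_target=0.01, c_miss=1.0, c_fa=1.0):
    """Minimum normalized detection cost over all decision thresholds."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    norm = min(c_miss * p_target, c_fa * (1 - p_target))  # cost of a trivial system
    best = np.inf
    for t in np.unique(scores):
        p_miss = np.mean(scores[labels == 1] < t)   # target trials rejected
        p_fa = np.mean(scores[labels == 0] >= t)    # non-target trials accepted
        dcf = c_miss * p_miss * p_target + c_fa * p_fa * (1 - p_target)
        best = min(best, dcf)
    return best / norm

# e.g. DCF_0.01  -> min_dcf(scores, labels, p_target=0.01)
#      DCF_0.001 -> min_dcf(scores, labels, p_target=0.001)
```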
If you find D-TDNN helpful for your research, please cite our paper:
@inproceedings{DBLP:conf/interspeech/YuL20,
author = {Ya-Qi Yu and
Wu-Jun Li},
title = {Densely Connected Time Delay Neural Network for Speaker Verification},
booktitle = {Annual Conference of the International Speech Communication Association (INTERSPEECH)},
pages = {921--925},
year = {2020}
}