flp1990 closed this issue 11 months ago.
Sorry, we do not have the bandwidth to do this at the moment.
Ok, thanks.
I think for accum_grad, if you only have one GPU you can multiply it by 8 (the number of GPUs assumed in the original config).
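For example, a minimal sketch of that adjustment, assuming the reference recipe was tuned for 8 GPUs with `accum_grad: 1` (copy the exact value from the original config):

```yaml
# Hypothetical single-GPU adjustment: accumulate gradients over 8 steps so the
# effective batch size (batch_size x accum_grad x num_gpus) stays close to the
# 8-GPU recipe, i.e. 12 x 8 x 1 = 96 vs. 12 x 1 x 8 = 96.
grad_clip: 5
accum_grad: 8      # was 1; multiplied by the 8 GPUs of the reference setup
max_epoch: 70
log_interval: 100
```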
You can also try removing the positional embedding.
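A sketch of that change in the encoder config, assuming a WeNet version whose encoder accepts a `no_pos` positional-encoding option together with plain `selfattn` attention (verify the option names against `wenet/transformer/encoder.py` in your checkout before using them):

```yaml
# Hypothetical tweak: drop the relative positional embedding.
encoder_conf:
    pos_enc_layer_type: 'no_pos'          # was 'rel_pos'; 'no_pos' is assumed to be supported
    selfattention_layer_type: 'selfattn'  # rel_selfattn needs rel_pos, so fall back to plain self-attention
```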
Hi, I used WeNet to run a conformer-transformer experiment on librispeech 100h with the following settings on a single RTX TITAN 24G. The results after 70 epochs are:
"dev_clean_attention_rescoring: English -> 14.17 % N=54402 C=47534 S=6046 D=822 I=843 dev_other_attention_rescoring:English -> 30.81 % N=50948 C=36937 S=11930 D=2081 I=1686 test_clean_attention_rescoring:English -> 14.21 % N=52576 C=45963 S=5806 D=807 I=857 dev_other_attention_rescoring:English -> 35.11 % N=52343 C=38074 S=11132 D=3137 I=4107"
This result is much worse than what I previously got on the librispeech 100h dataset with a model of the same parameters in ESPnet. (https://github.com/espnet/espnet/tree/master/egs2/librispeech_100/asr1)
Could you add a single-GPU experiment result on librispeech 100h? If you have time to run this experiment, thank you very much.
```yaml
# network architecture
# encoder related
encoder: conformer
encoder_conf:
    output_size: 256    # dimension of attention
    attention_heads: 4
    linear_units: 2048  # the number of units of position-wise feed forward
    num_blocks: 12      # the number of encoder blocks
    dropout_rate: 0.1
    positional_dropout_rate: 0.1
    attention_dropout_rate: 0.0
    input_layer: conv2d # encoder input type, you can choose conv2d, conv2d6 and conv2d8
    normalize_before: true
    cnn_module_kernel: 15
    use_cnn_module: True
    activation_type: 'swish'
    pos_enc_layer_type: 'rel_pos'
    selfattention_layer_type: 'rel_selfattn'

# decoder related
decoder: transformer
decoder_conf:
    attention_heads: 4
    linear_units: 2048
    num_blocks: 6
    dropout_rate: 0.1
    positional_dropout_rate: 0.1
    self_attention_dropout_rate: 0.0
    src_attention_dropout_rate: 0.0

# hybrid CTC/attention
model_conf:
    ctc_weight: 0.3
    lsm_weight: 0.1     # label smoothing option
    length_normalized_loss: false

# dataset related
dataset_conf:
    filter_conf:
        max_length: 2000
        min_length: 50
        token_max_length: 400
        token_min_length: 1
        min_output_input_ratio: 0.0005
        max_output_input_ratio: 0.1
    resample_conf:
        resample_rate: 16000
    speed_perturb: true
    fbank_conf:
        num_mel_bins: 80
        frame_shift: 10
        frame_length: 25
        dither: 0.0
    spec_aug: true
    spec_aug_conf:
        num_t_mask: 2
        num_f_mask: 2
        max_t: 50
        max_f: 10
    shuffle: true
    shuffle_conf:
        shuffle_size: 1500
    sort: true
    sort_conf:
        sort_size: 500  # sort_size should be less than shuffle_size
    batch_conf:
        batch_type: 'static' # static or dynamic
        batch_size: 12

grad_clip: 5
accum_grad: 1
max_epoch: 70
log_interval: 100

optim: adam
optim_conf:
    lr: 0.004
scheduler: warmuplr     # pytorch v1.1.0+ required
scheduler_conf:
    warmup_steps: 25000
```