yongxuUSTC / DNN-for-speech-enhancement

Is it convenient for you to share the pretrain model with me #12

Open wangjianfly2003 opened 7 years ago

wangjianfly2003 commented 7 years ago

Hi Dr. Xu,

Would it be convenient for you to share the pre-trained model with me?

yongxuUSTC commented 7 years ago

Hi, the initial model was not pre-trained; it was simply randomly initialized.

yongxuUSTC commented 7 years ago

OK, it is here: https://github.com/yongxuUSTC/DNN-for-speech-enhancement/tree/master/toolbox/weights

That directory contains the source code for randomly initializing your model weights and for converting the weights back for MATLAB decoding.

wangjianfly2003 commented 7 years ago

Thank you very much for your kind reply, Dr. Xu. Do you mean that I don't need the pre-training step, and that I can get the same speech enhancement quality as the model you provide in DNN_speech_enhancement_tool using only the fine-tuning process?

yongxuUSTC commented 7 years ago

Yes, correct. Only the fine-tuning process, starting from random initialization. I once tried RBM-based pre-training, but it did not work.

wangjianfly2003 commented 7 years ago

OK. I will try to train a new model on my collected noisy data using the fine-tuning process you provide. Thank you very much.

From reading your decoding code, I guess that you used noisy speech and noisy data as the input features, and clean speech and noisy data as the output features, to train the model you provide. Am I right? Also, you use the normalization file “timit_aurora4_115NT_7SNRS_each190_80uuts_noisy_lsp_be_random_linux_global_mv.mat” to process the input noisy speech. However, I don't understand why this file is used for DNN decoding. Why not use the normalized output features for decoding?

yongxuUSTC commented 7 years ago

The direct mapping is from noisy speech log-power spectra to clean speech log-power spectra. Additionally, you can also predict noise log-power spectra, ideal binary mask, or ideal ratio mask to do some post-processing.

The norm file is used for both training and decoding. During decoding, you should normalize the input noisy features, and then transform the enhanced features back to the original scale using the norm file.
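The normalize-then-invert step described above can be sketched as follows. This is only an illustration under stated assumptions: the norm file is assumed to hold a per-dimension global mean and standard deviation (suggested by the "global_mv" suffix in the file name, but not confirmed in this thread), and `NormStats`, `normalize_frame` and `denormalize_frame` are hypothetical names, not functions from the repository.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical container for per-dimension statistics. The real norm file is
// a MATLAB .mat file whose internal layout is not shown in this thread.
struct NormStats {
    std::vector<float> mean;   // per-dimension global mean
    std::vector<float> stdev;  // per-dimension global standard deviation (> 0)
};

// Normalize one feature frame in place: x <- (x - mean) / stdev.
// Applied to the noisy input features before they are fed to the network.
void normalize_frame(std::vector<float>& frame, const NormStats& s) {
    assert(frame.size() == s.mean.size() && frame.size() == s.stdev.size());
    for (std::size_t d = 0; d < frame.size(); ++d) {
        frame[d] = (frame[d] - s.mean[d]) / s.stdev[d];
    }
}

// Undo the normalization in place: y <- y * stdev + mean.
// Applied to the network output to bring it back to the log-power scale.
void denormalize_frame(std::vector<float>& frame, const NormStats& s) {
    assert(frame.size() == s.mean.size() && frame.size() == s.stdev.size());
    for (std::size_t d = 0; d < frame.size(); ++d) {
        frame[d] = frame[d] * s.stdev[d] + s.mean[d];
    }
}
```

In this sketch the same statistics are used in both directions, matching the statement that one norm file serves both steps.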

yongxuUSTC commented 7 years ago

Where did you find “timit_aurora4_115NT_7SNRS_each190_80uuts_noisy_lsp_be_random_linux_global_mv.mat”?

I think I use a different one: https://drive.google.com/file/d/0B5r5bvRpQ5DRR1lIV1hpZ0RLQ0E/view

wangjianfly2003 commented 7 years ago

Hi Dr. Xu. I made a mistake about the norm file. I tried the same norm file that you used.

In the "BP_GPU.cu" file, I think the code should be modified as below to make the output units linear, that is, the second argument should be changed from "cur_layer_y" to "cur_layer_x":

    cudaMemcpy(dev[0].out, cur_layer_x, n_frames*cur_layer_units*sizeof(float), cudaMemcpyDeviceToDevice);

Am I right?

yongxuUSTC commented 7 years ago

You are right:

    cudaMemcpy(dev[0].out, cur_layer_x, n_frames*cur_layer_units*sizeof(float), cudaMemcpyDeviceToDevice);

I think I uploaded the code for ideal binary mask prediction. I commented out the sigmoid code, but forgot to change "cur_layer_y" to "cur_layer_x".

I have updated the code.
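The difference between the two buffers can be illustrated with a small CPU sketch. It assumes, based on the exchange above, that "cur_layer_x" holds the affine output before the nonlinearity and "cur_layer_y" holds the sigmoid of it; `write_network_output` and its parameters are hypothetical names and do not appear in BP_GPU.cu.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Logistic sigmoid; its range (0, 1) suits mask targets, not log-power spectra.
inline float sigmoid(float x) { return 1.0f / (1.0f + std::exp(-x)); }

// Hypothetical CPU stand-in for the final copy into the output buffer.
// pre_act plays the role of cur_layer_x (values before the nonlinearity).
// With linear_output == true the pre-activations are copied unchanged
// (regression onto log-power spectra); otherwise the sigmoid is applied,
// which corresponds to copying cur_layer_y.
void write_network_output(const std::vector<float>& pre_act,
                          std::vector<float>& out,
                          bool linear_output) {
    assert(out.size() == pre_act.size());
    for (std::size_t i = 0; i < pre_act.size(); ++i) {
        out[i] = linear_output ? pre_act[i] : sigmoid(pre_act[i]);
    }
}
```

A sigmoid output cannot represent targets outside (0, 1), which is one reason a direct mapping to log-power spectra would use the linear branch.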

yongxuUSTC commented 7 years ago

Please also update the "cv_bunch_single" function.

wangjianfly2003 commented 7 years ago

Hi Dr. Xu. Today I trained a model using noisy speech log-power spectra as the input features (50 TIMIT clean speech utterances corrupted with 100 environmental noise types at -5 dB SNR) and clean speech log-power spectra as the target features. The learning rate is 0.0005, the layer sizes are 2827 (257*11), 2048, 2048, 2048, 257, the weights are randomly initialized, and the number of epochs is 35 (the value of squared_err decreased). Then I used the trained model for decoding, but the result was very poor; I can't even hear the speech.

Could you tell me how to determine the cause of the problem?

Is the training set too small? Is the decoding wrong? ...

wangjianfly2003 commented 7 years ago

Could you update your "finetune_DNN_speech_enhancement_dropout_NAT.pl", "interface.cc" and "step1_DNNenh_for 16kHz.m" files for the direct mapping model from noisy speech log-power spectra to clean speech log-power spectra? I think I only changed these three files.

yongxuUSTC commented 7 years ago

To check your code, you can map from clean to clean. If that still does not work, your code has a problem. You should apply the inverse feature normalization as I did in "step1_DNNenh_for 16kHz.m"; please refer to that file for decoding. There is no problem in the decoding code.

wangjianfly2003 commented 7 years ago

Hi Dr. Xu. I mapped from clean to clean, and it still does not seem to work. So I started to check the code and found that the mapping from 11 frames of input features to one frame of target features is correct, but the input data of frame 5 and frame 10 in para->indata are the same. I also checked frame 5 and frame 10 in dataori, which are not the same. So I think there may be something wrong in the following code:

    for(j =0; j<= cur_frame_of_sent - para->fea_context;j++){
        for(i =0;i< para->fea_context;i++){
            for(k=0;k< para->fea_dim;k++){
                para->indata[sample_index[cur_sample] * para->layersizes[0] +k +i * para->fea_dim] = dataori[(frames_processed +j +i) * (2+para->fea_dim) +k+2];
            }
        }

I think the statement

    para->indata[sample_index[cur_sample] * para->layersizes[0] +k +i * para->fea_dim] = dataori[(frames_processed +j +i) * (2+para->fea_dim) +k+2];

should be changed to

    para->indata[sample_index[cur_sample] * para->layersizes[0] +k +i * para->fea_dim] = dataori[(frames_processed +j * para->fea_context +i) * (2+para->fea_dim) +k+2];

Am I right?
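To experiment with the indexing in isolation, the context-window splicing can be reduced to a standalone sketch. It is a simplification under stated assumptions, not the code from interface.cc: the window advances one frame per sample (as the `+j +i` term in the quoted loop does), the two leading values per row implied by `(2+para->fea_dim)` and `+2` are left out, samples are not shuffled through `sample_index`, and `splice_frames` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Splice `context` consecutive frames into one input vector per sample,
// advancing one frame per sample.
// feats:   row-major [n_frames x dim]
// returns: row-major [(n_frames - context + 1) x (context * dim)]
std::vector<float> splice_frames(const std::vector<float>& feats,
                                 std::size_t n_frames,
                                 std::size_t dim,
                                 std::size_t context) {
    assert(feats.size() == n_frames * dim);
    assert(context >= 1 && n_frames >= context);
    const std::size_t n_samples = n_frames - context + 1;
    std::vector<float> out(n_samples * context * dim);
    for (std::size_t j = 0; j < n_samples; ++j) {        // sample (window start)
        for (std::size_t i = 0; i < context; ++i) {      // slot inside the window
            for (std::size_t k = 0; k < dim; ++k) {      // feature dimension
                out[j * context * dim + i * dim + k] = feats[(j + i) * dim + k];
            }
        }
    }
    return out;
}
```

In this sketch, slot i of sample j holds frame j + i, so a given frame appears in up to `context` different samples at different slots. Comparing a slot of one sample with the same slot of another sample therefore compares two different frames.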

wangjianfly2003 commented 7 years ago

I commented out the following code in the interface.cc file:

    /*
    i=i-1;
    for(k=129;k< 2*(para->fea_dim);k++){
        para->indata[sample_index[cur_sample] * para->layersizes[0] +k +i * para->fea_dim] =
            (dataori[(frames_processed + 0) * (2+para->fea_dim) +(k-129)+2]
            +dataori[(frames_processed + 1) * (2+para->fea_dim) +(k-129)+2]
            +dataori[(frames_processed + 2) * (2+para->fea_dim) +(k-129)+2]
            +dataori[(frames_processed + 3) * (2+para->fea_dim) +(k-129)+2]
            +dataori[(frames_processed + 4) * (2+para->fea_dim) +(k-129)+2]
            +dataori[(frames_processed + 5) * (2+para->fea_dim) +(k-129)+2])/6.0f;
    }
    */
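The arithmetic in that block is a per-dimension average over the first six frames of the utterance. A compact sketch of the same computation is given below; `mean_of_leading_frames` is a hypothetical helper, the two leading values per row in `dataori` are omitted, and what the average is used for is not stated in this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Per-dimension mean of the first n_lead frames of an utterance.
// feats:   row-major [n_frames x dim]
// returns: vector of length dim
std::vector<float> mean_of_leading_frames(const std::vector<float>& feats,
                                          std::size_t n_frames,
                                          std::size_t dim,
                                          std::size_t n_lead) {
    assert(feats.size() == n_frames * dim);
    assert(n_lead >= 1 && n_lead <= n_frames);
    std::vector<float> mean(dim, 0.0f);
    for (std::size_t t = 0; t < n_lead; ++t) {
        for (std::size_t k = 0; k < dim; ++k) {
            mean[k] += feats[t * dim + k];  // accumulate frame t, dimension k
        }
    }
    for (std::size_t k = 0; k < dim; ++k) {
        mean[k] /= static_cast<float>(n_lead);
    }
    return mean;
}
```

Calling it with `n_lead` set to 6 reproduces the division by `6.0f` in the quoted code.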