wenet-e2e / wenet

Production First and Production Ready End-to-End Speech Recognition Toolkit
https://wenet-e2e.github.io/wenet/
Apache License 2.0

swbd run.sh data preparation error #2558

Closed: caiyuxi closed this issue 1 week ago

caiyuxi commented 3 months ago

Describe the bug

Unable to train a model using wenet/examples/swbd/s0/run.sh; the script fails during data preparation.

To Reproduce

Steps to reproduce the behavior:

  1. Open wenet/examples/swbd/s0/run.sh.
  2. Set swbd1_dir and eval2000_dir to the local corpus paths.
  3. Set the following:
    export CUDA_VISIBLE_DEVICES="0"
    stage=-1 # start from 0 if you need to start from data preparation
  4. Run run.sh.

Expected behavior

The data is downloaded correctly and training starts.

Log and Errors

 *** Downloading trascriptions and dictionary ***
--2024-06-17 13:42:46--  http://www.openslr.org/resources/5/switchboard_word_alignments.tar.gz
Resolving www.openslr.org (www.openslr.org)... 46.101.158.64
Connecting to www.openslr.org (www.openslr.org)|46.101.158.64|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: http://openslr.elda.org/resources/5/switchboard_word_alignments.tar.gz [following]
--2024-06-17 13:42:46--  http://openslr.elda.org/resources/5/switchboard_word_alignments.tar.gz
Resolving openslr.elda.org (openslr.elda.org)... 141.94.109.138, 2001:41d0:203:ad8a::
Connecting to openslr.elda.org (openslr.elda.org)|141.94.109.138|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 49651293 (47M) [application/x-gzip]
Saving to: ‘switchboard_word_alignments.tar.gz’

switchboard_word_alignments.tar.gz              100%[======================================================================================================>]  47.35M  12.6MB/s    in 5.1s    

2024-06-17 13:42:52 (9.28 MB/s) - ‘switchboard_word_alignments.tar.gz’ saved [49651293/49651293]

File data/local/dict_nosp/lexicon0.txt is read-only; trying to patch anyway
patching file data/local/dict_nosp/lexicon0.txt
Prepared input dictionary and phone-sets for Switchboard phase 1.
Warning: expected 2435 or 2438 data data files, found 0
Switchboard-1 data preparation succeeded.
local/swbd1_data_prep.sh: line 144: utils/fix_data_dir.sh: No such file or directory
Expecting directory <my-path>/swbd/LDC2002S09/hub5e_00/english to be present
tools/subset_data_dir.sh: reducing #utt from 264333 to 4000
tools/subset_data_dir.sh: reducing #utt from 264333 to 260333
Reduced number of utterances from 260333 to 192827
cp: cannot stat 'data/eval2000/text': No such file or directory
run.sh: line 82: data/eval2000/text.org2: No such file or directory
cut: data/eval2000/text.org: No such file or directory
awk: fatal: cannot open file `data/eval2000/text.org' for reading (No such file or directory)
run.sh: line 83: data/eval2000/text: No such file or directory
tools/fix_data_dir.sh: no such file data/eval2000/utt2spk
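
The "Expecting directory ... to be present" line suggests the eval2000 preparation stopped early, and the later data/eval2000/... errors appear to follow from that. A minimal sketch of a check that could be run before run.sh is shown below. The helper name check_eval2000_layout is hypothetical, and the LDC2002S09/hub5e_00/english subpath is taken only from the log message above, assuming eval2000_dir is the directory that precedes it.

```shell
# Hypothetical helper (not part of wenet): report whether the directory
# named in the "Expecting directory ..." log line exists.
check_eval2000_layout() {
  eval_root="$1"
  # Subpath copied from the log message; assumed to sit under eval2000_dir.
  want="$eval_root/LDC2002S09/hub5e_00/english"
  if [ -d "$want" ]; then
    echo "OK: $want exists"
    return 0
  fi
  echo "Missing: $want" >&2
  return 1
}

# Only runs when eval2000_dir is set in the environment.
if [ -n "${eval2000_dir:-}" ]; then
  check_eval2000_layout "$eval2000_dir" || true
fi
```

If the check reports the directory as missing, the eval2000 corpus is probably either not extracted or laid out differently from what the script expects.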

Additional context and questions

  1. Changing utils/fix_data_dir.sh to tools/fix_data_dir.sh got rid of the "local/swbd1_data_prep.sh: line 144: utils/fix_data_dir.sh: No such file or directory" error.
  2. find -L $SWBD_DIR -iname '*.sph' returns nothing. Is a prerequisite step missing from the script?
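
The log only shows the transcriptions and dictionary being downloaded, so the audio presumably has to be in place under the corpus directory before the script runs. A rough pre-flight sketch using the same find expression is shown below. The function names count_sph and check_swbd_audio are hypothetical, and the accepted counts (2435 or 2438) are taken from the warning in the log.

```shell
# Count .sph files under a directory, following symlinks
# (same find expression as in item 2 above).
count_sph() {
  find -L "$1" -iname '*.sph' 2>/dev/null | wc -l | tr -d ' '
}

# Hypothetical check: succeed only if the count matches one of the
# values named in the "expected 2435 or 2438" warning.
check_swbd_audio() {
  sph_count="$(count_sph "$1")"
  if [ "$sph_count" -eq 2435 ] || [ "$sph_count" -eq 2438 ]; then
    echo "OK: found $sph_count .sph files under $1"
    return 0
  fi
  echo "Found $sph_count .sph files under $1 (expected 2435 or 2438)" >&2
  return 1
}

# Only runs when SWBD_DIR is set in the environment.
if [ -n "${SWBD_DIR:-}" ]; then
  check_swbd_audio "$SWBD_DIR" || true
fi
```

A count of 0, as in the log, would point at SWBD_DIR not containing the extracted audio rather than at a later stage of the script.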

robin1001 commented 2 months ago

You can refer to https://github.com/kaldi-asr/kaldi/tree/master/egs/swbd/s5 to figure out the problem.