yh1008 / speech-to-text

mixlingual speech recognition system; hybrid (GMM+NNet) model; Kaldi + Keras
http://llcao.net/cu-deeplearning17/project.html

WARNING: arpa2fst #12

Open yh1008 opened 7 years ago

yh1008 commented 7 years ago

To generate G.fst I executed

arpa2fst --disambig-symbol=#0 --read-symbol-table=$lang/words.txt $local/tmp/lm.arpa $lang/G.fst

which outputs the following warning:

yh2901@instance-1:~/kaldi/egs/codeswitch$ ./make_graph.sh 

===== MAKING G.fst =====

arpa2fst --disambig-symbol=#0 --read-symbol-table=data/lang/words.txt data/local/tmp/lm.arpa data/lang/G.fst 
LOG (arpa2fst[5.0.61~1-37b53]:Read():arpa-file-parser.cc:96) Reading \data\ section.
LOG (arpa2fst[5.0.61~1-37b53]:Read():arpa-file-parser.cc:151) Reading \1-grams: section.
WARNING (arpa2fst[5.0.61~1-37b53]:Read():arpa-file-parser.cc:219) line 11 [-5.472714    -ying   -0.3005793] skipped: word '-ying' not in symbol table
WARNING (arpa2fst[5.0.61~1-37b53]:Read():arpa-file-parser.cc:219) line 21 [-5.472714    Archi   -0.2663992] skipped: word 'Archi' not in symbol table
WARNING (arpa2fst[5.0.61~1-37b53]:Read():arpa-file-parser.cc:219) line 23 [-5.472714    Beijing -0.2994594] skipped: word 'Beijing' not in symbol table
WARNING (arpa2fst[5.0.61~1-37b53]:Read():arpa-file-parser.cc:219) line 25 [-5.472714    Cers    -0.3004402] skipped: word 'Cers' not in symbol table
WARNING (arpa2fst[5.0.61~1-37b53]:Read():arpa-file-parser.cc:219) line 28 [-5.472714    Deutsche    -0.3009956] skipped: word 'Deutsche' not in symbol table
WARNING (arpa2fst[5.0.61~1-37b53]:Read():arpa-file-parser.cc:219) line 35 [-5.472714    Intel   -0.3010285] skipped: word 'Intel' not in symbol table
WARNING (arpa2fst[5.0.61~1-37b53]:Read():arpa-file-parser.cc:219) line 36 [-5.472714    Inter   -0.3004702] skipped: word 'Inter' not in symbol table
WARNING (arpa2fst[5.0.61~1-37b53]:Read():arpa-file-parser.cc:219) line 38 [-5.296623    J-Cs    -0.2981122] skipped: word 'J-Cs' not in symbol table
WARNING (arpa2fst[5.0.61~1-37b53]:Read():arpa-file-parser.cc:219) line 40 [-4.732351    K-box   -0.281] skipped: word 'K-box' not in symbol table
WARNING (arpa2fst[5.0.61~1-37b53]:Read():arpa-file-parser.cc:219) line 41 [-5.171684    K-pop   -0.2621106] skipped: word 'K-pop' not in symbol table
WARNING (arpa2fst[5.0.61~1-37b53]:Read():arpa-file-parser.cc:219) line 47 [-5.171684    Malaysia    -0.2552751] skipped: word 'Malaysia' not in symbol table
WARNING (arpa2fst[5.0.61~1-37b53]:Read():arpa-file-parser.cc:219) line 48 [-5.296623    Mochik  -0.2662709] skipped: word 'Mochik' not in symbol table
WARNING (arpa2fst[5.0.61~1-37b53]:Read():arpa-file-parser.cc:219) line 53 [-5.472714    Psychometric    -0.3009759] skipped: word 'Psychometric' not in symbol table
WARNING (arpa2fst[5.0.61~1-37b53]:Read():arpa-file-parser.cc:219) line 179 [-5.472714   Shanghai    -0.3010285] skipped: word 'Shanghai' not in symbol table
WARNING (arpa2fst[5.0.61~1-37b53]:Read():arpa-file-parser.cc:219) line 180 [-5.472714   Shearwood   -0.2965806] skipped: word 'Shearwood' not in symbol table
WARNING (arpa2fst[5.0.61~1-37b53]:Read():arpa-file-parser.cc:219) line 181 [-5.472714   Suzhou  -0.2997037] skipped: word 'Suzhou' not in symbol table
WARNING (arpa2fst[5.0.61~1-37b53]:Read():arpa-file-parser.cc:219) line 182 [-5.472714   Swensens    -0.299923] skipped: word 'Swensens' not in symbol table
WARNING (arpa2fst[5.0.61~1-37b53]:Read():arpa-file-parser.cc:219) line 184 [-5.472714   T-shirt -0.3006591] skipped: word 'T-shirt' not in symbol table
WARNING (arpa2fst[5.0.61~1-37b53]:Read():arpa-file-parser.cc:219) line 193 [-5.472714   [di]    -0.3010029] skipped: word '[di]' not in symbol table
WARNING (arpa2fst[5.0.61~1-37b53]:Read():arpa-file-parser.cc:219) line 194 [-5.472714   [gi]    -0.301] skipped: word '[gi]' not in symbol table
WARNING (arpa2fst[5.0.61~1-37b53]:Read():arpa-file-parser.cc:219) line 195 [-5.472714   [uh-huh]    -0.3009854] skipped: word '[uh-huh]' not in symbol table
WARNING (arpa2fst[5.0.61~1-37b53]:Read():arpa-file-parser.cc:219) line 198 [-2.450492   a   -0.2567758] skipped: word 'a' not in symbol table
WARNING (arpa2fst[5.0.61~1-37b53]:Read():arpa-file-parser.cc:219) line 199 [-5.472714   a-famosa    -0.2955976] skipped: word 'a-famosa' not in symbol table
WARNING (arpa2fst[5.0.61~1-37b53]:Read():arpa-file-parser.cc:219) line 200 [-5.472714   aback   -0.2994594] skipped: word 'aback' not in symbol table
WARNING (arpa2fst[5.0.61~1-37b53]:Read():arpa-file-parser.cc:219) line 201 [-5.472714   abalone -0.3009656] skipped: word 'abalone' not in symbol table
WARNING (arpa2fst[5.0.61~1-37b53]:Read():arpa-file-parser.cc:219) line 202 [-5.472714   abandoned   -0.3008771] skipped: word 'abandoned' not in symbol table
WARNING (arpa2fst[5.0.61~1-37b53]:Read():arpa-file-parser.cc:219) line 203 [-5.472714   abduct  -0.2921268] skipped: word 'abduct' not in symbol table
WARNING (arpa2fst[5.0.61~1-37b53]:Read():arpa-file-parser.cc:219) line 204 [-5.472714   abiding -0.2994594] skipped: word 'abiding' not in symbol table
WARNING (arpa2fst[5.0.61~1-37b53]:Read():arpa-file-parser.cc:219) line 205 [-5.472714   abilities   -0.3001809] skipped: word 'abilities' not in symbol table
WARNING (arpa2fst[5.0.61~1-37b53]:Read():arpa-file-parser.cc:219) line 206 [-5.074774   ability -0.4649765] skipped: word 'ability' not in symbol table
LOG (arpa2fst[5.0.61~1-37b53]:Read():arpa-file-parser.cc:151) Reading \2-grams: section.
LOG (arpa2fst[5.0.61~1-37b53]:Read():arpa-file-parser.cc:151) Reading \3-grams: section.
WARNING (arpa2fst[5.0.61~1-37b53]:Read():arpa-file-parser.cc:259) Of 161603 parse warnings, 30 were reported. Run program with --max_warnings=-1 to see all warnings
LOG (arpa2fst[5.0.61~1-37b53]:RemoveRedundantStates():arpa-lm-compiler.cc:355) Reduced num-states from 91509 to 15997

Need to determine whether these "word not in symbol table" warnings can be fixed, or whether they matter at all.

yh1008 commented 7 years ago

The word symbol table (words.txt) does not contain the words listed above, which is odd, because words.txt is supposed to hold every unique word that appears in text. It is possible that the shell pipeline used to generate words.txt, cut -d ' ' -f 2- text | sed 's/ /\n/g' | sort -u > words.txt, fails to parse these words.
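One way to check this is to list the unigram words in the ARPA file that have no entry in words.txt. The sketch below is only an illustration: the helper name missing_words is hypothetical, and it assumes words.txt has one "word id" pair per line and that the ARPA file uses the usual \1-grams: section header.

```shell
# List unigram words in an ARPA LM that are absent from a Kaldi-style words.txt.
# Hypothetical helper, not part of Kaldi.
# Usage: missing_words lm.arpa words.txt
missing_words() {
  arpa=$1
  symtab=$2
  awk '
    # First file (words.txt): "<word> <id>" per line; remember every word.
    NR == FNR { known[$1] = 1; next }
    # Second file (ARPA): track whether we are inside the unigram section.
    /^\\1-grams:/ { in_uni = 1; next }
    /^\\/         { in_uni = 0; next }
    # Unigram lines look like "<logprob> <word> [<backoff>]".
    in_uni && NF >= 2 && !($2 in known) { print $2 }
  ' "$symtab" "$arpa"
}
```

With the paths from the log above, this could be run as missing_words data/local/tmp/lm.arpa data/lang/words.txt | head. Sentence-boundary tokens from the language model may show up in the list as well. If most of the reported words also exist in words.txt in a different case, the mismatch is one of casing rather than parsing.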

yh1008 commented 7 years ago

We are going to convert everything in the transcript to upper case (the lexicon is already all upper case).
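A minimal sketch of that conversion, assuming the transcript is a Kaldi-style text file with the utterance ID in the first field and the words after it. The helper name upper_transcript is hypothetical; the ID is left untouched so it still matches the other data files.

```shell
# Upper-case every word in a Kaldi-style transcript, keeping the utterance ID as-is.
# Hypothetical helper, not part of Kaldi.
# Usage: upper_transcript text > text.upper
upper_transcript() {
  awk '{
    out = $1                          # field 1 is the utterance ID; keep it unchanged
    for (i = 2; i <= NF; i++)         # remaining fields are the words
      out = out " " toupper($i)
    print out
  }' "$1"
}
```

Bracketed tokens such as [uh-huh] are upper-cased too, so the lexicon needs matching entries for them. How toupper treats non-ASCII characters depends on the awk implementation and locale, so the non-English part of the transcript is worth spot-checking. The language model would also have to be rebuilt from the upper-cased text for the warnings to go away.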