wenet-e2e / wenet

Production First and Production Ready End-to-End Speech Recognition Toolkit
https://wenet-e2e.github.io/wenet/
Apache License 2.0

WFST decoding without space #1048

Closed (ghost closed 2 years ago)

ghost commented 2 years ago

Bug Detail

Hello, I'm an engineer building a speech recognition model with the wenet library. Thanks to your work, we trained an e2e model with a reasonable WER. We tried several decoding methods, and the decoding functions implemented in Python all worked without problems. However, WFST decoding, which runs in the compiled C++ runtime, produces incorrect output.

Our dataset is a mixture of English, Korean, and numbers. After WFST decoding, all spaces disappear in the non-English parts (Korean and numbers). In addition, English text comes out entirely in lowercase, losing its original capitalization.

Bug Result

Reference:
2 1000 10 8 년 8 월 6 일 left 열두시 방향 병변에 대해 조직검사 시행해 fibroadenoma 로 진단되었음 period 줄바꿔서 2 1000 10 9 년 3 월 2 10 6 일 breast US 와 비교판독함 period 줄바꾸고 줄바꿔서 양측 breast parenchyma 가 heterogeneous echotexture 를 보임 period 줄바꿔서 right breast 한시 방향 nipple 에서 4 cm 거리에 약 0 ssjum 7 cm 크기의 benign looking mass 가 있으며 comma 이전과 비교하여 큰 변화 없음 period 줄바꿔서 left breast 열두시 방향 nipple 에서 2 cm 거리에 약 0 ssjum 7 cm 크기의 biopsy proven lesion 이 있으며 이전과 비교하여 큰 변화 없음 period 줄바꿔서 그 외 양측 breast 에 several cyst 들이 있음 period

Hypothesis with WFST:
21000108년8월6일left열두시방향병변에대해조직검사시행에fibroadenoma로진단되었음period줄바꿔서21000109년3월2106일breast us와비교판독함period줄바꾸고줄바꿔서양측breast parenchyma가heterogeneous echotexture를보임period줄바꿔서right breast한시방향nipple에서4cm거리에약0ssjum7cm크기의benign looking mass가있으며comma이전과비교하여큰변화없음period줄바꿔서left breast열두시방향nipple에서2cm거리에약0ssjum7cm크기의biopsy proven lesion이있으며이전과비교하여큰변화없음period그외양측breast에several cyst들이>있음period

Is there any solution to a problem like this?

xingchensong commented 2 years ago

Hi, you can set language_type=1 to preserve blanks and lowercase=false to keep English words from being lowercased.

https://github.com/wenet-e2e/wenet/blob/main/runtime/core/decoder/params.h#L63-L68

https://github.com/wenet-e2e/wenet/blob/main/runtime/core/post_processor/post_processor.cc#L26-L56
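For readers who want to see why the two options matter, here is a rough Python sketch of the behavior visible in the hypothesis above. It is a toy model inferred from that sample output, not WeNet's actual post-processor: the function name `postprocess`, the rule "keep a blank only between two ASCII letters", and the treatment of `lowercase` as independent of `language_type` are all assumptions made for illustration. The linked params.h and post_processor.cc remain the authoritative reference.

```python
def _is_ascii_letter(ch: str) -> bool:
    # True only for a single ASCII letter; "" (no neighbor) gives False.
    return ch.isascii() and ch.isalpha()


def postprocess(text: str, language_type: int = 0, lowercase: bool = True) -> str:
    """Toy model of the space/case handling (hypothetical, not WeNet code).

    language_type=0: drop every blank unless it sits between two ASCII letters.
    language_type=1: leave blanks untouched.
    lowercase: lowercase the final string (assumed independent of language_type).
    """
    if language_type == 1:
        result = text
    else:
        kept = []
        for i, ch in enumerate(text):
            if ch == " ":
                prev_ch = text[i - 1] if i > 0 else ""
                next_ch = text[i + 1] if i + 1 < len(text) else ""
                # Keep the blank only when both neighbors are ASCII letters.
                if not (_is_ascii_letter(prev_ch) and _is_ascii_letter(next_ch)):
                    continue
            kept.append(ch)
        result = "".join(kept)
    return result.lower() if lowercase else result


sample = "left 열두시 방향 breast US 와 4 cm"
print(postprocess(sample))  # left열두시방향breast us와4cm
print(postprocess(sample, language_type=1, lowercase=False))  # unchanged
```

With the defaults assumed here, the sketch reproduces the pattern in the report: blanks survive only inside runs of English words, and "US" becomes "us". Passing language_type=1 and lowercase=false returns the text as the model produced it.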