wenet-e2e / wenet

Production First and Production Ready End-to-End Speech Recognition Toolkit
https://wenet-e2e.github.io/wenet/
Apache License 2.0

Synchronous decoding #899

Closed madkote closed 2 years ago

madkote commented 2 years ago

Coming from a Kaldi background, I have discovered that the C++ implementation is asynchronous.

What this is about:

In asynchronous mode it is (almost) impossible in WeNet to align audio chunks with decoder results. The tools provided in the runtime can do the following:

Question Is it possible to run the WeNet runtime (decoder_main, etc.; I also have my own C++ sample code) in synchronous mode?

To Reproduce

std::shared_ptr<wenet::TorchAsrModel> model_ = nullptr;
std::shared_ptr<fst::SymbolTable> symbol_table_ = nullptr;
model_ = std::make_shared<wenet::TorchAsrModel>();
model_->Read("models/final.zip", 1);
symbol_table_ = std::shared_ptr<fst::SymbolTable>(fst::SymbolTable::ReadText("models/words.txt"));

std::shared_ptr<wenet::FeaturePipeline> feature_pipeline_ = nullptr;
std::shared_ptr<wenet::TorchAsrDecoder> decoder_ = nullptr;
std::shared_ptr<wenet::FeaturePipelineConfig> feature_config_ = nullptr;
std::shared_ptr<wenet::DecodeOptions> decode_config_ = nullptr;
std::shared_ptr<wenet::DecodeResource> decode_resource_ = nullptr;
feature_config_ = std::make_shared<wenet::FeaturePipelineConfig>(80, 16000);
feature_pipeline_ = std::make_shared<wenet::FeaturePipeline>(*feature_config_);
decode_config_ = std::make_shared<wenet::DecodeOptions>();
decode_config_->chunk_size = 4; //16;
decode_config_->num_left_chunks = -1;
...
// 
// Note: all settings are defaults, except chunk_size, which is set to 4
// to avoid the decoder getting stuck.
//
...
decoder_ = std::make_shared<wenet::TorchAsrDecoder>(feature_pipeline_, decode_resource_, *decode_config_);
wenet::WavReader wav_reader("test.wav");
std::vector<float> data = std::vector<float>(wav_reader.data(), wav_reader.data() + wav_reader.num_sample());
const int sample_rate = 16000;
const int num_sample = wav_reader.num_sample();
const float interval = 0.5;
const int sample_interval = interval * sample_rate;
for (int start = 0; start < num_sample; start += sample_interval) {
  int end = std::min(start + sample_interval, num_sample);
  std::vector<float> chunk;
  chunk.reserve(end - start);
  for (int j = start; j < end; j++) {
    chunk.push_back(data[j]);
  }
  feature_pipeline_->AcceptWaveform(chunk);
  //
  // The decoder may get stuck here, depending on the parameters.
  // In any case, the decoder does not provide results aligned with the audio chunks.
  // ???
  // wenet::DecodeState state = decoder_->Decode();
  //
}
feature_pipeline_->set_input_finished();
...

// Receive results from the decoder here, which is the only way to get them:
// feed the complete audio to the feature pipeline first, then wait for the decoder.
while (true) {
  wenet::DecodeState state = decoder_->Decode();
  ...
}

Expected behavior Receive a result from the decoder for each audio chunk.

robin1001 commented 2 years ago

The pipeline is almost the same as Kaldi's. The problem is that the audio chunk fed to the feature pipeline and the chunk required for decoding are different sizes. If the audio chunk fed to the feature pipeline is smaller than what one decoding step requires, the decoding will block.

Is it possible to run wenet runtime (decoder_main, etc, I also have my c++ sample code) in synchronous mode?

Yes, and it's easy to do. The difference is that you should add a function like Kaldi's NumFramesReady() and call it before https://github.com/wenet-e2e/wenet/blob/main/runtime/core/decoder/torch_asr_decoder.cc#L105. If the accumulated frames are enough for one decoding chunk, go ahead; otherwise just return and wait for the next decoding call.

madkote commented 2 years ago

Thanks @robin1001 !