wenet-e2e / wesignal

Production-first, NN-based on-device signal processing toolkit.
Apache License 2.0

This is really interesting as the initial audio processing is often absent #1

Open StuartIanNaylor opened 1 year ago

StuartIanNaylor commented 1 year ago

1...

There is an early repo of yours, @robin1001: https://github.com/robin1001/beamforming

Multichannel delay-sum beamforming is quite low load compared to other approaches and can run even at microcontroller level. I hacked PortAudio into the above to get a realtime version: https://github.com/StuartIanNaylor/2ch_delay_sum. It's a horrid hack, as I was just testing how it would operate, but it easily runs with plenty of headroom on a Pi 3.

The reduction in reverberation can considerably extend far-field range and attenuate non-focused noise, and the beam direction can likely be set via KWS for the command sentence that follows. It's only the GCC-PHAT that creates much load, and that load increases with each addition to the reference signal. With 4 channels I think you can likely treat the array as 2x 2-channel pairs with one reversed, and use Pythagoras to calculate the delay length on rectangular 4-mic arrays, so only a single GCC-PHAT calculation is needed, hugely cutting the load.
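For reference, a minimal sketch of the GCC-PHAT + delay-sum core, assuming NumPy, float sample arrays, and a 2-mic pair; the helper names are mine, not from robin1001/beamforming:

```python
import numpy as np

def gcc_phat(sig, ref, fs, max_tau=None, interp=16):
    """Estimate the delay of `sig` relative to `ref` via GCC-PHAT."""
    n = sig.shape[0] + ref.shape[0]          # FFT length for linear correlation
    R = np.fft.rfft(sig, n=n) * np.conj(np.fft.rfft(ref, n=n))
    R /= np.abs(R) + 1e-15                   # PHAT weighting: keep phase only
    cc = np.fft.irfft(R, n=interp * n)       # zero-padded IFFT interpolates the peak
    max_shift = interp * n // 2
    if max_tau is not None:
        max_shift = min(int(interp * fs * max_tau), max_shift)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / float(interp * fs)        # delay in seconds

def delay_sum(ch0, ch1, fs):
    """Align ch1 onto ch0 using the GCC-PHAT estimate, then average."""
    tau = gcc_phat(ch1, ch0, fs)             # > 0 means ch1 lags ch0
    shift = int(round(tau * fs))
    aligned = np.roll(ch1, -shift)           # crude integer-sample alignment
    return 0.5 * (ch0 + aligned)
```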

2-channel works well over I2S, and even a USB sound card like https://plugable.com/products/usb-audio has a stereo ADC.

2...

I didn't find a similarly low-load BSS algorithm, and my hack remains my first-ever attempt at C, but the same still applies: n-channel BSS often has the secondary benefit of attenuating reverberation and extending far-field range. The problem with many n-channel BSS algorithms is that the output channel order can be totally random, which is why targeted speech extraction seems the way to go.

With some lateral thought, as Espressif seem to have done, you can run a low-load KWS on each channel output, and overall this still reduces load drastically compared to much more complex models. I have a hunch they use the BC-ResNet KWS mentioned in https://github.com/google-research/google-research/tree/master/kws_streaming from this paper: https://arxiv.org/pdf/2106.04140.pdf
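A sketch of that idea, resolving the BSS channel-order ambiguity by scoring every separated output with a small KWS and keeping the most confident channel; `kws_posterior` is a placeholder for whatever model you run, not a real API:

```python
import numpy as np

def kws_posterior(channel: np.ndarray) -> float:
    """Placeholder: return the wake-word posterior for one separated stream."""
    raise NotImplementedError

def pick_target_channel(separated: np.ndarray) -> int:
    """separated: (n_channels, n_samples) BSS outputs in arbitrary order.
    Return the index of the channel most likely to contain the wake word."""
    scores = [kws_posterior(ch) for ch in separated]
    return int(np.argmax(scores))
```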

3...

Both of the above focus on low load, which IMO is important: I think multiple low-cost distributed wireless arrays servicing a zone can provide better results than a single high-load/high-tech unit. You could select the best zone stream on some sort of argmax/RMS of the KWS score, where multiple simpler algorithms just work through positional coverage, as one is always closer. I also think you can share a central ASR over multiple zones, which is why I am a fan of a client-server model of distributed wireless arrays for mic/ASR (see the sketch below).

Whatever algorithm you use will create a tonal signature in its artefacts that can affect WER, so having tools to add various noise samples at various levels to your dataset, then process it with that same algorithm so its signature is trained into the model, can greatly increase accuracy.
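A sketch of the zone selection, assuming each satellite reports its wake-word posterior and frame RMS to the server; the tie-break weight is made-up illustration, not a tested recipe:

```python
def select_zone(reports):
    """reports: dict zone_id -> (kws_posterior, rms).
    Prefer the zone that heard the wake word most confidently;
    break near-ties with signal level (the closer mic is usually louder)."""
    def score(report):
        posterior, rms = report
        return posterior + 0.1 * rms   # arbitrary illustrative weight
    return max(reports, key=lambda zone: score(reports[zone]))
```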

4...

AEC is only really applicable if the device recording is also playing, as linear AEC can be extremely sensitive to clock drift. It's only really applicable to smart-speaker/assistant consumer devices, whereas my preference for open source is to provide more interoperable systems where wireless audio and mics can interface to a single unit in race-till-idle fashion, providing for multiple zones. Non-linear AEC could be used, I guess, but with the simple physics of separating mic and speaker, algorithms such as speech separation or ANC alone are likely enough.

5...

The SpeexAGC ALSA plugin doesn't get installed on Debian-based systems because, for some reason, the packaged SpeexDSP version is an RC while alsa-plugins expects the full release, so it doesn't get compiled. If you pull the last release of Speex and SpeexDSP, compile and install them, and then recompile your installed version of alsa-plugins, it works. I think it's also missing an easily added parameter: currently the only parameter seems to be a rate parameter, and it lacks a max gain to stop the AGC ramping the noise floor right up.
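The max-gain control does exist in the SpeexDSP preprocess API itself, just not in the plugin. A minimal ctypes sketch of driving it directly, with constant values taken from speex_preprocess.h and the library soname assumed to be libspeexdsp.so.1:

```python
import ctypes

SPEEX_PREPROCESS_SET_AGC = 2
SPEEX_PREPROCESS_SET_AGC_LEVEL = 6
SPEEX_PREPROCESS_SET_AGC_MAX_GAIN = 30

lib = ctypes.CDLL("libspeexdsp.so.1")
lib.speex_preprocess_state_init.restype = ctypes.c_void_p
lib.speex_preprocess_state_init.argtypes = [ctypes.c_int, ctypes.c_int]
lib.speex_preprocess_ctl.argtypes = [ctypes.c_void_p, ctypes.c_int, ctypes.c_void_p]
lib.speex_preprocess_run.argtypes = [ctypes.c_void_p, ctypes.c_void_p]

frame, rate = 160, 16000                      # 10 ms frames at 16 kHz
st = lib.speex_preprocess_state_init(frame, rate)

on = ctypes.c_int(1)
lib.speex_preprocess_ctl(st, SPEEX_PREPROCESS_SET_AGC, ctypes.byref(on))
level = ctypes.c_float(8000.0)                # AGC target level (float)
lib.speex_preprocess_ctl(st, SPEEX_PREPROCESS_SET_AGC_LEVEL, ctypes.byref(level))
max_gain = ctypes.c_int(12)                   # dB cap: stops the noise floor ramping up
lib.speex_preprocess_ctl(st, SPEEX_PREPROCESS_SET_AGC_MAX_GAIN, ctypes.byref(max_gain))

# then for each int16 buffer `buf` of `frame` samples:
#   lib.speex_preprocess_run(st, buf)
```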

6... Provide tools to create spatial-filter-preprocessed ASR/KWS datasets so that filter signatures are trained into models.

Really interesting repo, and I'm really looking forward to what you guys come up with.

https://github.com/Rikorose/DeepFilterNet is the highest-quality filter I have ever used, but it's 48 kHz full-band and quite high load, especially as the LADSPA plugin uses a single thread, so it needs quite a capable core; it could likely use a multithreaded framework other than tract, though.

StuartIanNaylor commented 11 months ago

@robin1001 How are things going?

I have been searching for some form of binaural audio source separation algorithm/model, and there are some, but they seem to be offline rather than online/realtime. If you can get the algorithm running even on modest hardware, with 2x KWS running on the separated streams, you can use on-device training to capture the keyword and create a profile model to bias the main one.

I have always thought of KWS mics as satellites that can themselves form a distributed wireless array, using best-argmax for the stream choice. That on-device training would actually happen upstream: a rolling window of the last KW would be sent and stored only on completed ASR commands (assumed correct).
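A sketch of that capture logic, assuming 16 kHz mono int16 audio and a fixed-length ring buffer; the trigger callbacks are placeholders for whatever KWS/ASR hooks you have:

```python
from collections import deque
import numpy as np

FS = 16000
ring = deque(maxlen=int(FS * 2.0))   # rolling window of the last ~2 s of samples
pending_kw = None                    # snapshot taken when the keyword fires

def on_audio_frame(frame: np.ndarray):
    """Called for every captured frame; maintains the rolling window."""
    ring.extend(frame.tolist())

def on_keyword_detected():
    """Snapshot the window that contains the keyword."""
    global pending_kw
    pending_kw = np.array(ring, dtype=np.int16)

def on_asr_command_complete(store):
    """Persist the snapshot only when the following ASR command completes
    (assumed correct), for later upstream/on-device training."""
    global pending_kw
    if pending_kw is not None:
        store(pending_kw)
        pending_kw = None
```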

Whatever algorithm we use will impart a fingerprint on the processed audio, so to make things optimal, the ASR and KWS datasets should be mixed with noise using a spatial audio lib such as pyroomacoustics and then processed with that same algorithm, giving optimised datasets for ASR and KWS.
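A minimal sketch with pyroomacoustics, using random noise as a stand-in for real utterances and noise samples; room size, positions, and absorption are illustrative only:

```python
import numpy as np
import pyroomacoustics as pra

fs = 16000
# Shoebox room with some reverb
room = pra.ShoeBox([5.0, 4.0, 2.5], fs=fs,
                   materials=pra.Material(0.3), max_order=10)

speech = np.random.randn(fs * 2)   # stand-in for a clean dataset utterance
noise = np.random.randn(fs * 2)    # stand-in for a noise sample

room.add_source([1.0, 2.0, 1.5], signal=speech)
room.add_source([4.0, 1.0, 1.5], signal=0.3 * noise)  # noise at a lower level

# 2-mic array matching the capture hardware you intend to deploy
mics = np.array([[2.45, 2.55],     # x
                 [2.00, 2.00],     # y
                 [1.20, 1.20]])    # z
room.add_microphone_array(pra.MicrophoneArray(mics, fs))

room.simulate()
multichannel = room.mic_array.signals  # (2, n_samples): feed this through your
                                       # beamformer/BSS, then save the output
```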

Still searching for a good low-computational binaural source separation algorithm/model, though; the rest is straightforward. Something like https://github.com/vivjay30/Cone-of-Silence/issues/9#issuecomment-764408318 would seem perfect, but unfortunately it's non-causal, though I'm curious why a rolling window would not work.