wiseman / py-webrtcvad

Python interface to the WebRTC Voice Activity Detector

using webrtcvad in realtime application #29

Closed: viduraakalanka closed this issue 5 years ago

viduraakalanka commented 5 years ago

Hi, I was curious whether we could use this tool for a real-time application, that is, whether we could detect voices coming directly from a mic in a noisy environment. If so, I would be glad if anyone could give me, or direct me to, such an implementation.

wiseman commented 5 years ago

Yes, absolutely. I use it that way all the time. Look at pyaudio for an example of reading audio input from a microphone etc.
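
A rough sketch of that route (untested as posted; the 16 kHz rate, 30 ms frame size, and mode 1 are just values the VAD accepts, not anything special):

import pyaudio
import webrtcvad

RATE = 16000                              # webrtcvad accepts 8000, 16000, 32000 or 48000 Hz
FRAME_MS = 30                             # webrtcvad accepts 10, 20 or 30 ms frames
FRAME_SAMPLES = RATE * FRAME_MS // 1000

vad = webrtcvad.Vad(1)                    # aggressiveness mode 0-3

pa = pyaudio.PyAudio()
stream = pa.open(format=pyaudio.paInt16, channels=1, rate=RATE,
                 input=True, frames_per_buffer=FRAME_SAMPLES)

try:
    while True:
        frame = stream.read(FRAME_SAMPLES)    # one 30 ms frame of 16-bit mono PCM bytes
        print(vad.is_speech(frame, RATE))
finally:
    stream.stop_stream()
    stream.close()
    pa.terminate()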

matanox commented 4 years ago

I'm actually using python-sounddevice (pyaudio seems less actively maintained and less well-documented at the moment). I'm still struggling to make the audio input byte conversion work ...

#!/usr/bin/env python3
"""
Process microphone input stream endlessly.
The streaming of the input by the underlying sounddevice library
might take ~7% of a single i7 processor's utilization, before
any processing of ours.
"""

import sys
import time
import sounddevice as sd
import numpy as np # needed because the stream callback delivers NumPy arrays
import webrtcvad

channels = [1]
# translate channel numbers to be 0-indexed
mapping  = [c - 1 for c in channels]

# get the default audio input device and its sample rate
device_info = sd.query_devices(None, 'input')
sample_rate = int(device_info['default_samplerate'])

interval_size = 30 # audio interval size in ms
downsample = 1

# get an instance of webrtc's voice activity detection
vad = webrtcvad.Vad()

print("reading audio stream from default audio input device:\n" + str(sd.query_devices()) + '\n')
print(F"audio input channels to process: {channels}")
print(F"sample_rate: {sample_rate}")
print(F"frame size:  {interval_size} ms" + '\n')

def voice_activity_detection(audio_data):
    return vad.is_speech(audio_data, sample_rate)

def audio_callback(indata, frames, time, status):
    """This is called (from a separate thread) for each audio block."""
    if status:
        print(status, file=sys.stderr)

    audio_data = indata[::downsample, mapping]
    audio_data = audio_data.astype(np.float16).tobytes()  # from float32 that we get from sounddevice
    detection = voice_activity_detection(audio_data)
    print(f'{detection} \r', end="") # use just one line to show the detection status (speech / not-speech)

with sd.InputStream(
    device=None, # the default input device
    channels=max(channels),
    samplerate=sample_rate,
    blocksize=int(sample_rate * interval_size / 1000),
    callback=audio_callback):

    while True:
        time.sleep(0.02) # keep the main thread alive; the callback does the processing

I'll post my working example once this works. I think the way I convert the audio input for webrtcvad is incorrect, as it always reports that speech is detected even when no speech is present.
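
For reference, the README says the VAD only accepts 16-bit mono PCM audio, so maybe the conversion from the float32 samples that sounddevice delivers needs to look more like this (a sketch I have not yet verified on my setup):

import numpy as np

def float_to_pcm16(samples):
    """Convert float samples in [-1, +1] to 16-bit PCM bytes (little-endian on typical x86)."""
    samples = np.clip(samples, -1.0, 1.0)
    return (samples * 32767).astype(np.int16).tobytes()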

matanox commented 4 years ago

Using that code I would also get an occasional:

python: src/hostapi/alsa/pa_linux_alsa.c:3636: PaAlsaStreamComponent_BeginPolling: Assertion `ret == self->nfds' failed

Running example.py on input wav files works fine, though. This error is probably unrelated, but it might indicate that my audio input via ALSA is unstable.

matanox commented 4 years ago

Well, I suspect the detection is sensitive to signal normalization. I see noticeably different normalization schemes in audio files from different sources, and even larger differences between those files and my microphone input captured with the code posted above. It's hard to see how WebRTC could perform accurately on such different inputs unless it was trained on wildly diverse data ...

E.g. the speech wav files I have used from https://ai.googleblog.com/2017/08/launching-speech-commands-dataset.html and https://nats.gitlab.io/swc/ (two audio corpora intended, roughly speaking, for machine learning) seem to be normalized to [0,1], whereas my microphone input (as received from sounddevice) is normalized to [-1,+1]. Obviously, a VAD model trained on one range should not be expected to work brilliantly on the other.

There are also various kinds of "data noise", such as clipping, that could interfere, though I doubt that is my issue with my software stack and specific audio hardware.

This all assumes that, in the experiments yielding these tentative insights, I am reading the byte streams into floats as intended. I'm working on Linux, and from following the code as far as I did, I believe both the sounddevice stack and the py-webrtcvad stack assume little-endian data.

The various data sources I have tested (the two corpora and my microphone stream) all yield numbers mostly in a reasonable enough range to make suspicions about wrong endianness, byte order, or floating-point formats an unnecessary worry, although there are noticeable NaN values (could those be the result of clipping?).

So it seems that normalizing the raw microphone stream to resemble the raw input from wav files (which py-webrtcvad demonstrates its examples on) might be necessary; I will try that next.

Unless anyone has a quick pointer in a radically different direction for finding the cause of, or a remedy for, the almost-constant prediction of voice activity when none is present ...
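
A quick way to compare the value ranges of the two sources is something like the following (soundfile is just my choice for reading the corpora, and the file path is a placeholder):

import numpy as np
import soundfile as sf

data, rate = sf.read('corpus_sample.wav')   # placeholder path; soundfile returns floats in [-1, +1] by default
print('wav min/max:', np.min(data), np.max(data))

# and inside the sounddevice callback, for the microphone side:
#     print('mic min/max:', indata.min(), indata.max())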

matanox commented 4 years ago

Here's a working version. I solved the ALSA exception with a patched version of PortAudio, and I normalize my input range to match what works on the wav files I tested with. This code relies on sounddevice, not pyaudio, for its input signal acquisition.

I can say it works now, although it feels not much better than a naive energy-level detector: most music genres I tried in a quick test also register as speech activity with this code. I wonder whether that is what other people experience (and also where the pre-trained model at the core of the WebRTC VAD is documented; that would tell a lot about which audio scenarios it was aimed at).
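
One knob I have not explored yet is the aggressiveness mode from the README, which might help with music being flagged as speech:

vad = webrtcvad.Vad(3)   # 0 is least aggressive about filtering out non-speech, 3 is most aggressive
# or, on an existing instance:
vad.set_mode(3)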

So here's my working code for streaming microphone voice activity detection using this library:

#!/usr/bin/env python3
"""
Process microphone input stream endlessly.
The streaming of the input by the underlying sounddevice library
might take ~7% of a single i7 processor's utilization, before
any processing of ours.

With the VAD model running, it's around ~40% of a single CPU core's time.
Luckily we have multi-core machines these days ...
"""

import sys
import time
import sounddevice as sd
import numpy as np # needed because the stream callback delivers NumPy arrays
import webrtcvad

channels = [1]
# translate channel numbers to be 0-indexed
mapping  = [c - 1 for c in channels]

# get the default audio input device and its sample rate
device_info = sd.query_devices(None, 'input')
sample_rate = int(device_info['default_samplerate'])

interval_size = 30 # audio interval size in ms
downsample = 1

block_size = int(sample_rate * interval_size / 1000)

# get an instance of webrtc's voice activity detection
vad = webrtcvad.Vad()

print("reading audio stream from default audio input device:\n" + str(sd.query_devices()) + '\n')
print(F"audio input channels to process: {channels}")
print(F"sample_rate: {sample_rate}")
print(F"window size: {interval_size} ms")
print(F"datums per window: {block_size}")
print()

def voice_activity_detection(audio_data):
    return vad.is_speech(audio_data, sample_rate)

def audio_callback(indata, frames, time, status):
    """This is called (from a separate thread) for each audio block."""
    if status:
        print(F"underlying audio stack warning:{status}", file=sys.stderr)

    assert frames == block_size
    audio_data = indata[::downsample, mapping]          # possibly downsample, in a naive way
    audio_data = (audio_data + 1) / 2                   # normalize from [-1,+1] to [0,1]; you might not need this with other microphones/drivers
    audio_data = audio_data.astype(np.float16).ravel()  # adapt to the expected float type, flattened to a single channel

    # uncomment to debug the audio input, or run sounddevice's mic input visualization for that
    #print(f'{sum(audio_data)} \r', end="")
    #print(f'min: {min(audio_data)}, max: {max(audio_data)}, sum: {sum(audio_data)}')

    audio_data = audio_data.tobytes()
    detection = voice_activity_detection(audio_data)
    print(f'{detection} \r', end="") # use just one line to show the detection status (speech / not-speech)

with sd.InputStream(
    device=None,  # the default input device
    channels=max(channels),
    samplerate=sample_rate,
    blocksize=int(block_size),
    callback=audio_callback):

    # avoid shutting down for endless processing of input stream audio
    while True:
        time.sleep(0.1)  # intermittently wake up

I can put this elsewhere if it's not welcome in this repository, but I would humbly suggest adding it to the examples as a demonstration of using py-webrtcvad for microphone stream processing, especially once other people have verified it on machines beyond mine.

I have verified this code with Ubuntu 18.04, Python 3.7.3, and what I believe (IIRC) to be the most recent versions of the prerequisites implied by this platform and Python version.

Thanks for this awesome wrapper library.