wit-ai / pywit

Python library for Wit.ai

Problem with speech recognition #151

Open ebracci opened 3 years ago

ebracci commented 3 years ago

Hi,

I'm using wit.ai in a simple Python application to transcribe audio (speech to text), and I have encountered the following issue:

When I send audio through the transcription API, the transcription seems to stop at the first "point with no audio" and never covers the rest of the file.

How can I avoid this?

Thank you

ruoyipu commented 3 years ago

Hi @ebracci,

For streaming requests, we accept 10-second chunks; for non-streaming requests, the audio is cut off after 20 seconds. (https://wit.ai/docs/http/20200513/#post__speech_link)

If you'd like us to take a deeper look, please provide examples of your requests.

ebracci commented 3 years ago

Thank you for your answer, and sorry for my bad English; I'll try to explain the problem with an example.

I have to divide a 30-minute audio file into 10-second segments because of the request timeout, and that's fine.

I have encountered the following issue: if a 10-second audio segment contains silence, the request returns the transcription of the segment only up to the silence.

Example: if the audio lasts 10 seconds and there is a pause at the 5-second mark, the request returns the transcript of only the first 5 seconds.

That's my problem. To avoid it I have to remove the "silence chunks" from the audio, but that becomes complicated with longer audio.

This is the sample code that I am using.

import os

from pydub import AudioSegment
from pydub.silence import split_on_silence
from wit import Wit

def recognize_speech_wit_ai(file):
    client = Wit('MYKEY')
    audio = AudioSegment.from_wav(file)
    file_name = os.path.basename(file)
    offset = 10000  # window size in milliseconds (10 s per request)

    # Make sure the working directories exist.
    os.makedirs('temp', exist_ok=True)
    os.makedirs('output', exist_ok=True)

    chunks = split_audio_on_silence(audio)

    # Process each chunk.
    for i, chunk in enumerate(chunks):
        start_time = 0
        # Create a silence chunk that's 0.5 seconds (500 ms) long for padding.
        silence_chunk = AudioSegment.silent(duration=500)

        # Add the padding to the beginning and end of the entire chunk.
        audio_chunk = silence_chunk + chunk + silence_chunk

        # Normalize the entire chunk.
        normalized_chunk = match_target_amplitude(audio_chunk, -20.0)

        # Slice the chunk into 10-second windows and transcribe each one.
        while normalized_chunk.duration_seconds > (start_time / 1000):
            # Works in milliseconds.
            t1 = start_time
            t2 = start_time + offset
            new_audio = normalized_chunk[t1:t2]
            temp_path = 'temp/chunk{0}_{1}.wav'.format(i, t1)
            new_audio.export(temp_path, format="wav")
            with open(temp_path, 'rb') as source:
                resp = client.speech(source, {'Content-Type': 'audio/wav'})
            print('Yay, got Wit.ai response: ' + str(resp['text']))
            with open('output/' + file_name + '.txt', 'a') as out:
                out.write(resp['text'] + '\n')
            start_time += offset

def match_target_amplitude(aChunk, target_dBFS):
    # Normalize the given audio chunk to the target level.
    change_in_dBFS = target_dBFS - aChunk.dBFS
    return aChunk.apply_gain(change_in_dBFS)

def split_audio_on_silence(audio):
    # Split on pauses of at least 1 second that are quieter than -50 dBFS.
    chunks = split_on_silence(
        audio,
        min_silence_len=1000,
        silence_thresh=-50
    )

    return chunks

if __name__ == "__main__":
    recognize_speech_wit_ai('audio/output.wav')
ruoyipu commented 3 years ago

Thank you for the info! Do you also have the wit.ai app ID and a sample wav file you can attach?