wit-ai / wit

Natural Language Interface for apps and devices
https://wit.ai/

POST /speech returns no intent #2666

Closed: jandieg closed this issue 11 months ago

jandieg commented 11 months ago

Question

What is the current behavior? A POST to /speech with a short Ogg file (captured from WhatsApp) saying "send justification" returns: { "entities": {}, "intents": [], "text": "", "traits": {} } The same words, "send justification", sent to /message return the correct intent.

What is the expected behavior? Produce the same intent as the text sent to /message.
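For readers comparing the two calls, here is a minimal sketch of how the /message and /speech requests might be built. It assumes the public host https://api.wit.ai, Bearer-token authorization, and a `q` query parameter for /message. The helper names and the "TOKEN" placeholder are hypothetical, and the requests are only constructed, not sent.

```python
import urllib.parse
import urllib.request

WIT_API = "https://api.wit.ai"  # assumed public Wit.ai API host


def build_message_request(token: str, text: str) -> urllib.request.Request:
    """GET /message: the text utterance goes in the `q` query parameter."""
    query = urllib.parse.urlencode({"q": text})
    return urllib.request.Request(
        f"{WIT_API}/message?{query}",
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )


def build_speech_request(
    token: str, audio: bytes, content_type: str
) -> urllib.request.Request:
    """POST /speech: raw audio bytes in the body, format in Content-Type."""
    return urllib.request.Request(
        f"{WIT_API}/speech",
        data=audio,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": content_type,
        },
        method="POST",
    )


# Hypothetical usage; urllib.request.urlopen(req) would actually send it.
# with open("audio.ogg", "rb") as f:
#     req = build_speech_request("TOKEN", f.read(), "audio/ogg")
```

The only thing the server knows about the audio format in this sketch is the Content-Type header, so that header is worth checking first when /speech returns an empty result.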

If applicable, what is the App ID where you are experiencing this issue? If you do not provide this, we cannot help. 582793603989068

Attached audio sample. audio.zip

nishsinghal20 commented 11 months ago

Hi @jandieg! Thank you for bringing this to our attention. It seems that the audio file you're using has a sample rate of 48 kHz. While we take a deeper look at why our system cannot process this audio, I was able to confirm that it works as expected if you downsample the file to a 16 kHz sample rate. I've attached the downsampled file for your reference: audio_new.zip

To unblock yourself in the meantime, please use a sample rate of 16 kHz.
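One way to downsample could be ffmpeg, sketched below. This assumes ffmpeg is installed and on the PATH. The helper name and file names are hypothetical, and the codec ffmpeg picks for the output depends on the build and the file extension. `-ar` sets the output sample rate and `-y` overwrites an existing output file.

```python
import subprocess


def ffmpeg_resample_cmd(src: str, dst: str, rate_hz: int = 16000) -> list[str]:
    """Build an ffmpeg command line that resamples `src` into `dst`."""
    return ["ffmpeg", "-y", "-i", src, "-ar", str(rate_hz), dst]


def resample(src: str, dst: str, rate_hz: int = 16000) -> None:
    """Run ffmpeg; raises CalledProcessError if the conversion fails."""
    subprocess.run(ffmpeg_resample_cmd(src, dst, rate_hz), check=True)


# Hypothetical usage:
# resample("audio.ogg", "audio_16k.ogg")
```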

jandieg commented 11 months ago

So this is how WhatsApp records audio. Aren't you going to look into this issue? WhatsApp is quite popular and an important point of entry into the conversational space. If so, I am wondering why this issue is being closed.

nishsinghal20 commented 11 months ago

Hey @jandieg! We are definitely looking into this issue on our end. Our system should be able to accept a variety of audio formats. We should have a fix for this soon.

nomiero commented 10 months ago

@jandieg We found that the issue is caused not by the sample rate but by the Ogg file using the Opus codec. Our API currently assumes that Ogg files use only the Vorbis codec. As a short-term workaround, the request should now work if you provide the Content-Type as audio/opus. In the long term, we should be able to infer the metadata from the audio file itself.
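Until the API infers the codec itself, a client could check the Ogg container before uploading and pick the Content-Type accordingly. The sketch below relies on the Ogg page layout (the "OggS" capture pattern, a 27-byte header whose last byte is the segment count, then the segment table) and on the identification packets "OpusHead" and "\x01vorbis". The function names are hypothetical, and falling back to audio/ogg for non-Opus files is an assumption.

```python
def sniff_ogg_codec(data: bytes) -> str:
    """Return "opus", "vorbis", or "unknown" based on the first Ogg page."""
    # An Ogg page begins with "OggS" and a 27-byte header; byte 26 holds
    # the number of entries in the segment table that follows the header.
    if len(data) < 27 or data[:4] != b"OggS":
        return "unknown"
    n_segments = data[26]
    payload = data[27 + n_segments:]
    if payload.startswith(b"OpusHead"):
        return "opus"
    if payload.startswith(b"\x01vorbis"):
        return "vorbis"
    return "unknown"


def content_type_for(data: bytes) -> str:
    """Pick a Content-Type for /speech from the sniffed codec."""
    # audio/opus is the workaround value from the comment above; using
    # audio/ogg for everything else is an assumption of this sketch.
    return "audio/opus" if sniff_ogg_codec(data) == "opus" else "audio/ogg"


# Hypothetical usage:
# with open("audio.ogg", "rb") as f:
#     print(content_type_for(f.read(64)))
```

Only the first few dozen bytes are needed, so the check can run on a small prefix of the file before the upload starts.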

jandieg commented 10 months ago

Thank you. I started using Whisper for the time being. Are there any pros and cons you could highlight for Wit vs. Whisper?