zavocc / JakeyBot

AI-powered multi-model Discord bot for trying Gemini 1.5 and other models from OpenRouter, Anthropic, GPT-4, Mistral, LLaMA, and more, right in Discord! Try it below or host your own
https://discord.gg/cAHKNv2CJT

Interesting VC ideas #4

Open zavocc opened 3 months ago

zavocc commented 3 months ago

Looking at https://guide.pycord.dev/voice/receiving

It appears it's also possible for the bot to receive audio. With this, it's possible to create a back-and-forth voice mode for Gemini models.

The outline of this implementation would be:

  1. Use TTS and STT engines, preferably as fast and cost-effective as possible if cloud-based, ideally natural-sounding, and with minimal latency
  2. Use wavelink as the voice engine to stream the TTS output, in a separate Cog
  3. Handle multiple requests per server if possible

The flow would be:

  1. Initiate through a slash command such as /call, and lock the session to the user who invoked the command
  2. Record the voice conversation, with a timeout
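
A minimal sketch of the two steps above, using only asyncio so it stays independent of the Discord library. `CallSession` and `record_with_timeout` are hypothetical names, and the recording itself is assumed to be exposed as an awaitable; the real recording API from the pycord guide is not modelled here.

```python
import asyncio


class CallSession:
    """Hypothetical helper: remembers which user currently owns the call."""

    def __init__(self):
        self._owner = None

    def try_lock(self, user_id: int) -> bool:
        # Refuse if someone else already started a call.
        if self._owner is not None:
            return False
        self._owner = user_id
        return True

    def unlock(self, user_id: int) -> bool:
        # Only the user who locked the session may release it.
        if self._owner != user_id:
            return False
        self._owner = None
        return True


async def record_with_timeout(recording, timeout: float):
    """Wait for a recording awaitable, or give up after `timeout` seconds."""
    try:
        return await asyncio.wait_for(recording, timeout)
    except asyncio.TimeoutError:
        # wait_for cancels the recording for us; report "nothing recorded".
        return None
```

The /call handler would call `try_lock` first and reply with a "busy" message when it returns False.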

In the callback function:

  1. The recorded voice is sent to a speech-to-text engine such as Whisper, either through the OpenAI API (paid, faster), Azure Speech services (free in most cases, requires Azure dependencies), or Hugging Face Spaces (free, slow). Alternatively, use Gemini's native multimodality
  2. The transcription is then used as the prompt for the model to reason and respond (either GPT or Gemini, with a different system prompt optimized for speech)
  3. Perform checks: if an error occurred in the model, still proceed but speak the error; if the error is in the speech APIs, abort and ping the user
  4. The output is then sent through a dedicated TTS program and recorded
  5. If no errors occurred, stream it
  6. Unlock, so the command is ready to be used by anyone again
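
The callback steps could be wired together roughly as below. This is a sketch under assumptions: `handle_recording`, `SpeechError`, and every callable passed in (`transcribe`, `generate`, `synthesize`, `play`, `notify`, `unlock`) are hypothetical stand-ins for the real STT, model, TTS, and playback wrappers, none of which are defined in this issue.

```python
import asyncio


class SpeechError(Exception):
    """Hypothetical error raised by the STT/TTS wrappers."""


async def handle_recording(audio, *, transcribe, generate, synthesize,
                           play, notify, unlock):
    """Run STT -> model -> TTS -> playback, then always release the lock."""
    try:
        try:
            prompt = await transcribe(audio)           # step 1: speech-to-text
            try:
                reply = await generate(prompt)         # step 2: GPT or Gemini
            except Exception as exc:
                # Step 3: a model error is not fatal, it gets spoken instead.
                reply = f"Sorry, the model returned an error: {exc}"
            speech = await synthesize(reply)           # step 4: text-to-speech
        except SpeechError as exc:
            # Step 3: a speech API error aborts the call and pings the user.
            await notify(f"Voice call aborted: {exc}")
            return False
        await play(speech)                             # step 5: stream it
        return True
    finally:
        unlock()                                       # step 6: always unlock
```

Passing the engines in as callables keeps the error policy in one place, so swapping Whisper for Gemini's native audio input would not touch this function.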

Possible limitations and outcomes:

  1. Possible blocking and high latency if asynchronous tools are not used
  2. This command may be limited to one person at a time globally, rather than per user or per guild, while the flow is being prototyped and until it has been tested
  3. Prone to errors
  4. Chat history/context handling: this would also require duplicating code from the /ask command
  5. Multimodality: a parameter may be added to the slash command, but it would need code copied from the /ask command, adding more lines of code
  6. It cannot be initiated through voice and has to be invoked manually via slash command. This defeats the purpose of a voice mode, but it should be considered a building block for such an implementation
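
For limitation 1, one way to keep a blocking SDK call from stalling the bot is to run it in a worker thread with `asyncio.to_thread` (Python 3.9+). In this sketch `blocking_transcribe` is a hypothetical stand-in for a synchronous STT client, with `time.sleep` simulating the network round trip.

```python
import asyncio
import time


def blocking_transcribe(audio: bytes) -> str:
    """Hypothetical synchronous STT call; sleep simulates the request."""
    time.sleep(0.2)
    return f"{len(audio)} bytes transcribed"


async def transcribe(audio: bytes) -> str:
    # Run the blocking call in a thread so the event loop (and therefore
    # the rest of the bot) keeps handling other events while it waits.
    return await asyncio.to_thread(blocking_transcribe, audio)
```

A natively asynchronous client would avoid the thread entirely; this is only a fallback for libraries that do not offer one.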

Can these be resolved? Yes, with approximately an 80% success rate.


Goals:

zavocc commented 3 months ago

OpenAI TTS supports streaming: https://platform.openai.com/docs/guides/text-to-speech/quickstart

zavocc commented 3 months ago

Implement GuildVoiceMgmt class
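
Only the class name is given here, so the following is a guess at what it might look like: one active call per guild, tracked in a dict and released automatically when the call ends. The `session` and `owner` methods and their behavior are assumptions, not an existing API.

```python
import asyncio
from contextlib import asynccontextmanager


class GuildVoiceMgmt:
    """Hypothetical sketch: allows one active voice call per guild."""

    def __init__(self):
        self._owners = {}  # guild_id -> user_id of the active caller

    def owner(self, guild_id: int):
        """Return the user holding the call in this guild, or None."""
        return self._owners.get(guild_id)

    @asynccontextmanager
    async def session(self, guild_id: int, user_id: int):
        # No await between the check and the assignment, so two commands
        # cannot both claim the same guild on a single event loop.
        if guild_id in self._owners:
            raise RuntimeError("A call is already active in this guild")
        self._owners[guild_id] = user_id
        try:
            yield
        finally:
            # Release the guild even if the call body raised.
            self._owners.pop(guild_id, None)
```

Usage inside the command would be `async with mgmt.session(guild_id, user_id): ...`, with the `RuntimeError` turned into a "busy" reply.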

zavocc commented 1 month ago

Realtime API