Interesting VC ideas - GitHub Issues

zavocc commented 3 months ago

It appears it's also possible for the bot to receive audio. With this, a back-and-forth voice mode for Gemini models becomes possible.

The outline of this implementation would be

Use TTS and STT engines, preferably as fast and cost-effective as possible if cloud-hosted, ideally natural-sounding, and with as little latency as possible
Use wavelink as the voice engine to stream the TTS output, in a separate Cog
Handle multiple requests per server, if possible

The flow would be

Initiate, possibly through a slash command like /call, and lock the session to the user who issued the command
Record the voice conversation, with a timeout

In the callback function

The recorded voice is then sent to a speech-to-text engine such as Whisper, via the OpenAI API (paid, faster), Azure Speech services (free in most cases, requires Azure dependencies), or Hugging Face Spaces (free, slow). Alternatively, use Gemini's native multimodality
The transcription is then used as the prompt for the model to reason and engage (either GPT or Gemini, with a different system prompt optimized for speech)
Perform checks: if the model returns an error, still proceed but speak the error; if a speech API returns an error, abort and ping the user
The output is then sent through a dedicated TTS program and recorded
If no errors occurred, stream it
Unlock, so the command is ready to be used by anyone again
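
The callback steps above could be sketched roughly like this. It's only an illustration of the error policy (model errors get spoken, speech API errors abort and ping the user). The names SpeechAPIError, ModelError, CallResult and on_recording_finished are made up for this sketch, and the three engine callables stand in for whichever STT, model and TTS wrappers end up being used.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, Optional


class SpeechAPIError(Exception):
    """Hypothetical error raised by the STT or TTS wrapper."""


class ModelError(Exception):
    """Hypothetical error raised by the text model wrapper."""


@dataclass
class CallResult:
    audio: Optional[bytes]  # synthesized speech to stream, None if aborted
    ping_user: bool         # True when the user should be pinged in text
    detail: str             # the spoken reply, or a description of the failure


async def on_recording_finished(
    recording: bytes,
    transcribe: Callable[[bytes], Awaitable[str]],
    generate: Callable[[str], Awaitable[str]],
    synthesize: Callable[[str], Awaitable[bytes]],
) -> CallResult:
    # Step 1: speech-to-text. A speech API failure aborts the call.
    try:
        prompt = await transcribe(recording)
    except SpeechAPIError as exc:
        return CallResult(None, True, f"Speech-to-text failed: {exc}")

    # Step 2: model reply. A model failure does not abort; it is spoken.
    try:
        reply = await generate(prompt)
    except ModelError as exc:
        reply = f"Sorry, the model returned an error: {exc}"

    # Step 3: text-to-speech. Again, a speech API failure aborts.
    try:
        audio = await synthesize(reply)
    except SpeechAPIError as exc:
        return CallResult(None, True, f"Text-to-speech failed: {exc}")

    # No speech errors: the caller can stream the audio, then unlock.
    return CallResult(audio, False, reply)
```

The caller would stream result.audio when it is not None, ping the user otherwise, and release the session lock in both cases.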

Possible limitations and outcomes:

Possible blocking and high latency if asynchronous tools are not used
This command may be limited to one person at a time globally, rather than per user or per guild; this is a prototype of how the flow works, for now, until it has been tested
Prone to errors
Chat history/context handling: this would also require duplicating code from the /ask command
Multimodality: a parameter may be added to the slash command, but it would mean copying code from the /ask command, with more lines of code
It cannot be initiated by voice and has to be invoked manually via a slash command. This defeats the purpose of a voice mode, but it should be considered a building block for such an implementation

Can be resolved: yes, with approximately an 80% success rate
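
On the blocking/latency point: if an STT or TTS client only offers a synchronous call, one way around it is to push that call onto a worker thread with asyncio.to_thread (Python 3.9+) so the event loop keeps running. A minimal sketch; blocking_transcribe is a stand-in that just sleeps, not a real engine, and heartbeat only exists to show the loop staying responsive.

```python
import asyncio
import time
from typing import List, Tuple


def blocking_transcribe(audio: bytes) -> str:
    # Stand-in for a synchronous STT client call (assumption, not a real API).
    time.sleep(0.2)
    return f"{len(audio)} bytes transcribed"


async def heartbeat(ticks: List[float]) -> None:
    # Represents other bot work that must keep running during transcription.
    for _ in range(3):
        await asyncio.sleep(0.05)
        ticks.append(time.monotonic())


async def main() -> Tuple[str, int]:
    ticks: List[float] = []
    # The blocking call runs in a worker thread; heartbeat runs on the loop.
    text, _ = await asyncio.gather(
        asyncio.to_thread(blocking_transcribe, b"\x00" * 4),
        heartbeat(ticks),
    )
    return text, len(ticks)


if __name__ == "__main__":
    print(asyncio.run(main()))
```

Calling blocking_transcribe directly inside a coroutine would instead stall every other command for the duration of the call.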

Goals:

zavocc commented 3 months ago

Implement GuildVoiceMgmt class
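
Nothing above defines what GuildVoiceMgmt should contain, so the following is only one possible shape, assuming its job is to track which user holds the voice session in each guild (the per-guild lock from the flow). The method names acquire, release and owner are hypothetical.

```python
from typing import Dict, Optional


class GuildVoiceMgmt:
    """Tracks which user currently owns the voice session in each guild."""

    def __init__(self) -> None:
        self._owners: Dict[int, int] = {}  # guild_id -> user_id

    def acquire(self, guild_id: int, user_id: int) -> bool:
        """Lock the guild's session to user_id; False if already taken."""
        if guild_id in self._owners:
            return False
        self._owners[guild_id] = user_id
        return True

    def release(self, guild_id: int, user_id: int) -> bool:
        """Unlock the guild's session; only the current owner may release."""
        if self._owners.get(guild_id) != user_id:
            return False
        del self._owners[guild_id]
        return True

    def owner(self, guild_id: int) -> Optional[int]:
        """Return the user holding the session, or None if it is free."""
        return self._owners.get(guild_id)
```

Because acquire contains no await, two slash commands handled on the same event loop cannot both take the same guild's session. Keying by guild also lifts the one-person-at-a-time-globally limitation, since each server gets its own lock.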

zavocc commented 1 month ago

Realtime API

zavocc / JakeyBot