xNul / chat-llama-discord-bot

A Discord Bot for chatting with LLaMA, Vicuna, Alpaca, MPT, or any other Large Language Model (LLM) supported by text-generation-webui or llama.cpp.
https://discord.gg/TcRGDV754Y

TPS loss under bot #8

Open sfxworks opened 1 year ago

sfxworks commented 1 year ago

Running the UI with python3 server.py --wbits 4 --model ozcur_alpaca-native-4bit --verbose --listen --gpu-memory 5 --groupsize 128, I get about 20 tokens per second:

Output generated in 9.68 seconds (20.56 tokens/s, 199 tokens, context 42)

Under the bot with the same flags, I get only about 2:

python3 bot.py --wbits 4 --model ozcur_alpaca-native-4bit --verbose --listen --gpu-memory 5 --groupsize 128

Output generated in 10.30 seconds (2.04 tokens/s, 21 tokens, context 170)

Could the context be the issue? Adding more to the context in the UI decreased it to about 17 tokens per second:

Output generated in 7.73 seconds (17.20 tokens/s, 133 tokens, context 70)

Can the input for this bot be optimized?

xNul commented 1 year ago

Thanks, I've been able to reproduce on my end.

I made everything deterministic to see if there was a parameter I was missing, but with the same parameters, context, and sequence of inputs, I produced the exact same response on both the webui and the bot. The only difference was speed: the webui generated at 5.98 tokens/s while the bot generated at 2.30 tokens/s. Let me see where this thread goes.
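To illustrate what "making everything deterministic" amounts to in general (a generic sketch using Hugging Face transformers and a stand-in model, not this project's code): disable sampling, or at least fix the seed, so the same prompt and settings always produce the same tokens on both sides.

```python
# Minimal, hypothetical sketch of deterministic generation for an
# apples-to-apples comparison; "gpt2" is a stand-in model, not the
# alpaca-native-4bit checkpoint from this issue.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

torch.manual_seed(0)  # fixed seed in case any sampling is enabled

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Hello, my name is", return_tensors="pt")
output = model.generate(
    **inputs,
    max_new_tokens=32,
    do_sample=False,  # greedy decoding: same prompt -> same tokens every run
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

With identical outputs guaranteed, any remaining difference between the two runs has to come from how the surrounding code drives generation, not from the model settings.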

xNul commented 1 year ago

I removed all the async code, the Discord bot code, and everything else not needed to run the prompt, and called the API directly. Now I'm getting 6.15 tokens/s, so I guess it has something to do with the async code.

xNul commented 1 year ago

I found the issue. The line of code that streams the generated text to Discord is blocking token generation and slowing it down. Since Python async is concurrent but runs on a single thread, throwing the Message.edit calls onto the event loop won't help. The only option is to run Message.edit in a separate process, which means doing some work with multiprocessing or a message queue. I'm looking into the different options.
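For context, the problematic pattern looks roughly like this (a hypothetical reconstruction, not the bot's actual source; generate_stream is a placeholder for whatever yields new text from the model):

```python
# Hypothetical reconstruction of the streaming loop described above.
# generate_stream() is a placeholder, not a real function in this repo.
async def stream_reply(message, prompt):
    partial = ""
    for new_text in generate_stream(prompt):  # synchronous token generation
        partial += new_text
        # Awaiting the edit here ties every chunk of generation to a Discord
        # HTTP round-trip (plus rate limits), which is what drags throughput
        # from ~20 tokens/s in the webui down to ~2 tokens/s in the bot.
        await message.edit(content=partial)
```

Because the generator itself is synchronous and asyncio's event loop is single-threaded, wrapping the edit in its own task doesn't actually free the loop, which is why a separate process is considered next. Removing (or heavily throttling) the edit call is the quick workaround mentioned at the end of the thread.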

xNul commented 1 year ago

Since the Client object in discord.py can't be serialized, it can't be moved to another process and used to make edits there. This means that in order to keep both performance and response streaming, I'll need to move all of the LLM logic into another process. From that process, I can send partial results over IPC back to the Discord process, which then makes the Message.edit calls for streaming.

Oh boy, I didn't realize this was going to be such a headache. I'm working on something else at the moment, so I'm going to put this on the back burner for a week or two. If you'd rather have performance than streaming, just remove the line I mentioned above and you'll get the same performance as the webui.
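A rough sketch of that process-split approach (my own hedged interpretation of the plan above, not the bot's implemented fix; llm_worker and generate_stream are placeholders): the LLM runs in a child process and pushes partial text through a multiprocessing.Queue, while the Discord process drains the queue and performs the Message.edit calls, so only plain strings ever cross the process boundary.

```python
# Hypothetical sketch of the process-split approach described above.
# llm_worker() and generate_stream() are placeholders, not the bot's real code.
import asyncio
import multiprocessing as mp

def llm_worker(prompt, queue):
    """Child process: run the model and stream partial output as plain strings."""
    partial = ""
    for new_text in generate_stream(prompt):  # placeholder for the LLM call
        partial += new_text
        queue.put(partial)
    queue.put(None)  # sentinel: generation finished

async def stream_reply(message, prompt):
    """Discord process: drain the queue and edit the message as text arrives."""
    queue = mp.Queue()
    proc = mp.Process(target=llm_worker, args=(prompt, queue), daemon=True)
    proc.start()
    loop = asyncio.get_running_loop()
    while True:
        # queue.get() blocks, so run it in a thread instead of on the event loop.
        partial = await loop.run_in_executor(None, queue.get)
        if partial is None:
            break
        await message.edit(content=partial)  # generation keeps running meanwhile
    proc.join()
```

In practice you would also want to coalesce or throttle the edits, since Discord rate-limits message edits, but the key point is that token generation never waits on a Discord round-trip.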

sfxworks commented 1 year ago

I appreciate your diligence in looking into this issue!