xlinx / sd-webui-decadetw-auto-prompt-llm


Unload the LLM from VRAM after each call? #3


Pdonor commented 1 month ago

Hi! With the new version of Forge and FLUX, this extension could be really practical for the millions of low-VRAM laptops that can now run FLUX. The only problem is that it doesn't unload the LLM from VRAM when using Ollama, so generation is far too slow.

According to https://github.com/ollama/ollama/issues/1600, that can be accomplished with:

```
curl http://localhost:11434/api/generate -d '{"model": "llama2", "keep_alive": 0}'
```

Can that be put in your code?
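For reference, a minimal sketch of what that call might look like from the extension's Python side, assuming the `requests` library and the default Ollama endpoint (`unload_model` is just an illustrative name, not anything already in the extension):

```python
import requests

OLLAMA_GENERATE_URL = "http://localhost:11434/api/generate"  # default Ollama endpoint

def unload_model(model: str = "llama2") -> None:
    # A generate request with no prompt and keep_alive=0 asks the Ollama
    # server to drop the model from VRAM as soon as the request completes.
    requests.post(OLLAMA_GENERATE_URL, json={"model": model, "keep_alive": 0}, timeout=30)
```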

Also, could it store a separate system prompt and Ollama settings? I found that giving it an example in the system prompt works well.
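If it helps, Ollama's `/api/generate` accepts a per-request `system` field, so a custom system prompt (including a worked example) could be sent with each call. A hedged sketch only; the `prompt_llm` name and the example prompt text are illustrative, not part of the extension:

```python
import requests

def prompt_llm(user_prompt: str, system_prompt: str, model: str = "llama2") -> str:
    # Non-streaming one-shot call; "system" overrides the model's default system prompt.
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": model,
            "prompt": user_prompt,
            "system": system_prompt,
            "stream": False,
        },
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["response"]

# Example system prompt that carries a worked example, as suggested above:
SYSTEM_PROMPT = (
    "You expand short ideas into detailed Stable Diffusion prompts. "
    "Example: 'a cat' -> 'a fluffy tabby cat lounging on a sunlit windowsill, "
    "soft bokeh, 85mm photo'."
)
```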

Basically, it seems you are a few lines of code away from the best 'magic prompt' software in the world, surpassing the ones in DALL·E 3 and Ideogram, which are censored. Thank you!

kmdtukl commented 1 month ago

Add the environment variable `OLLAMA_KEEP_ALIVE=0` to the environment the Ollama server runs in; models are then unloaded immediately after each request.

xlinx commented 4 weeks ago

Okay, let me try adding the unload. Are these the actions you want:

  1. Activate [generate forever]
  2. Call the LLM
  3. LLM answers
  4. Unload the LLM to save VRAM, via `curl http://localhost:11434/api/generate -d '{"model": "llama2", "keep_alive": 0}'` (see the sketch below)
  5. sd-webui runs the generation
  6. SD finishes
  7. Go back to step 1

(I use a 4060 Ti with 16 GB VRAM, so keeping a 7B LLM loaded alongside SDXL is fine for me.)
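Wiring those steps together might look roughly like this. This is a sketch only: it reuses the `prompt_llm` / `unload_model` helpers sketched earlier in this thread, and `run_sd_generation` is a stand-in for the actual sd-webui pipeline call, not a real API.

```python
def run_sd_generation(prompt: str) -> None:
    # Placeholder for handing the LLM answer to sd-webui (step 5).
    print(f"[sd-webui] generating image for: {prompt}")

def generate_forever(user_prompt: str, system_prompt: str, model: str = "llama2") -> None:
    while True:                                                     # step 1: generate forever
        llm_answer = prompt_llm(user_prompt, system_prompt, model)  # steps 2-3: call LLM, get answer
        unload_model(model)                                         # step 4: free VRAM before SD runs
        run_sd_generation(llm_answer)                               # steps 5-6: sd-webui works, finishes
        # step 7: loop back to step 1
```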

[screenshot: 2024-08-17 040151]

BTW, if you run the web UI with [generate forever] active, you might also consider another extension that can send your LLM + SD results to an IM app, so you can review them like a comic book on your phone. It's fun: https://github.com/xlinx/sd-webui-decadetw-auto-messaging-realtime