yangkevin2 / doc-story-generation

More local model options #2

Open bennmann opened 1 year ago

bennmann commented 1 year ago

Hello,

Which lines might one change to use a local model's model.generate on the same host?

I have a 16GB VRAM gaming GPU and have run local inference on bloomz-7B, RWKV 14B, Pythia 12B.

I want to be able to change just a few lines to generate from a local model instead of hosting an Alpa-served version.

Thanks for your thoughts and consideration.

yangkevin2 commented 1 year ago

Hi,

The util function here https://github.com/yangkevin2/doc-story-generation/blob/main/story_generation/common/util.py#L927 interfaces with Alpa to get next-token logprobs. You could try changing that to use your local model instead. Just be aware that the quality of the generated text might be a lot worse with a much smaller model.
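For reference, a minimal sketch of what that swap might look like with a local HuggingFace model (the function name and model choice here are illustrative, not the repo's actual interface):

```python
# Illustrative sketch only: next-token logprobs from a local HF causal LM
# in place of the Alpa-served endpoint. Names and model are hypothetical.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "EleutherAI/pythia-12b"  # any local causal LM
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    torch_dtype=torch.float16,
    device_map="auto",  # requires accelerate; or load_in_8bit=True with bitsandbytes
)

@torch.no_grad()
def local_next_token_logprobs(prompt: str) -> torch.Tensor:
    """Return log-probabilities over the vocabulary for the next token."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    logits = model(**inputs).logits  # shape (1, seq_len, vocab_size)
    return torch.log_softmax(logits[0, -1], dim=-1)
```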

Thanks, Kevin

bennmann commented 1 year ago

Thank you! I created a branch and am now navigating my own personal dependency purgatory (AMD GPU, ROCm, accelerate, bitsandbytes 8-bit, etc.). I will test these changes on the branch I just split off, using the model https://huggingface.co/OpenAssistant/oasst-sft-1-pythia-12b

Thanks for your guidance; I will remember you if it makes cool stories.

bennmann commented 1 year ago

Hi Kevin and anyone else,

There are also a lot of OpenAI API calls in various functions. Maybe I can find them all and change them, but this has turned into more of a weekend project than an afternoon project, so replacing every OpenAI call with a local model.generate equivalent will take a while.

I may return to this a little at a time; my best guess is a few weeks to completion, as I've kept digging deeper over time. Anyone else should feel free to look at my branch and suggest model.generate equivalents for the OpenAI calls.
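For anyone who wants to take a crack at it, here's a rough sketch of the kind of drop-in wrapper I have in mind; the names are hypothetical and I haven't wired it into the repo's actual call sites yet:

```python
# Hypothetical wrapper: mimic a text-completion call with a local model's
# generate(), so OpenAI completion call sites can be pointed at it instead.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "OpenAssistant/oasst-sft-1-pythia-12b"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, device_map="auto"
)

@torch.no_grad()
def local_complete(prompt: str, max_tokens: int = 256, temperature: float = 0.8) -> str:
    """Return only the continuation, like the OpenAI completion endpoint does."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(
        **inputs,
        max_new_tokens=max_tokens,
        do_sample=temperature > 0,
        temperature=temperature,
        pad_token_id=tokenizer.eos_token_id,
    )
    new_tokens = output_ids[0, inputs["input_ids"].shape[1]:]  # drop the prompt tokens
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```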

-Ben

yangkevin2 commented 1 year ago

Oh yeah, if you don't want to use the GPT-3 API at all, you'll have to replace all of those. Sorry, I thought you meant just the Alpa stuff.

As an additional note, using local models on a 16GB GPU will also pretty seriously compromise the quality of the resulting outputs, especially the plan/outline generation; I'm not convinced that part would work at all without an instruction-tuned model (specifically text-davinci-002, since it supports suffix context in addition to a prompt). And in our preliminary experiments using "smaller" 13B models for the main generation procedure, the story quality was quite a bit worse too.
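For concreteness, the suffix-context usage looks roughly like this with the legacy openai-python 0.x completions API (the prompt/suffix text here is made up); a local model would need some fill-in-the-middle-style equivalent to reproduce it:

```python
# Illustrative only: suffix (insertion-mode) completion with text-davinci-002
# via the legacy openai-python 0.x client. The prompt/suffix text is made up.
import openai

response = openai.Completion.create(
    model="text-davinci-002",
    prompt="Chapter 3 outline:\n- The heist goes wrong.\n",
    suffix="\nChapter 5 outline:\n- The crew regroups at the safehouse.\n",
    max_tokens=128,
    temperature=0.7,
)
print(response["choices"][0]["text"])
```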

bennmann commented 1 year ago

I have great hope of producing about one good generation out of fewer than 20 attempts with today's models. I agree that quality in general will require more cherry-picking of outputs (or reprompting?).
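To make the cherry-picking concrete, something like this best-of-N loop is what I'm picturing; the scoring heuristic is purely a placeholder, and generate_fn would be whatever local generation wrapper you end up with:

```python
# Hypothetical best-of-N cherry-picking loop; the scorer is only a placeholder.
from typing import Callable

def best_of_n(prompt: str, generate_fn: Callable[[str], str], n: int = 20) -> str:
    """Sample n continuations and keep the one the heuristic scores highest."""
    def score(text: str) -> float:
        # Placeholder heuristic: favor longer, less repetitive continuations.
        words = text.split()
        return (len(set(words)) / max(len(words), 1)) * len(words) ** 0.5

    candidates = [generate_fn(prompt) for _ in range(n)]
    return max(candidates, key=score)
```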

With the improvements coming to smaller models (such as LLaMA 13B competing with 175B GPT-3), getting a fully functional single-GPU storyteller working before new models come out seems worthwhile to me.

The more I consider the work's potential, the happier I am with the concepts in your paper and code. And it's open source!

yangkevin2 commented 1 year ago

Yeah, if you're willing to do a bit of manual cherry-picking / interaction, then the requirements on model quality definitely go down significantly. I haven't tested with the new LLaMA models, but I agree it's likely they'd work better than the ones we tried previously (e.g., GPT-13B non-instruction-tuned). Would be curious to hear how it goes if you do end up trying that out.

Glad you enjoyed the work!