young-geng / EasyLM

Large language models (LLMs) made easy: EasyLM is a one-stop solution for pre-training, fine-tuning, evaluating, and serving LLMs in JAX/Flax.
Apache License 2.0

Need help regarding Data Format for Fine Tuning #63

Closed RahulSundkar closed 1 year ago

RahulSundkar commented 1 year ago

Hi, I'm a college student and I want to use the Koala model for a project. My group plans to build a mental-health therapist chatbot by fine-tuning the Koala model on a dataset we have prepared ourselves. The fine-tuning will be done with PEFT and LoRA (as we do not have access to powerful GPUs) on Google Colab using the Hugging Face trainer.

The dataset consists of 1309 full-length conversations between an AI therapist and a user. Each conversation is a single string in which the user asks questions about a problem they are facing and the AI answers accordingly. An example of one conversation is given below:

[Screenshot of an example conversation.] This is one entry in a CSV file (one big string).

My question is: how should I prepare this data for fine-tuning the model? What data format is required? Should the user and AI utterances be handled in any specific way? And are there any important tokens that should be included?

We have tried training the model on our data, but we do not know how to feed it into the model for fine-tuning. In one attempt we trained it by directly feeding in those large strings, but while running the model it generates a whole conversation for one input.

I have some questions about the training part too. Can this model be fine-tuned using the Hugging Face trainer? Any help with the training parameters would also be very welcome.

Thank you in advance for your time!

Any inputs or advice for our project would be greatly appreciated.

float-trip commented 1 year ago

Hugging Face questions are probably a bit out of scope for this repo, but you might be interested in this project for straightforward LoRA fine-tuning: https://github.com/OpenAccess-AI-Collective/axolotl.

Alternatively, you could apply for the TPU Research Cloud and do a full finetune using EasyLM for free.

> But while running the model it generates a whole conversation for one input.

That's expected. You can add <USER> and <AI> as custom tokens during training, and then use them as stopping tokens at inference time.
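As a rough sketch of that formatting step (the `format_conversation` helper and the exact `<USER>`/`<AI>` marker strings are illustrative, not part of EasyLM or the Koala training code):

```python
# Hypothetical helper: turn a list of (speaker, text) turns into one
# training string, using <USER> and <AI> as custom marker tokens and
# ending the conversation with an EOS string so the model learns to stop.
def format_conversation(turns, eos_token="</s>"):
    parts = []
    for speaker, text in turns:
        marker = "<USER>" if speaker == "user" else "<AI>"
        parts.append(f"{marker} {text.strip()}")
    return " ".join(parts) + eos_token

example = format_conversation([
    ("user", "I have been feeling anxious lately."),
    ("ai", "I'm sorry to hear that. Can you tell me more?"),
])
print(example)
# <USER> I have been feeling anxious lately. <AI> I'm sorry to hear that. Can you tell me more?</s>
```

If you go the Hugging Face route, you would also register the markers as special tokens on the tokenizer (so they map to single token IDs) and resize the model's embeddings accordingly; at inference, generation can then be cut off when the `<USER>` token is produced.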

young-geng commented 1 year ago

As float-trip mentioned, you should generally separate the user queries and model responses at the token level, and only apply the loss to the model-generated tokens. To control generation, you will need to prepend each user and model part with its corresponding special marker, and append an EOS token to the model part. You'll also need to ensure the loss is applied on that EOS token so the model learns to stop generating when the response is complete.
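The loss-masking idea above can be sketched like this (a toy example, not EasyLM's actual preprocessing; the `-100` ignore index follows the Hugging Face / PyTorch `CrossEntropyLoss` convention, and the token IDs are made up):

```python
IGNORE_INDEX = -100  # positions with this label are excluded from the loss

def build_labels(segments):
    """segments: list of (token_ids, is_model_output) pairs for one conversation.

    Returns (input_ids, labels) where user tokens are masked out, so the
    loss is only applied to model-generated tokens (including the EOS)."""
    input_ids, labels = [], []
    for token_ids, is_model_output in segments:
        input_ids.extend(token_ids)
        labels.extend(token_ids if is_model_output else [IGNORE_INDEX] * len(token_ids))
    return input_ids, labels

# Toy example: user turn [5, 6, 7], model turn [8, 9] plus EOS id 2.
input_ids, labels = build_labels([([5, 6, 7], False), ([8, 9, 2], True)])
print(input_ids)  # [5, 6, 7, 8, 9, 2]
print(labels)     # [-100, -100, -100, 8, 9, 2]
```

Note that the EOS token ID (2 here) is kept in the labels, so the model is explicitly trained to emit it at the end of its response.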

Regarding the Hugging Face trainer, you might want to ask for more help in the Hugging Face transformers repo.