young-geng / EasyLM

Large language models (LLMs) made easy: EasyLM is a one-stop solution for pre-training, finetuning, evaluating, and serving LLMs in JAX/Flax.
Apache License 2.0

Training OPT with Koala dataset #34

Closed Linohong closed 1 year ago

Linohong commented 1 year ago

Hi, thank you for making such nice work publicly available.

I have two issues I want to raise.

No. 1: in the code for processing all the datasets, https://github.com/young-geng/koala_data_pipeline, I'm afraid some input datasets are missing. For example, line 14 of process_chat_data.py reads:

input_file='/nfs/vault/data/language/chat_data_v3.json'

The above file must exist for the script to run without an error. Where can I get all the input datasets that are listed in the processing Python files?

No. 2: I've tried to look for documentation on using the EasyLM library to fine-tune the OPT model on the Koala dataset, but there is only documentation for fine-tuning the LLaMA model. Is there any documentation on finetuning, for example, OPT-6.7B on the Koala dataset?

Again, thank you so much for this amazing work!

young-geng commented 1 year ago

Regarding the chat dataset from the Koala data pipeline: that part is scraped from ShareGPT. Since we do not own the copyright of that data, we cannot release it. As for OPT, we haven't tried finetuning OPT, since LLaMA is a strictly better model with fewer parameters.

Linohong commented 1 year ago

Thank you very much, young-geng, for the reply :) One last thing: where can I find the code that actually creates a string from the fields, such as '[marker_user+human_0+marker_gpt],gpt_1,<|eos|>'?

Linohong commented 1 year ago

Oh, I think I've found it.

I think that's in the TextProcessor class, at line 47 of https://github.com/young-geng/EasyLM/blob/efcea8a1696dfff34964a09b38a08d0afab18173/EasyLM/data.py#L47
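For anyone else looking: here is a minimal, hypothetical sketch of how such a comma-separated field spec could be interpreted. It is a simplification for illustration, not EasyLM's actual TextProcessor code; the function name render_fields and the exact semantics (bracketed segments excluded from the loss, '+' concatenating fields of the example dict, '<|eos|>' inserting the EOS token) are assumptions based on the spec string above.

```python
# Hypothetical, simplified re-implementation of the field-spec idea behind
# EasyLM's TextProcessor (see EasyLM/data.py). Names and details are
# illustrative, not the library's actual code.

def render_fields(fields_spec, example, eos_token="</s>"):
    """Turn a spec like '[marker_user+human_0+marker_gpt],gpt_1,<|eos|>'
    and an example dict into (text, loss_masks)."""
    texts, masks = [], []
    for segment in fields_spec.split(","):
        segment = segment.strip()
        if segment.startswith("[") and segment.endswith("]"):
            # Bracketed segments are assumed to be excluded from the loss.
            mask, segment = 0.0, segment[1:-1]
        else:
            mask = 1.0
        if segment == "<|eos|>":
            piece = eos_token
        else:
            # '+' is assumed to concatenate several fields of the example.
            piece = "".join(example[name] for name in segment.split("+"))
        texts.append(piece)
        masks.append(mask)
    return "".join(texts), masks


example = {
    "marker_user": "USER: ",
    "human_0": "Hello!",
    "marker_gpt": " GPT: ",
    "gpt_1": "Hi, how can I help?",
}
text, loss_masks = render_fields(
    "[marker_user+human_0+marker_gpt],gpt_1,<|eos|>", example
)
print(text)        # USER: Hello! GPT: Hi, how can I help?</s>
print(loss_masks)  # [0.0, 1.0, 1.0]
```

So the prompt markers and the human turn would contribute no loss, while the model's reply and the EOS token would be trained on. The authoritative behavior is in the linked data.py.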

Again, thank you so much for this awesome work!