yingweima2022 / CodeLLM


Code instruct tuning dataset #4

Open zhentingqi opened 2 weeks ago

zhentingqi commented 2 weeks ago

Hi! You mentioned in the paper that "The second part comes from the open-source data CodeAlpaca (Chaudhary, 2023) and our build dataset, with 150K instructions." Could you elaborate on how you obtained the 150K instructions? Are you planning to open-source these data? Thanks!

yingweima2022 commented 1 week ago

Thank you for your interest in our work. We constructed the 150K instruction dataset by leveraging several sources. Firstly, we utilized the training sets from CosQA [1] and MBPP [2] to create and supplement our code instruction data; both of these datasets are open-source. Additionally, we incorporated a cleaned code instruction set provided by the PanGu Alpha team, which is not publicly available at this time.

[1] CosQA Dataset
[2] MBPP Dataset
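As a rough illustration of the kind of conversion described above, the snippet below sketches how a record from an MBPP-style training set (a natural-language task description plus a reference solution) could be mapped into an instruction/output pair. The field names (`text`, `code`) and the helper `to_instruction` are assumptions for illustration, not the authors' actual pipeline.

```python
def to_instruction(record):
    """Map one MBPP-style record (task description + solution code)
    into an instruction-tuning pair. Field names are hypothetical."""
    return {
        "instruction": record["text"],   # natural-language task description
        "output": record["code"],        # reference solution
    }

# Example record in the assumed MBPP-style format
sample = {
    "text": "Write a function to add two numbers.",
    "code": "def add(a, b):\n    return a + b",
}

pair = to_instruction(sample)
print(pair["instruction"])
```

In practice one would iterate this over the full training split and deduplicate or filter the results before mixing them with other instruction sources.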

For certain reasons, we do not plan to open-source our dataset at this time, but there are many open-source instruction datasets in the community that may be useful to you:

- Magicoder-Evol-Instruct-110K
- Evol-CodeAlpaca-v1
- Code-Feedback

We hope this information is helpful for your research.