zhentingqi opened this issue 2 weeks ago

Hi! You mentioned in the paper that "The second part comes from the open-source data CodeAlpaca (Chaudhary, 2023) and our build dataset, with 150K instructions." Could you elaborate on how you got the 150K instructions? Are you planning to open-source these data? Thanks!
Thank you for your interest in our work. We constructed the 150K instruction dataset from several sources. First, we used the training sets from CoSQA [1] and MBPP [2] to create and supplement our code instruction data; both datasets are open-source. In addition, we incorporated a cleaned code instruction set provided by the PanGu Alpha team, which is not publicly available at this time.
[1] CoSQA dataset (Huang et al., 2021)
[2] MBPP dataset (Austin et al., 2021)
For various reasons, we do not plan to open-source our dataset at this time, but there are many open-source instruction datasets in the community that may be useful to you:
- Magicoder-Evol-Instruct-110K
- Evol-CodeAlpaca-v1
- Code-Feedback
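For reference, here is a minimal sketch of how such datasets are typically pulled with the Hugging Face `datasets` library. The Hub repository IDs below are assumptions on our part (not confirmed dataset names); please verify the exact IDs on the Hub before using them:

```python
# Minimal sketch: load the suggested open-source instruction datasets from
# the Hugging Face Hub. The repository IDs are assumptions -- check the Hub
# for the exact names and licenses before relying on them.
from datasets import load_dataset

# Assumed Hub IDs for the three datasets suggested above.
CANDIDATE_DATASETS = [
    "ise-uiuc/Magicoder-Evol-Instruct-110K",  # Magicoder-Evol-Instruct-110K
    "theblackcat102/evol-codealpaca-v1",      # Evol-CodeAlpaca-v1
    "m-a-p/Code-Feedback",                    # Code-Feedback
]

for repo_id in CANDIDATE_DATASETS:
    ds = load_dataset(repo_id, split="train")
    # Print basic statistics so you can compare size and schema across sources.
    print(f"{repo_id}: {len(ds)} examples, columns = {ds.column_names}")
```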
We hope this information is helpful for your research.