yingweima2022 / CodeLLM


Code instruct tuning dataset #4

Open zhentingqi opened 2 weeks ago

zhentingqi commented 2 weeks ago

Hi! You mentioned in the paper that "The second part comes from the open-source data CodeAlpaca (Chaudhary, 2023) and our build dataset, with 150K instructions." Could you elaborate on how you obtained the 150K instructions? Are you planning to open-source these data? Thanks!

yingweima2022 commented 1 week ago

Thank you for your interest in our work. We constructed the 150K instruction dataset by leveraging several sources. Firstly, we utilized the training sets from CosQA [1] and MBPP [2] to create and supplement our code instruction data; both of these datasets are open-source. Additionally, we incorporated a cleaned code instruction set provided by the PanGu Alpha team, which is not publicly available at this time.

[1] CosQA Dataset
[2] MBPP Dataset
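As a rough illustration of the kind of conversion described above, the snippet below sketches how a record from an MBPP-style training set (a natural-language task description plus a reference solution) could be mapped into an instruction/output pair. The field names (`text`, `code`) and the helper `to_instruction` are assumptions for illustration, not the authors' actual pipeline.

```python
def to_instruction(record):
    """Map one MBPP-style record (task description + solution code)
    into an instruction-tuning pair. Field names are hypothetical."""
    return {
        "instruction": record["text"],   # natural-language task description
        "output": record["code"],        # reference solution
    }

# Example record in the assumed MBPP-style format
sample = {
    "text": "Write a function to add two numbers.",
    "code": "def add(a, b):\n    return a + b",
}

pair = to_instruction(sample)
print(pair["instruction"])
```

In practice one would iterate this over the full training split and deduplicate or filter the results before mixing them with other instruction sources.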

For certain reasons, we do not plan to open-source our dataset at this time, but there are many open-source instruction datasets in the community that may be useful to you:

- Magicoder-Evol-Instruct-110K
- Evol-CodeAlpaca-v1
- Code-Feedback

We hope this information is helpful for your research.