A training dataset generator for Guesslang's deep learning model.
The purpose of GuesslangTools is to find and download a million source code files. These files are used to train, evaluate, and test Guesslang, a deep learning tool that detects programming languages.
The files are retrieved from more than 100k public open source GitHub repositories.
The million source code files used to feed Guesslang are generated as follows:
This workflow is fully automated but takes several hours to complete, especially the download part. Fortunately, it can be stopped and resumed at any moment.
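One common way to make a long-running workflow resumable is to record each finished item on disk and skip it on the next run. The sketch below illustrates that general idea only; it is not GuesslangTools' actual implementation. The names `run_resumable`, `process`, and `state_file` are hypothetical and chosen for illustration.

```python
import json
from pathlib import Path
from typing import Callable, Iterable, List


def run_resumable(
    items: Iterable[str],
    process: Callable[[str], None],
    state_file: Path,
) -> List[str]:
    """Process each item once, recording progress so a rerun skips finished work.

    Hypothetical helper, not part of GuesslangTools.
    Returns the items processed during this call.
    """
    # Load the items finished by earlier runs, if a state file exists.
    done = set(json.loads(state_file.read_text())) if state_file.exists() else set()
    processed_now = []

    for item in items:
        if item in done:
            continue  # Already finished in a previous run.
        process(item)  # May raise or be interrupted; earlier progress is kept.
        done.add(item)
        processed_now.append(item)
        # Persist after every item so an interruption loses at most one item.
        state_file.write_text(json.dumps(sorted(done)))

    return processed_now
```

Here `process` stands in for any slow step, such as downloading one repository. Writing the state file after every item trades some disk I/O for the ability to stop at any point without redoing finished work.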
GuesslangTools ensures that:
You can install GuesslangTools from the source code by running:
pip install .
You can run GuesslangTools from a terminal as follows:
gltool /path/to/generated_datasets/
Several options and hacks are available to fine-tune the size and diversity of the generated datasets. To list all the options, run:
gltool --help
Guesslang icon created with AndroidAssetStudio
Repository dataset downloaded from Libraries.io Open Source Repository and Dependency Metadata
SQL repositories dataset retrieved from The Public Git Archive
GuesslangTools — Copyright (c) 2021 Y. SOMDA, MIT License