yoeo / guesslang

Detect the programming language of a source code
https://guesslang.readthedocs.io
MIT License

How to add new language #40

Open yjmm10 opened 3 years ago

yjmm10 commented 3 years ago

Hello, if I want to do transfer learning based on your model, can I use the trained model? I tried to load the trained model, but it had no effect. I hope to get your reply.

yjmm10 commented 3 years ago

When I download the dataset with guesslangtools, many repositories no longer exist, and the GitHub server rejects my requests.

yoeo commented 3 years ago

Hello @yjmm10

I tried to load the trained model, but it had no effect. I hope to get your reply.

I think that the current model doesn't suit transfer learning very well. The list of supported languages is embedded in the model graph itself, which means that you'll have to hack the graph somehow to add information about new languages. That might change in future versions, but currently there are a few blockers (I can go into more detail if required).

Today, the only recommended way to add new languages is to use guesslangtools to build a dataset that includes them, and then train a new model on that dataset.

yoeo commented 3 years ago

When I download the dataset with guesslangtools, many repositories no longer exist,

Yes, that's expected: the GitHub public repository list that I use was last updated in January 2020 (https://zenodo.org/record/3626071/). You can safely ignore this warning.

yoeo commented 3 years ago

the GitHub server rejects my requests

Strange... The main Guesslangtools workflow relies only on git commands because, as far as I know, they are not (yet) restricted by GitHub servers. The GitHub website and API are heavily restricted, though.

Can you share the errors that you're getting?

yjmm10 commented 3 years ago

the GitHub server rejects my requests

Strange... The main Guesslangtools workflow relies only on git commands because, as far as I know, they are not (yet) restricted by GitHub servers. The GitHub website and API are heavily restricted, though.

Can you share the errors that you're getting?

Thank you for your reply. This is the error message. It often happens when downloading the zip file.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "I:/Private/guesslangtools/guesslangtools/__main__.py", line 104, in <module>
    main()
  File "I:/Private/guesslangtools/guesslangtools/__main__.py", line 89, in main
    run_workflow()
  File "I:\Private\guesslangtools\guesslangtools\app.py", line 14, in run_workflow
    compressed_repositories.download()
  File "I:\Private\guesslangtools\guesslangtools\common.py", line 112, in wrapped
    result = func(*args, **kw)
  File "I:\Private\guesslangtools\guesslangtools\workflow\compressed_repositories.py", line 100, in download
    for step, row in enumerate(pool_imap(_download_repository, rows), 1):
  File "I:\Private\guesslangtools\guesslangtools\common.py", line 213, in pool_imap
    for result in pool.imap(_apply, iterable):
  File "D:\.conda\envs\base\lib\multiprocessing\pool.py", line 868, in next
    raise value
requests.exceptions.ConnectionError: ('Connection aborted.', ConnectionResetError(10054, 'An existing connection was forcibly closed by the remote host.', None, 10054, None))

Process finished with exit code 1

or

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "I:/Private/guesslangtools/guesslangtools/__main__.py", line 104, in <module>
    main()
  File "I:/Private/guesslangtools/guesslangtools/__main__.py", line 89, in main
    run_workflow()
  File "I:\Private\guesslangtools\guesslangtools\app.py", line 14, in run_workflow
    compressed_repositories.download()
  File "I:\Private\guesslangtools\guesslangtools\common.py", line 112, in wrapped
    result = func(*args, **kw)
  File "I:\Private\guesslangtools\guesslangtools\workflow\compressed_repositories.py", line 97, in download
    for step, row in enumerate(pool_imap(_download_repository, rows), 1):
  File "I:\Private\guesslangtools\guesslangtools\common.py", line 213, in pool_imap
    for result in pool.imap(_apply, iterable):
  File "D:\.conda\envs\base\lib\multiprocessing\pool.py", line 868, in next
    raise value
ValueError: check_hostname requires server_hostname

Process finished with exit code 1

yoeo commented 3 years ago

Okay @yjmm10, it looks like you are using an older version of guesslangtools (version < 1.0). Older versions of guesslangtools downloaded the repositories directly from GitHub HTTP servers. Because of GitHub HTTP server restrictions (like the ones that you are experiencing), I switched to using Git commands instead.

You can install the latest version of guesslangtools with the following commands:

# Clone the latest version of the code
git clone https://github.com/yoeo/guesslangtools.git
cd guesslangtools

# Edit the language description file to add the new languages information
vi data/languages.yaml

# Install the new Guesslangtools on your system
pip install -Ue .
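
The switch described above, from HTTP downloads to Git commands, can be pictured with a small sketch. This is not the guesslangtools implementation: the function `clone_with_retry`, its parameters, and the retry policy are assumptions made for illustration. It only shows that a repository can be fetched by calling `git clone` as a subprocess and retried if the connection drops.

```python
import subprocess
import time
from pathlib import Path
from typing import Callable, Sequence


def clone_with_retry(
    url: str,
    destination: Path,
    attempts: int = 3,
    delay: float = 2.0,
    run: Callable[[Sequence[str]], int] = lambda cmd: subprocess.run(cmd).returncode,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Shallow-clone `url` into `destination`, retrying when git fails.

    Hypothetical helper: the name, defaults, and retry policy are
    illustrative and are not taken from guesslangtools.
    """
    # --depth 1 fetches only the latest commit, which keeps the download small.
    command = ["git", "clone", "--depth", "1", url, str(destination)]
    for attempt in range(1, attempts + 1):
        if run(command) == 0:
            return True
        if attempt < attempts:
            # Wait a little longer after each failed attempt.
            sleep(delay * attempt)
    return False
```

For example, `clone_with_retry("https://github.com/yoeo/guesslang.git", Path("guesslang"))` returns `True` once a clone succeeds and `False` after three failed attempts. The `run` and `sleep` parameters exist only so the function can be exercised without network access.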

yoeo commented 3 years ago

After installing guesslangtools you can run it to generate the dataset:

# You can change the --nb-xxx parameters to have more or fewer examples in your dataset
gltool /path/to/new/dataset

It will take hours, and when it is done, you can train Guesslang:

# Clone Guesslang
git clone https://github.com/yoeo/guesslang.git
cd guesslang

# Install Guesslang in "developer mode"
pip install -Ue .

# Copy the language mapping generated in the dataset (`languages.json`) into the Guesslang repository
cp /path/to/new/dataset/languages.json ./data/languages.json

# Run the training
guesslang --train /path/to/new/dataset/files --steps 10000 --model /path/to/new/model

I'm using Linux command-line syntax here, and I hope that it won't be hard to convert these commands into Windows shell commands.
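
As one example of that conversion, the `cp` step can be done the same way on Linux and Windows with a short Python script. This is a sketch only: the function name `copy_language_mapping` is hypothetical and not part of Guesslang. The only details taken from the commands above are the `languages.json` file name and its `data/` destination inside the Guesslang repository.

```python
import shutil
from pathlib import Path


def copy_language_mapping(dataset_dir: Path, guesslang_repo: Path) -> Path:
    """Copy `languages.json` from the dataset into `<repo>/data/`.

    Hypothetical helper that mirrors:
        cp /path/to/new/dataset/languages.json ./data/languages.json
    """
    source = dataset_dir / "languages.json"
    target = guesslang_repo / "data" / "languages.json"
    if not source.is_file():
        raise FileNotFoundError(f"Language mapping not found: {source}")
    # Create the data directory if it is missing, then overwrite the mapping.
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(source, target)
    return target
```

Run from the Guesslang repository, a call such as `copy_language_mapping(Path("/path/to/new/dataset"), Path("."))` performs the copy. `pathlib` handles the path separators, so the same call works with Windows paths.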