yoeo / guesslangtools

Tool to build a training dataset for Guesslang, the programming language guesser
MIT License

RuntimeError: Need more than 3, only 2 repositories usable for language TOML #7

Open cao-nv opened 2 years ago

cao-nv commented 2 years ago

I'm impressed by the simplicity and accuracy of the Guesslang model. However, it is built on the TensorFlow Estimator API, which is outdated and cumbersome to use from other languages. I want to port the Guesslang model to TF 2.0 with a better architecture, so that it can be deployed from Python and from other languages, hopefully with higher accuracy. I got the error below when running guesslangtools to download and prepare the dataset. How can I get around this problem? For example, could I manually add TOML repositories, or skip this language?

Thanks

07:54:07 WARNING: Checking extensions: "h" is associated with more than one language: ['C', 'C++', 'Objective-C']
07:54:07 WARNING: Checking extensions: "hh" is associated with more than one language: ['C++', 'PHP']
07:54:07 WARNING: Checking extensions: "m" is associated with more than one language: ['Matlab', 'Objective-C']
07:54:07 WARNING: Checking extensions: "pl" is associated with more than one language: ['Perl', 'Prolog']
07:54:07 INFO: Found in the cache: /mnt/hdd/guesslangtools/dataset/01_repositories_dataset.tar.gz
07:54:07 INFO: Found in the cache: /mnt/hdd/guesslangtools/dataset/02_repositories_dataset.csv
07:54:07 INFO: Found in the cache: /mnt/hdd/guesslangtools/dataset/03_shrunk_repositories_dataset.csv
07:54:07 INFO: Found in the cache: /mnt/hdd/guesslangtools/dataset/04_altered_repositories_dataset.csv
07:54:07 INFO: Found in the cache: /mnt/hdd/guesslangtools/dataset/05_selected_repositories.csv
07:54:07 INFO: Found in the cache: /mnt/hdd/guesslangtools/dataset/06_prepare_repositories_to_download.csv
07:54:07 INFO: Found in the cache: /mnt/hdd/guesslangtools/dataset/07_downloaded_repositories.csv
07:54:07 INFO: Found in the cache: /mnt/hdd/guesslangtools/dataset/09_deduplicated_files.csv
07:54:07 INFO: Found in the cache: /mnt/hdd/guesslangtools/dataset/09_deduplicated_files.csv
07:54:07 INFO: Split repositories by usage: train, valid & test
07:54:07 INFO: This operation should take few seconds...
07:54:35 INFO: Total downloaded repositories: 272292
07:54:35 INFO: Assembly nb repositories, train: 4408, valid: 944, test: 944
07:54:35 INFO: Batchfile nb repositories, train: 3956, valid: 847, test: 847
07:54:35 INFO: C nb repositories, train: 4403, valid: 943, test: 943
07:54:35 INFO: C# nb repositories, train: 4297, valid: 920, test: 920
07:54:35 INFO: C++ nb repositories, train: 4401, valid: 943, test: 943
07:54:35 INFO: Clojure nb repositories, train: 4871, valid: 1043, test: 1043
07:54:35 INFO: CMake nb repositories, train: 4084, valid: 875, test: 875
07:54:35 INFO: COBOL nb repositories, train: 225, valid: 48, test: 48
07:54:35 INFO: CoffeeScript nb repositories, train: 4063, valid: 870, test: 870
07:54:35 INFO: CSS nb repositories, train: 3930, valid: 841, test: 841
07:54:35 INFO: CSV nb repositories, train: 2, valid: 1, test: 1
07:54:35 INFO: Dart nb repositories, train: 2825, valid: 605, test: 605
07:54:35 INFO: DM nb repositories, train: 269, valid: 57, test: 57
07:54:35 INFO: Dockerfile nb repositories, train: 1683, valid: 360, test: 360
07:54:35 INFO: Elixir nb repositories, train: 4065, valid: 871, test: 871
07:54:35 INFO: Erlang nb repositories, train: 3698, valid: 792, test: 792
07:54:35 INFO: Fortran nb repositories, train: 4411, valid: 945, test: 945
07:54:35 INFO: Go nb repositories, train: 4494, valid: 962, test: 962
07:54:35 INFO: Groovy nb repositories, train: 4561, valid: 977, test: 977
07:54:35 INFO: Haskell nb repositories, train: 4944, valid: 1059, test: 1059
07:54:35 INFO: HTML nb repositories, train: 4159, valid: 890, test: 890
07:54:35 INFO: INI nb repositories, train: 4, valid: 1, test: 1
07:54:35 INFO: Java nb repositories, train: 4427, valid: 948, test: 948
07:54:35 INFO: JavaScript nb repositories, train: 4378, valid: 937, test: 937
07:54:35 INFO: JSON nb repositories, train: 33, valid: 6, test: 6
07:54:35 INFO: Julia nb repositories, train: 3966, valid: 849, test: 849
07:54:35 INFO: Kotlin nb repositories, train: 4234, valid: 906, test: 906
07:54:35 INFO: Lisp nb repositories, train: 4518, valid: 968, test: 968
07:54:35 INFO: Lua nb repositories, train: 4183, valid: 895, test: 895
07:54:35 INFO: Makefile nb repositories, train: 3968, valid: 849, test: 849
07:54:35 INFO: Markdown nb repositories, train: 2393, valid: 512, test: 512
07:54:35 INFO: Matlab nb repositories, train: 4691, valid: 1005, test: 1005
07:54:35 INFO: Objective-C nb repositories, train: 4483, valid: 960, test: 960
07:54:35 INFO: OCaml nb repositories, train: 4441, valid: 951, test: 951
07:54:35 INFO: Pascal nb repositories, train: 4229, valid: 906, test: 906
07:54:35 INFO: Perl nb repositories, train: 4389, valid: 940, test: 940
07:54:36 INFO: PHP nb repositories, train: 3539, valid: 757, test: 757
07:54:36 INFO: PowerShell nb repositories, train: 3889, valid: 833, test: 833
07:54:36 INFO: Prolog nb repositories, train: 3443, valid: 737, test: 737
07:54:36 INFO: Python nb repositories, train: 4458, valid: 954, test: 954
07:54:36 INFO: R nb repositories, train: 4743, valid: 1016, test: 1016
07:54:36 INFO: Ruby nb repositories, train: 4421, valid: 947, test: 947
07:54:36 INFO: Rust nb repositories, train: 3771, valid: 808, test: 808
07:54:36 INFO: Scala nb repositories, train: 4289, valid: 918, test: 918
07:54:36 INFO: Shell nb repositories, train: 4138, valid: 886, test: 886
07:54:36 INFO: SQL nb repositories, train: 3069, valid: 657, test: 657
07:54:36 INFO: Swift nb repositories, train: 4310, valid: 923, test: 923
07:54:36 INFO: TeX nb repositories, train: 4545, valid: 973, test: 973
Traceback (most recent call last):
  File "/home/cao/anaconda3/envs/onnx/bin/gltool", line 8, in <module>
    sys.exit(main())
  File "/home/cao/anaconda3/envs/onnx/lib/python3.7/site-packages/guesslangtools/__main__.py", line 153, in main
    run_workflow(config)
  File "/home/cao/anaconda3/envs/onnx/lib/python3.7/site-packages/guesslangtools/app.py", line 19, in run_workflow
    source_files.split(config)
  File "/home/cao/anaconda3/envs/onnx/lib/python3.7/site-packages/guesslangtools/common.py", line 175, in wrapped
    result = func(config, *args, **kw)
  File "/home/cao/anaconda3/envs/onnx/lib/python3.7/site-packages/guesslangtools/workflow/source_files.py", line 268, in split
    f'Need more than {MIN_REPOSITORIES}, '
RuntimeError: Need more than 3, only 2 repositories usable for language TOML
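
Since the run stops at the first language that fails the check, it can help to scan the log above for other languages that are also close to the limit before deciding what to drop or supplement by hand. The following is a minimal sketch that only parses log lines in the format shown above; the function names and the threshold of 50 are hypothetical choices for illustration and are not part of guesslangtools.

```python
import re

# Matches log lines such as:
#   "07:54:35 INFO: CSV nb repositories, train: 2, valid: 1, test: 1"
LINE_RE = re.compile(
    r"INFO: (?P<language>.+?) nb repositories, "
    r"train: (?P<train>\d+), valid: (?P<valid>\d+), test: (?P<test>\d+)"
)


def repository_counts(log_text):
    """Return {language: train + valid + test} for every matching log line."""
    counts = {}
    for line in log_text.splitlines():
        match = LINE_RE.search(line)
        if match:
            counts[match["language"]] = sum(
                int(match[key]) for key in ("train", "valid", "test")
            )
    return counts


def scarce_languages(counts, threshold):
    """Return the sorted languages whose total is below `threshold`.

    `threshold` is an arbitrary cut-off chosen by the caller, not the
    tool's own MIN_REPOSITORIES value.
    """
    return sorted(lang for lang, total in counts.items() if total < threshold)


# A few lines copied from the log above.
SAMPLE_LOG = """\
07:54:35 INFO: COBOL nb repositories, train: 225, valid: 48, test: 48
07:54:35 INFO: CSV nb repositories, train: 2, valid: 1, test: 1
07:54:35 INFO: INI nb repositories, train: 4, valid: 1, test: 1
07:54:35 INFO: JSON nb repositories, train: 33, valid: 6, test: 6
"""

counts = repository_counts(SAMPLE_LOG)
print(scarce_languages(counts, threshold=50))
```

On these sample lines the script flags CSV, INI and JSON. TOML does not appear because the tool raised the error before logging it, so languages after TeX in the tool's list would need to be checked separately.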

Erik-Handeland commented 1 year ago

Did you ever figure out a solution? I'm getting a different error, a UnicodeDecodeError at line 67 of repositories_dataset.py, but really I'm just looking for an updated version of the model that uses TF 2, so I can convert it to CoreML.