yoeo / guesslang

Detect the programming language of a source code
https://guesslang.readthedocs.io
MIT License
773 stars 110 forks

Guesslang in VS Code #29

Closed isidorn closed 2 years ago

isidorn commented 3 years ago

Hi there,

My name is Isidor and I work on VS Code. We have the following problem:

Novice users create a new untitled file and start typing, and they have no clue that they have to set the language mode to get all the language smartness that VS Code provides. So we were thinking of using some smart language detection to set the language for the user automatically.

I was doing a bit of online research and I came across this project - looks very cool!

Is it possible to somehow have this work as a Node module instead of Python? Then we could consume it easily in VS Code and things might just work. Even cooler would be if it worked in the browser.

Let me know if you are interested; we can also set up a meeting where I could explain our use case in more detail. Thanks!

yoeo commented 3 years ago

Hello @isidorn

Actually, I tried to convert the Guesslang model to JavaScript two years ago because I wanted to create an Atom extension, a VS Code extension and a JavaScript front-end library.

But it was a real challenge to convert the model to TensorFlow.js because several elements of the model were not implemented in TensorFlow.js at that time. I tried different options like simplifying the Guesslang model, splitting the model, compiling the missing elements from C++ to WebAssembly, etc. While some solutions "worked", the result was way too buggy and dirty to ship.

By the way, sorry for the late answer.

yoeo commented 3 years ago

It would be nice to have the insight of someone who managed to convert a canned TensorFlow model to TensorFlow.js.

isidorn commented 3 years ago

@yoeo thanks for your answer. We are actually working with the TensorFlow team so that they support some of the missing functionality and we can convert this model to JS. Please check out this issue https://github.com/tensorflow/tfjs/issues/4838 Let me also ping the TensorFlow team again to try to get some progress here. It would be super cool to have this in VS Code.

isidorn commented 3 years ago

@yoeo the TensorFlow team has just updated TensorFlow.js, so it is now possible to run your model in the browser. For more details, check out this comment https://github.com/tensorflow/tfjs/issues/4838#issuecomment-850730654

We will look into adopting this in VS Code next milestone, in June. In the meantime, is it possible for the model to be updated to also classify JSON? This is a very common language for our users and it would be great if the model could support it.

yoeo commented 3 years ago

Hi @isidorn that's really good news.

Yes, it is possible to add JSON. However, it will take some time to generate a new training dataset that includes JSON and other requested languages like Visual Basic, Pascal, Kotlin, XML, YAML, etc. By the way, the part that takes the most time is actually downloading ~1TB of repositories from GitHub.

isidorn commented 3 years ago

@yoeo cool, it would be really useful for us to add JSON when possible. Makes sense that the 1TB download is the slowest part... Thanks a lot and I will provide more feedback in a couple of weeks when I try to integrate all of this into VS Code.

isidorn commented 3 years ago

There's progress on our side and we are looking into adding this to our VS Code product. More details can be found here https://github.com/tensorflow/tfjs/issues/4838#issuecomment-859275362 and https://github.com/microsoft/vscode/issues/118455

Two things we would still really like to improve in order to ship this:

@yoeo if you do not have time, can you give us some instructions on how to do the first? We could then also put a help-wanted label on this in the VS Code repository; since we have lots of contributors, maybe somebody will volunteer.

Thanks a lot for this great model

isidorn commented 3 years ago

We were able to compress the model with gzip down to 20kb, so we are good regarding the size. However, improving the model to cover more languages would be fantastic.

yoeo commented 3 years ago

Wow I wonder how you were able to compress the model that much, that's insane!

> more languages support (as we discussed above)

I'll try experimenting with that this weekend.

> can you give us some instructions on how to do the first

Of course. I use this tool to build the dataset: https://github.com/yoeo/guesslangtools You can find some documentation in the README but here's a quick guide on how to add new languages with this tool:

# Install Guesslang & GuesslangTools inside your virtualenv in developer mode
git clone git@github.com:yoeo/guesslang.git guesslang
cd guesslang/
pip install -e .
cd ..

git clone git@github.com:yoeo/guesslangtools.git guesslangtools
cd guesslangtools/
pip install -e .
cd ..

# Add the new languages to the language mapping
vi guesslang/guesslang/data/languages.json
cp guesslang/guesslang/data/languages.json guesslangtools/guesslangtools/data/languages.json

# Build the dataset (might take a few days, depending on your Internet connection)
DESTINATION_PATH=...  # dataset directory, 1TB of free space recommended
gltool "$DESTINATION_PATH"

# Train the model (might take a few hours, depending on your computer speed)
guesslang --train "$DESTINATION_PATH" --model ./new_model/

# Play with the new model
echo '
#include <stdio.h>

int main(int argc, char* argv[])
{
  printf("Hello world");
}
' | guesslang --model ./new_model/  # Should output "Programming language: C"
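
Since the dataset build recommends about 1TB of free space, it may be worth checking the target filesystem before starting `gltool`. The following is only a sketch and not part of Guesslang or GuesslangTools: the `free_gib` helper, the `REQUIRED_GIB` threshold and the `.` fallback for `DESTINATION_PATH` are assumptions made for illustration.

```shell
# Print the free space, in whole GiB, on the filesystem holding a directory.
# (free_gib is a hypothetical helper, not a Guesslang command.)
free_gib() {
  # df -P prints one POSIX-format line per filesystem; -k reports 1024-byte blocks.
  df -Pk "$1" | awk 'NR == 2 { printf "%d\n", $4 / 1024 / 1024 }'
}

DESTINATION_PATH=${DESTINATION_PATH:-.}  # placeholder default for this sketch
REQUIRED_GIB=1024                        # assumed threshold, roughly the 1TB mentioned above

available=$(free_gib "$DESTINATION_PATH")
if [ "$available" -lt "$REQUIRED_GIB" ]; then
  echo "Only ${available} GiB free in $DESTINATION_PATH; ${REQUIRED_GIB} GiB recommended." >&2
else
  echo "${available} GiB free in $DESTINATION_PATH; enough to start the download."
fi
```

The check only warns; whether a smaller disk is acceptable depends on how many languages and repositories end up in the dataset.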

Thanks for the updates @isidorn.

yoeo commented 3 years ago

Hello, just an update.

I'm trying to add the following languages to Guesslang:

I built the list according to the requests https://github.com/yoeo/guesslang/issues/24 https://github.com/yoeo/guesslang/issues/23 https://github.com/yoeo/guesslang/issues/19, the TIOBE language index and Stack Overflow's popular languages.

But there are a few issues:

Therefore my current strategy is to add the "simplest" languages first, then bump Guesslang and take some time to work on the trickier languages.

Thanks.

isidorn commented 3 years ago

@yoeo Thanks a lot for looking into this and for providing an update. Starting with the simplest languages makes good sense to me.

As for the compression: after being converted for TensorFlow.js, the model was a .json file, and that seems to be easily compressible.
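
To illustrate why a JSON file tends to compress well, here is a small sketch that builds a repetitive JSON file and gzips it. The file content is made up and only stands in for a converted model; it does not reproduce the real model or the 20kb figure.

```shell
# Build a small, repetitive JSON file standing in for a converted model.
# (Hypothetical content; the real model file is not reproduced here.)
workdir=$(mktemp -d)
sample="$workdir/sample_model.json"

printf '{"weights": [' > "$sample"
i=0
while [ "$i" -lt 2000 ]; do
  printf '{"name": "layer_%d", "dtype": "float32", "shape": [1, 64]}, ' "$i" >> "$sample"
  i=$((i + 1))
done
printf '{"name": "last", "dtype": "float32", "shape": [1, 64]}]}\n' >> "$sample"

# Compress to a separate file so the original is kept for comparison.
gzip -c "$sample" > "$sample.gz"

original_size=$(wc -c < "$sample" | tr -d ' ')
compressed_size=$(wc -c < "$sample.gz" | tr -d ' ')
echo "original: $original_size bytes, gzip: $compressed_size bytes"
```

Repeated keys and similar values give gzip a lot of redundancy to remove, which is the effect described above.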

If needed, I can put help-wanted on the vscode issues, and somebody from the community can also help here. Just let me know... We would love to ship this feature in July / August, but at the end of the day there is no rush; we would like to get it right, and there are other things we need to look at.

yoeo commented 3 years ago

Hi @isidorn

I've made some progress during the last couple of weekends. There is now a development version of the Guesslang model that supports most of the languages listed above (including JSON).

You can try it at https://github.com/yoeo/guesslang/pull/33

There are still issues that I need to solve before merging it:

  1. the dataset download was taking forever :hourglass_flowing_sand: I had to refactor Guesslangtools to speed things up a little. This work is still ongoing... See https://github.com/yoeo/guesslangtools/pull/4
  2. the "Pascal" language training dataset is broken; I spotted the issue too late and now I have to generate a new one
  3. the dataset is currently skewed: some languages have way more example files than others (e.g. 27k examples for Kotlin versus 9k examples for TOML). I'll have to find more examples to balance the dataset.
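
A quick way to spot this kind of skew is to count example files per language. The sketch below assumes that each example file's extension identifies its language; the `count_by_extension` helper and the demo directory are made up for illustration and are not part of Guesslangtools.

```shell
# Count example files per extension under a dataset directory.
# Assumption: a file's extension identifies its language.
count_by_extension() {
  find "$1" -type f -name '*.*' \
    | sed 's/.*\.//' \
    | sort \
    | uniq -c \
    | sort -rn \
    | awk '{ print $2, $1 }'
}

# Demo on a throwaway directory with a deliberately skewed mix.
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/train"
for n in 1 2 3; do
  : > "$demo_dir/train/example_$n.kt"
done
: > "$demo_dir/train/example_1.toml"

counts=$(count_by_extension "$demo_dir")
echo "$counts"
```

The function prints one `extension count` pair per line, largest first, so under-represented languages show up at the bottom of the list.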

isidorn commented 3 years ago

@yoeo this is great work, thanks a lot for the update 👏 I think @tylerleonhardt plans to jump on this next milestone (July), so the timings seem to align. I think we still have to streamline the conversion to TensorFlow.js as captured here.

I will be on vacation for the next three weeks, so expect slow responses from me.

yoeo commented 2 years ago

Hi @isidorn, @TylerLeonhardt

I just finished adding the new languages support to Guesslang https://github.com/yoeo/guesslang/pull/33 Guesslang now supports 54 programming languages (24 more than before):

| Languages | | | | |
|---|---|---|---|---|
| Assembly | Batchfile | C | C# | C++ |
| Clojure | CMake | COBOL | CoffeeScript | CSS |
| CSV | Dart | DM | Dockerfile | Elixir |
| Erlang | Fortran | Go | Groovy | Haskell |
| HTML | INI | Java | JavaScript | JSON |
| Julia | Kotlin | Lisp | Lua | Makefile |
| Markdown | Matlab | Objective-C | OCaml | Pascal |
| Perl | PHP | PowerShell | Prolog | Python |
| R | Ruby | Rust | Scala | Shell |
| SQL | Swift | TeX | TOML | TypeScript |
| Verilog | Visual Basic | XML | YAML | |

Feel free to tell me if you have any feedback on this new model.

ghost commented 2 years ago

Being able to tell the difference between Matlab, Objective-C and Julia is killer, thanks*e(6)!

isidorn commented 2 years ago

@yoeo this is amazing, thanks a lot 👏 We really appreciate your help. I just came back from vacation, but @tylerleonhardt was working on this and there is already a prototype of this in VS Code. More details can be found here https://github.com/microsoft/vscode/issues/129004 In short, if you want to try it out, you should set workbench.editor.untitled.experimentalLanguageDetection in the VS Code settings. We are about to pick up your latest model.

TylerLeonhardt commented 2 years ago

I think we can go ahead and close this issue now :) The guesslang model now ships in VS Code and, beyond misc improvements to the model, there are no more action items here.

yoeo commented 2 years ago

> In short if you want to try it out you should set workbench.editor.untitled.experimentalLanguageDetection in vs code settings.

Thanks for the info. You've got a new beta-tester :slightly_smiling_face:

I'm closing this issue; do not hesitate to create new ones for improvement requests.

isidorn commented 2 years ago

For reference, here's the test plan item on the vscode side that has good steps on how to set it up https://github.com/microsoft/vscode/issues/129436