yoeo / guesslang

Detect the programming language of a source code
https://guesslang.readthedocs.io
MIT License

Add plaintext as a language (like .txt) #45

Open TylerLeonhardt opened 2 years ago

TylerLeonhardt commented 2 years ago

guesslang's model in VS Code has been awesome. We will turn it on by default in the next version (shipping Wednesday/Thursday).

One of the biggest problems we face is related to folks using VS Code for notetaking.

They want to be able to write up simple notes in .txt-like files...but unfortunately, we run the model on these files with a variety of different results.

For example, take a look at: https://github.com/microsoft/vscode/issues/131912

where @Tyriar inserts a bunch of random lorem ipsum text... that gets detected as SQL sometimes and for me I've seen BATCH get chosen sometimes...

Ideally, guesslang could include txt as a language result; that would probably help in these cases.

yoeo commented 2 years ago

Hi @TylerLeonhardt, that's an interesting subject.

I tried adding plaintext as a programming language before. The plain text prediction accuracy was poor, and the prediction accuracy for the other languages decreased as well. After some testing, it looks like the guesslang model works well when it is trained on texts that have a strict syntax (most source code), but it doesn't perform well on texts that don't have a strict syntax (plain text).

But there are a few ideas for implementing plain text detection in guesslang.

A. Use a better machine learning model

Guesslang's current model is very simple/boring. It should be possible to create a more complex model that would handle plain text while remaining as accurate as, or more accurate than, the current model. However, the current simple model is much faster to implement, train and troubleshoot than more complex models.

B. Chain simple models

An alternative would be to create a new simple model that only tells if a given text is plain text or source code, without guessing the language. Then chain this new model with the current guesslang model as follows:

<input text>
    |
    ▼
[Model 1: is plaintext?]
    |
    +-------------------+
    |                   |
    ▼                   ▼
   yes                  no
    |                   |
    |                   ▼
    |           [Model 2: which language?]
    |                   |
    |           +-------+-------+-------+
    |           |       |       |       |
    ▼           ▼       ▼       ▼       ▼
  Plaintext     C      C++     Java    ...
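
The chaining in the diagram above can be sketched in a few lines of Python. This is only an illustration of the control flow, not guesslang code: `chain_models`, `toy_is_plaintext` and `toy_which_language` are hypothetical names, and the two toy functions are crude stand-ins for trained models so that the sketch runs on its own.

```python
from typing import Callable

PLAINTEXT = "Plaintext"


def chain_models(
    is_plaintext: Callable[[str], bool],
    which_language: Callable[[str], str],
) -> Callable[[str], str]:
    """Build a detector that asks model 1 first and model 2 only for source code."""

    def detect(text: str) -> str:
        if is_plaintext(text):
            return PLAINTEXT
        return which_language(text)

    return detect


# Toy stand-ins (assumptions for illustration); real models would be trained classifiers.
def toy_is_plaintext(text: str) -> bool:
    # Treat text as plain text when it has none of these code-like characters.
    return not any(ch in text for ch in "{}();=<>")


def toy_which_language(text: str) -> str:
    return "C" if "#include" in text else "Java"


detect = chain_models(toy_is_plaintext, toy_which_language)
print(detect("Remember to buy milk tomorrow"))                 # Plaintext
print(detect("#include <stdio.h>\nint main() { return 0; }"))  # C
```

Keeping the two models behind plain callables means the language model never sees plain text during training, which is the point of option B.
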

TylerLeonhardt commented 2 years ago

I like both of these strategies :) plaintext is a tricky one. Thank you for the context!

TylerLeonhardt commented 2 years ago

cc @dynamicwebpaige if you have any thoughts

yoeo commented 2 years ago

Any help on that is welcome!!

dynamicwebpaige commented 2 years ago

plaintext is certainly an interesting case -- and I imagine that there would be similar concerns for .md files.

Would it be possible to implement the chained approach in the near-term (perhaps with both plaintext and markdown), and consider a more comprehensive, general-purpose model as a future solution?

i-ky commented 2 years ago

Yesterday I was bitten by this issue. I opened a Docker Compose .env file for one of my projects in VS Code and was surprised that its language was detected as "Dockerfile". Of course, the Dockerfile language server was activated and the whole file was immediately full of red squiggles... Searching the Internet for similar issues did not give any results. At first I thought it was a bug in the Docker-related extensions. I had a look at their code, but could not find any logic for .env files. Then I tried disabling these extensions in VS Code, but the language was still detected as "Dockerfile". Out of ideas, I tried opening a very similar .env file from a sister project and (surprise!) VS Code detected it as "Plain text". Only then did I start looking at the VS Code release notes (this and this) and eventually dug up Guesslang. As far as I can tell, the "Dockerfile" vs. "Plain text" behaviour was triggered by the fact that one of the .env files had a COMPOSE_PROJECT_NAME value vaguely resembling one of the Dockerfile commands. This experience was very frustrating.

i-ky commented 2 years ago

Speaking of potential solutions, I've heard that some machine learning models can return a confidence along with the classification result, e.g. "This is definitely Java" or "This is probably C++" instead of just "This is Java" or "This is C++". If Guesslang could do that, then it would seem reasonable to fall back to plaintext if the model is not confident enough about any of the supported languages. Otherwise the result is too unstable for languages that the model does not support and was not trained on.
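
A minimal sketch of that fallback, assuming the model's output is available as a list of (language, probability) pairs. The function name `pick_language` and the 0.5 threshold are made up for illustration; a usable cutoff would have to be tuned on real files.

```python
from typing import List, Tuple


def pick_language(
    scores: List[Tuple[str, float]],
    threshold: float = 0.5,      # arbitrary cutoff, an assumption for this sketch
    fallback: str = "Plaintext",
) -> str:
    """Return the most probable language, or the fallback when confidence is low."""
    if not scores:
        return fallback
    language, probability = max(scores, key=lambda pair: pair[1])
    return language if probability >= threshold else fallback


print(pick_language([("Java", 0.92), ("C++", 0.05)]))       # Java
print(pick_language([("SQL", 0.21), ("Batchfile", 0.19)]))  # Plaintext
```

With a cutoff like this, a flat distribution such as the lorem ipsum case would land on plain text instead of flipping between SQL and Batch.
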

TylerLeonhardt commented 2 years ago

@i-ky does this happen to you if you install an extension like: https://marketplace.visualstudio.com/items?itemName=mikestead.dotenv

i-ky commented 2 years ago

> @i-ky does this happen to you if you install an extension like: https://marketplace.visualstudio.com/items?itemName=mikestead.dotenv

No, with DotENV extension the file is identified as "Environment Variables".

zm-cttae-archive commented 1 year ago

Do we have an attack plan for languages like Lua, Python, and SQL? They read almost like written language, especially SQL.

Another suggestion:

If the page were tokenized into words, punctuation, and whitespace using /(?=(?<=\s)\S|(?<=\S)\s|(?<=[\W_])|(?=[\W_])|(?<=\n)[^\n]|(?<=[^\n])[\n])/g, the frequency and type of tokens would be a very useful indicator of "is this source code".
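
A rough Python sketch of the idea. It does not port the regex above; it uses a simpler pattern that, like the original, treats underscores as punctuation. `token_profile` and `punctuation_ratio` are hypothetical helper names, and the ratio is just one possible metric.

```python
import re
from collections import Counter

# Simplified stand-in for the regex above: runs of letters/digits are words,
# runs of whitespace are one token, anything else (including "_") is punctuation.
TOKEN = re.compile(r"(?P<word>[^\W_]+)|(?P<space>\s+)|(?P<punct>[\W_])")


def token_profile(text: str) -> Counter:
    """Count how many word, space and punct tokens the text contains."""
    return Counter(match.lastgroup for match in TOKEN.finditer(text))


def punctuation_ratio(text: str) -> float:
    """Share of punctuation among non-whitespace tokens (0.0 for empty text)."""
    counts = token_profile(text)
    visible = counts["word"] + counts["punct"]
    return counts["punct"] / visible if visible else 0.0


prose = "The quick brown fox jumps."
code = "for (i = 0; i < n; i++) { x[i] = y_i; }"
print(punctuation_ratio(prose) < punctuation_ratio(code))  # True
```
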

Also the data from these metrics could be used to reduce the issue churn from the VS Code use case, so I think there is significant value in investigating that.

zm-cttae-archive commented 1 year ago

Looks like TF has a tokenizer that could break ground on this issue: BertTokenizer

We need to filter out camelCase and UPPER_SNAKE_CASE_WITH_UNDERSCORE tokens for accuracy, however.
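
One possible reading of that filter, sketched with plain regular expressions rather than any tokenizer library. The patterns and the names `is_identifier_like` and `drop_identifier_tokens` are assumptions for illustration, and the input is assumed to be already split on whitespace.

```python
import re
from typing import List

# camelCase: a lowercase start followed by at least one capitalized part.
CAMEL_CASE = re.compile(r"[a-z][a-z0-9]*(?:[A-Z][a-z0-9]*)+")
# UPPER_SNAKE_CASE: uppercase parts joined by at least one underscore.
UPPER_SNAKE_CASE = re.compile(r"[A-Z][A-Z0-9]*(?:_[A-Z0-9]+)+")


def is_identifier_like(token: str) -> bool:
    """True when the whole token looks like camelCase or UPPER_SNAKE_CASE."""
    return bool(CAMEL_CASE.fullmatch(token) or UPPER_SNAKE_CASE.fullmatch(token))


def drop_identifier_tokens(tokens: List[str]) -> List[str]:
    """Remove identifier-like tokens, keeping ordinary words."""
    return [token for token in tokens if not is_identifier_like(token)]


print(drop_identifier_tokens(["set", "maxRetries", "to", "MAX_RETRIES"]))  # ['set', 'to']
```
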