Open · TylerLeonhardt opened this issue 2 years ago
Hi @TylerLeonhardt, that's an interesting subject.
I tried adding `plaintext` as a programming language before. The plain text prediction accuracy was poor, and the prediction accuracy for the other languages decreased as well.
After some testing, it looks like the guesslang model works well when it is trained on texts that have a strict syntax (most source code), but it doesn't perform well with texts that don't have a strict syntax (plain text).
But there are a few ideas for implementing plain text detection in guesslang.
Guesslang's current model is very simple/boring. It should be possible to create a more complex model that would handle plain text while remaining as accurate as, or more accurate than, the current model. However, the current simple model is much faster to implement, train, and troubleshoot than more complex models.
An alternative would be to create a new simple model that only tells if a given text is plain text or source code, without guessing the language. Then chain this new model with the current guesslang model as follows:
```
                <input text>
                     |
                     ▼
          [Model 1: is plaintext?]
                     |
          +----------+----------+
          |                     |
          ▼                     ▼
         yes                    no
          |                     |
          |                     ▼
          |     [Model 2: which language?]
          |             |
          |     +-------+-------+-------+
          |     |       |       |       |
          ▼     ▼       ▼       ▼       ▼
      Plaintext C      C++     Java    ...
```
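The chained approach above can be sketched as a two-stage pipeline. To be clear, everything below is a hypothetical stand-in (the model callables, the toy heuristics), not guesslang's actual API:

```python
from typing import Callable

def chain_classifiers(
    is_plaintext: Callable[[str], bool],   # hypothetical Model 1
    guess_language: Callable[[str], str],  # hypothetical Model 2
) -> Callable[[str], str]:
    """Run Model 2 only when Model 1 says the text is not plain text."""
    def classify(text: str) -> str:
        if is_plaintext(text):
            return "Plaintext"
        return guess_language(text)
    return classify

# Stub models for illustration only.
def toy_is_plaintext(text: str) -> bool:
    # Crude heuristic: source code tends to contain ; { } ( ) =
    return not any(ch in text for ch in ";{}()=")

def toy_guess_language(text: str) -> str:
    # Trivial two-language stand-in for the real multi-class model.
    return "C" if "#include" in text else "Java"

classify = chain_classifiers(toy_is_plaintext, toy_guess_language)
```

The nice property of this design is that Model 1 can be trained and tuned independently of the language classifier, so a plain-text false positive never even reaches Model 2.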
I like both of these strategies :) `plaintext` is a tricky one. Thank you for the context!
cc @dynamicwebpaige if you have any thoughts
Any help on that is welcome!!
`plaintext` is certainly an interesting case -- and I imagine that there would be similar concerns for `.md` files.
Would it be possible to implement the chained approach in the near term (perhaps with both `plaintext` and `markdown`), and consider a more comprehensive, general-purpose model as a future solution?
Yesterday I was bitten by this issue. I opened a Docker Compose `.env` file for one of my projects in VS Code and was surprised that its language was detected as "Dockerfile". Of course the Dockerfile language server was activated and the whole file was immediately full of red squiggles... Searching the Internet for similar issues did not give any results. At first I thought it was a bug in the Docker-related extensions. I had a look into their code, but could not find any logic for `.env` files. Then I tried disabling these extensions in VS Code; the language was still detected as "Dockerfile". Out of ideas, I tried opening a very similar `.env` file from a sister project and (surprise!) it was detected as "Plain text" by VS Code. Only then did I start looking at the VS Code release notes (this and this) and eventually dug up Guesslang. As far as I can tell, the "Dockerfile" vs. "Plain text" behaviour was triggered by the fact that one of the `.env` files had a `COMPOSE_PROJECT_NAME` value vaguely resembling one of the Dockerfile commands. This experience was very frustrating.
Speaking of potential solutions, I've heard that some machine learning models can return confidence along with classification results, e.g. "This is definitely Java" or "This is probably C++" instead of just "This is Java" or "This is C++". If Guesslang could do that, then it would seem reasonable to fall back to `plaintext` if the model is not confident enough about any of the supported languages. Otherwise the result is too unstable for languages the model is not trained on.
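The fallback idea above is easy to sketch. The threshold value and the shape of the score dictionary are illustrative assumptions here, not Guesslang's real interface:

```python
from typing import Dict

# Illustrative cutoff; a real deployment would need to tune this.
CONFIDENCE_THRESHOLD = 0.5

def detect_with_fallback(scores: Dict[str, float]) -> str:
    """Fall back to 'plaintext' when no language score is confident enough.

    `scores` maps language names to model probabilities (summing to ~1.0);
    how these scores are obtained depends on the classifier and is assumed.
    """
    best_language = max(scores, key=scores.get)
    if scores[best_language] < CONFIDENCE_THRESHOLD:
        return "plaintext"
    return best_language
```

So a spread-out distribution like `{"SQL": 0.3, "Batch": 0.3, "Lua": 0.4}` falls back to `plaintext`, while a confident one such as `{"Java": 0.9, "C++": 0.1}` keeps its top prediction.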
@i-ky does this happen to you if you install an extension like: https://marketplace.visualstudio.com/items?itemName=mikestead.dotenv
No, with DotENV extension the file is identified as "Environment Variables".
Do we have an attack plan for languages like Lua, Python, and SQL? They read almost like natural language, especially SQL.
Another suggestion: if the page was tokenized into words, punctuation, and whitespace using `/(?=(?<=\s)\S|(?<=\S)\s|(?<=[\W_])|(?=[\W_])|(?<=\n)[^\n]|(?<=[^\n])[\n])/g`, the frequency and type of tokens would be an infinitely useful indicator of "is this source code".
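Incidentally, that regex works unchanged in Python's `re` module (all of its lookbehinds are fixed-width, and `re.split` accepts zero-width patterns since Python 3.7). A minimal sketch of the tokenization it describes:

```python
import re

# The zero-width pattern from the suggestion above: it matches at the
# boundaries between words, punctuation, and whitespace without
# consuming any characters, so splitting on it keeps every character.
TOKEN_BOUNDARY = re.compile(
    r"(?=(?<=\s)\S|(?<=\S)\s|(?<=[\W_])|(?=[\W_])|(?<=\n)[^\n]|(?<=[^\n])[\n])"
)

def tokenize(text: str) -> list:
    # Drop the empty string produced when a boundary matches at position 0.
    return [tok for tok in TOKEN_BOUNDARY.split(text) if tok]
```

For example, `tokenize("print('hi')")` yields `['print', '(', "'", 'hi', "'", ')']`, and joining the tokens always reconstructs the original text.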
Also the data from these metrics could be used to reduce the issue churn from the VS Code use case, so I think there is significant value in investigating that.
Looks like TF has a tokenizer that could break ground on this issue: `BertTokenizer`. We would need to filter out `camelCase` and `UPPER_SNAKE_CASE_WITH_UNDERSCORE` tokens for accuracy, however.
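Filtering those identifier-style tokens could look like the sketch below. The two patterns are rough assumptions about what counts as `camelCase` and `UPPER_SNAKE_CASE`; real-world identifiers are messier:

```python
import re

# Illustrative patterns only.
CAMEL_CASE = re.compile(r"^[a-z]+(?:[A-Z][a-z0-9]*)+$")   # e.g. camelCase
UPPER_SNAKE = re.compile(r"^[A-Z][A-Z0-9]*(?:_[A-Z0-9]+)+$")  # e.g. MAX_SIZE

def is_identifier_like(token: str) -> bool:
    """True for tokens that look like code identifiers rather than words."""
    return bool(CAMEL_CASE.match(token) or UPPER_SNAKE.match(token))

def filter_identifiers(tokens: list) -> list:
    return [t for t in tokens if not is_identifier_like(t)]
```

So `filter_identifiers(["foo", "camelCase", "MAX_SIZE"])` keeps only `["foo"]`.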
guesslang's model in VS Code has been awesome. We will turn it on by default in the next version (shipping Wednesday/Thursday).
One of the biggest problems we face is related to folks using VS Code for notetaking.
They want to be able to write up simple notes in `.txt`-like files... but unfortunately, we run the model on these files with varying results.
For example, take a look at: https://github.com/microsoft/vscode/issues/131912
where @Tyriar inserts a bunch of random lorem ipsum text... it sometimes gets detected as SQL, and I've seen BATCH chosen at times as well...
Ideally, guesslang could include `txt` as a language result; that would probably help in these cases.