Open TylerLeonhardt opened 2 years ago
Ahh... I guess with the first 100000 characters together the guess is Julia... if I do a random chunk of 100000 it does yield TS... maybe I should chunk it, run each chunk, and then average out the results?
The start of the string seems to impact the result quite a bit which might be interesting for you.
The model actually only reads the first 10k characters, for performance reasons: https://github.com/yoeo/guesslang/blob/f4ceb1d3a39356eebd32abd8dd4d416d145e5a38/guesslang/model.py#L28
I guess that the repeated use of bitwise operators and the relative lack of semicolons at the start of the file confused the model.
In fact, less than 1% of the 27k TypeScript files that I randomly picked to train the model use the << operator.
However, I don't quite understand why the model picked Julia with such high confidence.
if I do a random chunk of 100000 it does yield TS.
That's a great idea, I'll try that on the Python version too and see if it can improve the model's overall accuracy.
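For reference, the chunk-and-average idea could look roughly like the sketch below. It is only an illustration: `predict` is a hypothetical callable standing in for whatever returns per-language probabilities for a string (it is not guesslang's actual API), and the 10k chunk size is an assumption based on the model reading only the first 10k characters.

```python
from collections import defaultdict
from typing import Callable, Dict, List

# Assumed chunk size: the model only reads the first 10k characters.
CHUNK_SIZE = 10_000


def split_into_chunks(source: str, size: int = CHUNK_SIZE) -> List[str]:
    """Split the source into consecutive chunks of at most `size` characters."""
    return [source[i:i + size] for i in range(0, len(source), size)]


def average_probabilities(
    source: str,
    predict: Callable[[str], Dict[str, float]],  # hypothetical model wrapper
    size: int = CHUNK_SIZE,
) -> Dict[str, float]:
    """Run `predict` on every chunk and average the probabilities per language."""
    chunks = split_into_chunks(source, size)
    if not chunks:
        return {}
    totals: Dict[str, float] = defaultdict(float)
    for chunk in chunks:
        for language, probability in predict(chunk).items():
            totals[language] += probability
    # A language missing from a chunk's result counts as probability 0 there.
    return {language: total / len(chunks) for language, total in totals.items()}
```

With this, one misleading chunk at the start of the file gets outvoted by the rest of the file instead of deciding the result on its own. Whether a plain mean is the best way to combine chunks is an open question.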
The model actually only reads the first 10k characters, for performance reasons:
Is this on training or inference? If only the first 10k is getting used, then I should probably only make strings with 10k characters :)
Is this on training or inference?
It reads the first 10k chars for both training & inference.
then I should probably only make strings with 10k characters
Absolutely.
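A minimal sketch of that truncation, assuming the 10k limit mentioned above; `truncate_for_model` and `MAX_MODEL_CHARS` are illustrative names, not part of guesslang:

```python
# Assumed limit: the model reads only the first 10k characters.
MAX_MODEL_CHARS = 10_000


def truncate_for_model(source: str, limit: int = MAX_MODEL_CHARS) -> str:
    """Keep only the prefix the model would read anyway."""
    return source[:limit]
```

Passing only this prefix avoids building and handing over a huge string whose tail would be ignored.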
So I was investigating https://github.com/microsoft/vscode/issues/129597
And I noticed that that issue was happening because the file is absolutely massive. That might be a tfjs issue (cc @pyu10055)
What's interesting is that I was able to grab the first 125000 characters (any more and it throws that ^) and run them through the model, and it thought with 98% confidence that it is Julia and not TypeScript:
That seems very odd to me... Surely TypeScript would be in the top couple languages... this makes me think that it could be a bug in the model, but I'll leave that up to @yoeo to decide.