yoeo / guesslang

Detect the programming language of a source code
https://guesslang.readthedocs.io
MIT License

Giant TypeScript file being called Julia #38

Open TylerLeonhardt opened 2 years ago

TylerLeonhardt commented 2 years ago

So I was investigating https://github.com/microsoft/vscode/issues/129597

And I noticed that the issue was happening because the file is absolutely massive. That might be a tfjs issue (cc @pyu10055)

What's interesting is that I was able to grab the first 125000 characters (any more and it throws the error from that issue) and run them through the model, and it thought with 98% confidence that the file is Julia and not TypeScript:

[
  { languageId: 'jl', confidence: 0.9822742342948914 },
  { languageId: 'scala', confidence: 0.016035545617341995 },
  { languageId: 'hs', confidence: 0.0016901468625292182 },
  { languageId: 'pas', confidence: 1.3825314226778573e-7 },
  { languageId: 'cpp', confidence: 3.57413348917035e-10 },
  { languageId: 'ml', confidence: 1.8221162426113047e-12 },
  { languageId: 'js', confidence: 8.594037618049957e-14 },
  { languageId: 'ts', confidence: 4.65358850900658e-14 },
  { languageId: 'vba', confidence: 7.618220256835808e-15 },
  { languageId: 'go', confidence: 5.723856057773509e-15 },
  { languageId: 'groovy', confidence: 3.010156677073889e-15 },
  { languageId: 'dart', confidence: 1.0165829367361526e-16 },
  { languageId: 'c', confidence: 3.543757880335173e-17 },
  { languageId: 'cs', confidence: 8.10666447931774e-18 },
  { languageId: 'swift', confidence: 3.1402044595342857e-18 },
  { languageId: 'mm', confidence: 7.613760166801732e-19 },
  { languageId: 'ps1', confidence: 5.315139498838977e-19 },
  { languageId: 'pm', confidence: 7.787649396626034e-21 },
  { languageId: 'md', confidence: 1.0253120787115895e-21 },
  { languageId: 'html', confidence: 3.355766328067147e-23 },
  { languageId: 'py', confidence: 1.167890584151069e-23 },
  { languageId: 'xml', confidence: 5.1753314350333106e-24 },
  { languageId: 'v', confidence: 2.9903643691862024e-25 },
  { languageId: 'ini', confidence: 5.667593976570901e-26 },
  { languageId: 'dm', confidence: 1.954242193090497e-26 },
  { languageId: 'sql', confidence: 5.364698274948056e-27 },
  { languageId: 'f90', confidence: 5.033385168846986e-27 },
  { languageId: 'php', confidence: 3.280049747541946e-27 },
  { languageId: 'lua', confidence: 3.0276543172140653e-27 },
  { languageId: 'coffee', confidence: 5.953645533418734e-28 },
  { languageId: 'java', confidence: 4.882338335168584e-28 },
  { languageId: 'r', confidence: 1.2599295269966544e-28 },
  { languageId: 'rb', confidence: 5.683628108915077e-29 },
  { languageId: 'erl', confidence: 2.1686813768150945e-29 },
  { languageId: 'tex', confidence: 2.5856190667805688e-30 },
  { languageId: 'prolog', confidence: 1.3216966443768018e-33 },
  { languageId: 'rs', confidence: 1.0031414405593044e-33 },
  { languageId: 'asm', confidence: 8.360548368235013e-34 },
  { languageId: 'matlab', confidence: 1.6285131136511483e-34 },
  { languageId: 'csv', confidence: 5.227884277503055e-35 },
  { languageId: 'sh', confidence: 8.97545819514535e-39 },
  { languageId: 'yaml', confidence: 1.8427074805871344e-42 },
  { languageId: 'ex', confidence: 7.833258415575727e-43 },
  { languageId: 'bat', confidence: 6.445972935894159e-44 },
  { languageId: 'kt', confidence: 2.6624670822171524e-44 },
  { languageId: 'clj', confidence: 0 },
  { languageId: 'cmake', confidence: 0 },
  { languageId: 'cbl', confidence: 0 },
  { languageId: 'css', confidence: 0 },
  { languageId: 'dockerfile', confidence: 0 },
  { languageId: 'json', confidence: 0 },
  { languageId: 'lisp', confidence: 0 },
  { languageId: 'makefile', confidence: 0 },
  { languageId: 'toml', confidence: 0 }
]

That seems very odd to me... Surely TypeScript would be in the top couple languages... this makes me think that it could be a bug in the model, but I'll leave that up to @yoeo to decide.

TylerLeonhardt commented 2 years ago

Ahh... I guess that for the first 100000 characters taken together, the guess is Julia... if I take a random chunk of 100000 characters, it does yield TS... maybe I should chunk the file, run each chunk through the model, and then average the results?

The start of the string seems to impact the result quite a bit which might be interesting for you.
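The chunk-and-average idea above could be sketched roughly as follows. This is only an illustration, not code from guesslang or vscode: `split_into_chunks`, `average_chunk_scores`, and `toy_scorer` are hypothetical names, and `toy_scorer` is a stand-in for whatever model call returns a confidence per language.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Scores = Dict[str, float]


def split_into_chunks(text: str, chunk_size: int) -> List[str]:
    """Split text into consecutive chunks of at most chunk_size characters."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def average_chunk_scores(
    text: str,
    score_chunk: Callable[[str], Scores],
    chunk_size: int,
) -> Scores:
    """Score every chunk separately and return the mean confidence per language."""
    chunks = split_into_chunks(text, chunk_size)
    if not chunks:
        return {}
    totals: Dict[str, float] = defaultdict(float)
    for chunk in chunks:
        for language, confidence in score_chunk(chunk).items():
            totals[language] += confidence
    # Languages missing from a chunk's result count as 0 for that chunk.
    return {language: total / len(chunks) for language, total in totals.items()}


def toy_scorer(chunk: str) -> Scores:
    """Hypothetical stand-in for the real model: leans Julia when it sees '<<'."""
    if "<<" in chunk:
        return {"jl": 0.98, "ts": 0.02}
    return {"jl": 0.05, "ts": 0.95}


if __name__ == "__main__":
    source = "a << 1\n" * 10 + "const x: number = 1;\n" * 100
    averaged = average_chunk_scores(source, toy_scorer, chunk_size=500)
    print(max(averaged, key=averaged.get), averaged)
```

With averaging, one misleading chunk at the start of the file no longer decides the result on its own, at the cost of one model call per chunk.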

yoeo commented 2 years ago

The model actually only reads the first 10k characters, for performance reasons: https://github.com/yoeo/guesslang/blob/f4ceb1d3a39356eebd32abd8dd4d416d145e5a38/guesslang/model.py#L28

I guess that the repeated use of bitwise operators and the relative lack of semicolons at the start of the file confused the model. In fact, less than 1% of the 27k TypeScript files that I randomly picked to train the model use the << operator.

However, I don't quite understand why the model picked Julia with such a high confidence.
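One rough way to look at the two signals mentioned above (use of `<<` and semicolon usage) in the part of a file the model actually sees might look like this. The function name, the returned keys, and the 10,000-character default are assumptions made for illustration; this is not how guesslang itself extracts features.

```python
from typing import Dict


def leading_text_stats(text: str, limit: int = 10_000) -> Dict[str, float]:
    """Summarize '<<' usage and semicolon-terminated lines in the first `limit` characters."""
    head = text[:limit]
    # Ignore blank lines so they do not dilute the semicolon ratio.
    lines = [line for line in head.splitlines() if line.strip()]
    with_semicolon = sum(1 for line in lines if line.rstrip().endswith(";"))
    return {
        # Plain substring count, so '<<=' and '<<<' are counted as well.
        "shift_left_count": head.count("<<"),
        "semicolon_line_ratio": with_semicolon / len(lines) if lines else 0.0,
    }


if __name__ == "__main__":
    sample = "A = 1 << 0,\nB = 1 << 1,\nconst x = 1;\n"
    print(leading_text_stats(sample))
```

Running this over the start of the problematic file and over a few typical TypeScript files would show whether the start of the file really is unusual on these two measures.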

> if I do a random chunk of 100000 it does yield TS.

That's a great idea. I'll try that on the Python version too and see if it can improve the model's overall accuracy.

TylerLeonhardt commented 2 years ago

> The model actually only reads the first 10k characters, for performance reasons:

Is this during training or inference? If only the first 10k characters are used, then I should probably only build strings of 10k characters :)

yoeo commented 2 years ago

> Is this on training or inference?

It reads the first 10k chars for both training & inference.

> then I should probably only make strings with 10k characters

Absolutely.
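A minimal sketch of what building 10k-character strings could look like on the caller's side, assuming the limit is exactly 10,000 characters as stated above. `MODEL_INPUT_LIMIT`, `head_window`, and `random_window` are hypothetical names, not part of guesslang.

```python
import random
from typing import Optional

# Assumption: taken from the "first 10k characters" statement in this thread.
MODEL_INPUT_LIMIT = 10_000


def head_window(text: str, limit: int = MODEL_INPUT_LIMIT) -> str:
    """Return the leading slice that the model would read anyway."""
    return text[:limit]


def random_window(
    text: str,
    limit: int = MODEL_INPUT_LIMIT,
    rng: Optional[random.Random] = None,
) -> str:
    """Return a randomly positioned slice of at most `limit` characters."""
    if len(text) <= limit:
        return text
    rng = rng or random.Random()
    start = rng.randint(0, len(text) - limit)  # inclusive on both ends
    return text[start:start + limit]


if __name__ == "__main__":
    big_file = "const x: number = 1;\n" * 5_000
    print(len(head_window(big_file)), len(random_window(big_file)))
```

Passing anything longer than the limit only costs time and memory, since the extra characters are never read by the model.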