yoeo / guesslang

Detect the programming language of a source code
https://guesslang.readthedocs.io
MIT License
773 stars, 110 forks

Support 24 more languages, including JSON, Kotlin, XML, YAML etc... #33

Closed by yoeo 2 years ago

yoeo commented 3 years ago

Support the following languages:

Prediction accuracy is 92.59%, but the training and test datasets were not well balanced because some languages lacked files. There were also errors in the Pascal dataset.

yoeo commented 3 years ago

Prediction results with 167k test files: [image]

TylerLeonhardt commented 2 years ago

This is great @yoeo! I did notice some decrease in confidence for Java. The following snippet used to have over 60% confidence:

public class PositiveNegative {

    public static void main(String[] args) {

        double number = 12.3;

        // true if number is less than 0
        if (number < 0.0)
            System.out.println(number + " is a negative number.");

        // true if number is greater than 0
        else if ( number > 0.0)
            System.out.println(number + " is a positive number.");

        // if both test expression is evaluated to false
        else
            System.out.println(number + " is 0.");
    }
}

but using this branch, it's down to 20% confident it's Java. My guess is that the introduction of Groovy hurt the confidence?

yoeo commented 2 years ago

Nice catch @TylerLeonhardt. You're probably right about the effects of Groovy support on Java detection.

This model is still "work in progress" and I hope that training it with more examples and for a longer time will help improve its predictions.

TylerLeonhardt commented 2 years ago

@yoeo the JSON and YAML predictions were great, btw. Such a game changer :)

I hope to have this in a VS Code Insider release either this week or next. Exciting times!

yoeo commented 2 years ago

Hi, I updated the model. It now uses a much more balanced and cleaner dataset. It also supports even more languages than before (44 → 53 languages). :warning: But this model is barely trained :warning: I still need to train it for many hours and maybe tweak it a little to improve its accuracy before merging it.

[image]

yoeo commented 2 years ago

@TylerLeonhardt

I investigated the confidence drop that you noticed. Indeed, adding more languages hurts the prediction confidence. Fortunately, the model still assigns the highest probability to the correct language 91% of the time.

For example, here are box plots of the probabilities that I got by testing 5k Java files:

We can see that the addition of Groovy and Dart hurts Java detection confidence, but almost all the time the files are still correctly detected as Java files.
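The distinction drawn here, between how often the right language gets the highest probability and how high that probability is, can be made concrete with a small sketch. The function name and the sample probabilities below are made up for illustration; they are not taken from the model or from the plots.

```python
from statistics import median


def top1_accuracy_and_confidence(predictions, expected):
    """Summarize predictions for files that are all written in `expected`.

    `predictions` is a list of dicts mapping language name -> probability.
    Returns (share of files where `expected` ranks first,
             median probability assigned to `expected`).
    """
    hits = 0
    confidences = []
    for probs in predictions:
        best = max(probs, key=probs.get)  # language with the highest probability
        hits += best == expected
        confidences.append(probs[expected])
    return hits / len(predictions), median(confidences)


# Hypothetical probabilities for three Java files (illustrative values only).
samples = [
    {"Java": 0.42, "Groovy": 0.25, "Dart": 0.10},
    {"Java": 0.35, "Groovy": 0.30, "Dart": 0.05},
    {"Java": 0.20, "Groovy": 0.45, "Dart": 0.05},
]
accuracy, median_confidence = top1_accuracy_and_confidence(samples, "Java")
print(f"top-1 accuracy: {accuracy:.2f}, median confidence: {median_confidence:.2f}")
```

In this toy data the median confidence is low even though Java still ranks first for most files, which is the same pattern described above.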

The probability plots for all the languages are available here:

TylerLeonhardt commented 2 years ago

@yoeo this is amazing work! I was just thinking yesterday that rather than saying "confidence over 60% is the winner" it should instead be relative to every other confidence.

For example: 30% Java and <1% everything else means it's probably Java.

I don't know if 30%/1% is the best pair of numbers...but I'll give it a go. I'm open to suggestions from you since you're the expert 😃
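A minimal sketch of the relative rule proposed here, assuming the probabilities arrive as a name-to-probability mapping. The function name `clear_winner` and both default thresholds are hypothetical; the 30% and 1% values are only the pair floated in this comment, not tuned numbers.

```python
def clear_winner(probabilities, min_top=0.30, max_runner_up=0.01):
    """Return the top language if it clearly beats every other one, else None.

    `probabilities` maps language name -> probability.
    `min_top` and `max_runner_up` are illustrative thresholds (assumptions).
    """
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    if not ranked:
        return None
    best_name, best_prob = ranked[0]
    # With a single candidate there is no runner-up to compare against.
    second_prob = ranked[1][1] if len(ranked) > 1 else 0.0
    if best_prob >= min_top and second_prob < max_runner_up:
        return best_name
    return None


# Made-up probabilities for illustration.
print(clear_winner({"Java": 0.30, "Groovy": 0.005, "C#": 0.004}))
print(clear_winner({"Java": 0.30, "Groovy": 0.25}))
```

The first call returns the top language because every other candidate is below 1%; the second returns `None` because the runner-up is too close.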

yoeo commented 2 years ago

Hi @TylerLeonhardt

The model is now fully trained. Its overall accuracy is pretty good at ~93.5% (the original model's accuracy was ~93.8%). The confidence scores increased a bit compared to the barely trained model that I pushed earlier. For example, your sample code is now detected with ~41% confidence:

echo "public class PositiveNegative {
....
}" | guesslang --probabilities
Language name       Probability
 Java                 41.63%
 Groovy               24.83%
 C#                    6.17%
 ...

I'm pretty happy with these results and I'll merge this PR after updating the documentation.


> I was just thinking yesterday that rather than saying "confidence over 60% is the winner" it should instead be relative to every other confidence. For example: 30% Java and <1% everything else means it's probably Java.

You're perfectly right, I think. In fact, I use a variant of this solution to check whether there is a clear winner or not: https://github.com/yoeo/guesslang/blob/cbc441d6a3c5512217b503844cb4cd62b3664e39/guesslang/guess.py#L160-L168

And to be honest, I stole the whole thing from Wikipedia https://en.wikipedia.org/wiki/68%E2%80%9395%E2%80%9399.7_rule :slightly_smiling_face:
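For readers who do not follow the link, here is a rough sketch of what a check based on the 68-95-99.7 rule could look like. It assumes the idea is "the top probability must sit more than a couple of standard deviations above the mean of all probabilities"; the function name, the `sigmas` parameter, and its default are assumptions, and the linked lines may differ in detail.

```python
from statistics import mean, pstdev


def is_reliable(probabilities, sigmas=2.0):
    """Return True if the highest probability stands out from the rest.

    `probabilities` is a list with one probability per supported language.
    `sigmas` is an assumed cutoff, in population standard deviations.
    """
    threshold = mean(probabilities) + sigmas * pstdev(probabilities)
    return max(probabilities) > threshold


# Made-up distributions over ten languages, for illustration only.
dominant = [0.90] + [0.10 / 9] * 9  # one language far ahead of the others
flat = [0.10] * 10                  # no language stands out

print(is_reliable(dominant))
print(is_reliable(flat))
```

One property of this kind of rule: with the population standard deviation, the largest possible z-score among n values is sqrt(n - 1), so a 2-sigma cutoff can only be met when there are more than five candidate languages. That is not a concern with dozens of supported languages.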

Thanks.

TylerLeonhardt commented 2 years ago

> And to be honest, I stole the whole thing from Wikipedia https://en.wikipedia.org/wiki/68%E2%80%9395%E2%80%9399.7_rule 🙂

😁 interesting! Thanks for sharing. I think I'll try to make sure my solution aligns with that and with what you're already doing.

Excited to see this change go in!