yoeo closed 2 years ago
Prediction results with 167k test files:
This is great @yoeo! I did notice some decrease in confidence for Java. The following snippet used to have over 60% confidence:
public class PositiveNegative {
    public static void main(String[] args) {
        double number = 12.3;
        // true if number is less than 0
        if (number < 0.0)
            System.out.println(number + " is a negative number.");
        // true if number is greater than 0
        else if ( number > 0.0)
            System.out.println(number + " is a positive number.");
        // if both test expression is evaluated to false
        else
            System.out.println(number + " is 0.");
    }
}
but using this branch, it's down to 20% confident it's Java. My guess is that the introduction of Groovy hurt the confidence?
Nice catch @TylerLeonhardt. You're probably right about the effects of Groovy support on Java detection.
This model is still "work in progress" and I hope that training it with more examples and for a longer time will help improve its predictions.
@yoeo the JSON and YAML predictions were great, btw. Such a game changer :)
I hope to have this in a VS Code Insider release either this week or next. Exciting times!
Hi, I updated the model. It now uses a way more balanced and clean dataset. It also supports even more languages than before (44 → 53 languages). ⚠️ But this model is barely trained ⚠️ I still need to train it for many hours and maybe tweak it a little to improve its accuracy before merging it.
@TylerLeonhardt
I investigated the confidence drop that you noticed. Indeed, adding more languages hurts the prediction confidence. Fortunately, the model still assigns the highest probability value to the correct language 91% of the time.
For example, here are box plots of the probabilities that I got by testing 5k Java files:
using the model that is on the main branch
and using the model that is on this PR
We can see that the addition of Groovy and Dart hurts Java detection confidence, but almost all the time the files are still correctly detected as Java files.
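To make the distinction between confidence and correctness concrete, here is a minimal sketch that measures both on a set of labelled predictions. The `summarize` helper and the sample probabilities are hypothetical and only illustrate the idea; they are not part of guesslang and are not the numbers behind the plots.

```python
from statistics import median


def summarize(samples):
    """Return (top-1 accuracy, median probability given to the true language).

    `samples` is a list of (true_language, {language: probability}) pairs.
    This helper is hypothetical and is not part of guesslang.
    """
    hits = 0
    true_label_probs = []
    for true_language, probabilities in samples:
        # The predicted language is the one with the highest probability.
        predicted = max(probabilities, key=probabilities.get)
        hits += predicted == true_language
        true_label_probs.append(probabilities.get(true_language, 0.0))
    return hits / len(samples), median(true_label_probs)


# Made-up predictions for two Java files, before and after adding a close
# neighbour such as Groovy: the confidence drops, the top-1 result does not.
before = [("Java", {"Java": 0.9, "C#": 0.1})] * 2
after = [("Java", {"Java": 0.45, "Groovy": 0.35, "C#": 0.2})] * 2

print(summarize(before))
print(summarize(after))
```

In this made-up case the accuracy is the same for both models while the median probability assigned to Java is halved, which is the pattern described above.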
The probability plots for all the languages are available here:
@yoeo this is amazing work! I was just thinking yesterday that rather than saying "confidence over 60% is the winner" it should instead be relative to every other confidence.
For example: 30% Java and <1% everything else means it's probably Java.
I don't know if 30%/1% is the best pair of numbers...but I'll give it a go. I'm open to suggestions from you since you're the expert 😃
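A minimal sketch of the relative rule described above, assuming the 30%/1% pair as default thresholds. The function name `pick_relative_winner` and its parameters are hypothetical; this is neither VS Code's nor guesslang's implementation.

```python
def pick_relative_winner(probabilities, min_top=0.30, max_runner_up=0.01):
    """Return the top language if it clearly dominates, otherwise None.

    `probabilities` maps language names to probabilities in [0, 1].
    The default thresholds are the 30% / 1% pair floated in the discussion
    and are assumptions, not tuned values.
    """
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    if not ranked:
        return None
    best_language, best_probability = ranked[0]
    # Only the runner-up matters: if it is below the limit, so is everything else.
    runner_up_probability = ranked[1][1] if len(ranked) > 1 else 0.0
    if best_probability >= min_top and runner_up_probability < max_runner_up:
        return best_language
    return None


# Made-up probabilities: a modest top score, but nothing else comes close.
print(pick_relative_winner({"Java": 0.30, "Groovy": 0.005, "C#": 0.004}))
```

With these defaults, a result where a second language also scores well (for example Java at 40% and Groovy at 25%) would not count as a clear winner, so the pair of thresholds may need loosening in practice.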
Hi @TylerLeonhardt
The model is now fully trained. Its overall accuracy is pretty good, ~93.5% (the original model accuracy was ~93.8%). The confidence scores increased a bit compared to the untrained model that I pushed earlier. For example, your sample code is now detected with ~41% confidence:
echo "public class PositiveNegative {
....
}" | guesslang --probabilities
Language name Probability
Java 41.63%
Groovy 24.83%
C# 6.17%
...
I'm pretty happy with these results and I'll merge this PR after updating the documentation.
I was just thinking yesterday that rather than saying "confidence over 60% is the winner" it should instead be relative to every other confidence. For example: 30% Java and <1% everything else means it's probably Java.
You're perfectly right I think. In fact I use a variant of this solution to check if there is a clear winner or not: https://github.com/yoeo/guesslang/blob/cbc441d6a3c5512217b503844cb4cd62b3664e39/guesslang/guess.py#L160-L168
And to be honest, I stole the whole thing from Wikipedia https://en.wikipedia.org/wiki/68%E2%80%9395%E2%80%9399.7_rule 🙂
Thanks.
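One way to read a 68–95–99.7-style check is sketched below: the top probability counts as a clear winner only if it lies more than two standard deviations above the mean of all probabilities. The two-sigma cutoff, the `is_clear_winner` name, and the sample values are assumptions for illustration; the linked `guess.py` lines remain the authoritative version.

```python
from statistics import fmean, pstdev


def is_clear_winner(probabilities, n_sigma=2.0):
    """Return True if the top probability is an outlier among all of them.

    `probabilities` is a list with one probability per supported language.
    The rule (max > mean + n_sigma * population standard deviation) is an
    assumed reading of the 68-95-99.7 idea, not a copy of guesslang's code.
    """
    threshold = fmean(probabilities) + n_sigma * pstdev(probabilities)
    return max(probabilities) > threshold


# Made-up distributions over 10 languages.
peaked = [0.91] + [0.01] * 9   # one language dominates
flat = [0.125] * 8             # every language is equally likely

print(is_clear_winner(peaked))
print(is_clear_winner(flat))
```

A rule like this only makes sense when there are many candidate languages: with fewer than six, no single value can exceed the mean by two population standard deviations, so the check would never pass.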
And to be honest, I stole the whole thing from Wikipedia https://en.wikipedia.org/wiki/68%E2%80%9395%E2%80%9399.7_rule 🙂
😁 interesting! Thanks for sharing. I think I'll try to make sure my solution aligns with that and with what you're already doing.
Excited to see this change go in!
Support the following languages:
Prediction accuracy is 92.59%, but the training and test datasets were not well balanced due to a lack of files for some languages, and there were errors in the Pascal dataset.