yoeo / guesslang

Detect the programming language of a source code
https://guesslang.readthedocs.io
MIT License
Confidence Score #14

Closed: malahmadi1 closed this issue 4 years ago

malahmadi1 commented 4 years ago

Can I obtain a confidence score for each guess?

Thanks, Mohammad

yoeo commented 4 years ago

Hello @malahmadi1

Yes.

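You can now call guess.probabilities(source_code), which returns the probability for each language as a list of (language, probability) pairs. In the example below the list is ordered from most to least probable, so the probability of the first entry can be used as a confidence score.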

Example:

from pprint import pprint
from guesslang import Guess

guess = Guess()

source_code = """
    % Quick sort

    -module (recursion).
    -export ([qsort/1]).

    qsort([]) -> [];
    qsort([Pivot|T]) ->
          qsort([X || X <- T, X < Pivot])
          ++ [Pivot] ++
          qsort([X || X <- T, X >= Pivot]).
"""

probabilities = guess.probabilities(source_code)

pprint(probabilities)

# Prints the following list:
[('Erlang', 0.7835302948951721),
 ('Markdown', 0.04323587566614151),
 ('Haskell', 0.03866693750023842),
 ('R', 0.027196552604436874),
 ('Shell', 0.018274815753102303),
 ('TeX', 0.01470933761447668),
 ('Matlab', 0.011461317539215088),
 ('JavaScript', 0.011054938659071922),
 ('Scala', 0.00946285855025053),
 ('Perl', 0.004884811118245125),
 ('C++', 0.0037576162721961737),
 ('HTML', 0.003617567475885153),
 ('Rust', 0.0034391707740724087),
 ('Swift', 0.003167798975482583),
 ('Ruby', 0.0029752985574305058),
 ('C', 0.0025446831714361906),
 ('Objective-C', 0.0023764485958963633),
 ('Python', 0.0020462924148887396),
 ('CoffeeScript', 0.001939206849783659),
 ('Java', 0.0018487452762201428),
 ('Lua', 0.00134648394305259),
 ('Jupyter Notebook', 0.0012952813412994146),
 ('C#', 0.0011168558849021792),
 ('Go', 0.0011112005449831486),
 ('Batchfile', 0.00104153947904706),
 ('TypeScript', 0.0009122725459747016),
 ('PHP', 0.0008308574906550348),
 ('SQL', 0.000818912812974304),
 ('PowerShell', 0.0007574326591566205),
 ('CSS', 0.0005784708191640675)]
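
To turn that list into a single guess with a confidence score, something along these lines could work. This is only a sketch: the helper name top_guess and the 0.5 cutoff are hypothetical and not part of guesslang, and the sample values are copied from the output above. It uses max() rather than relying on the list being sorted.

```python
from typing import List, Optional, Tuple


def top_guess(
    probabilities: List[Tuple[str, float]],
    threshold: float = 0.5,  # hypothetical cutoff, tune for your own data
) -> Optional[Tuple[str, float]]:
    """Return (language, confidence) for the most probable language.

    Returns None when the list is empty or when the best probability
    is below the threshold.
    """
    if not probabilities:
        return None
    # Pick the pair with the highest probability, whatever the list order.
    language, confidence = max(probabilities, key=lambda item: item[1])
    if confidence < threshold:
        return None
    return language, confidence


# First three entries of the list printed above, standing in for
# the return value of guess.probabilities(source_code).
sample = [
    ("Erlang", 0.7835302948951721),
    ("Markdown", 0.04323587566614151),
    ("Haskell", 0.03866693750023842),
]

print(top_guess(sample))
print(top_guess(sample, threshold=0.9))
```

With the real library, sample would be replaced by the result of guess.probabilities(source_code). A None result then means no language reached the chosen confidence.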

Thank you for your feedback.