yoeo / guesslang

Detect the programming language of a source code
https://guesslang.readthedocs.io
MIT License

Detect compact JSON #42

Open supersonicclay opened 2 years ago

supersonicclay commented 2 years ago

Repro with latest VS Code.

Example:

{"name":"John", "age":30, "car":null}

Result: Plain text

Expected: JSON

I can also repro with a very large JSON file.

yoeo commented 2 years ago

Hello @supersonicclay, this code snippet is very short, and the model is not well suited to small snippets. More about that here: https://guesslang.readthedocs.io/en/latest/contents.html#limitations

I can also repro with a very large JSON file.

Could you provide an example file? That may help improve the model.

AndydeCleyre commented 2 years ago

In my project that uses guesslang (it works great!), I wrap it with a very simple fallback guesser based on just the first few characters:

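from typing import Optional

def guess_ext(self, code: str, probability_min: float = 0.12) -> Optional[str]:
    # Top guess from guesslang: a (language name, probability) pair.
    syntax, probability = self.guesser.probabilities(code)[0]
    ext = self.guesslang_syntaxes.get(syntax)

    if probability >= probability_min:
        return ext

    # Low confidence: fall back to matching the first few characters.
    # Order matters: '[[' is checked before '[', and '<?php' before '<'.
    for start, fallback_ext in {
        '{':     'json',
        '---\n': 'yaml',
        '[[':    'toml', '[': 'ini',
        '<?php': 'php',  '<': 'xml',
        '-- ':   'lua'
    }.items():
        if code.startswith(start):
            return fallback_ext
    return None
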
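
A standalone version of this prefix fallback might look like the following. It is only a minimal sketch, not code from guesslang or from the project mentioned above: `PREFIX_EXTENSIONS`, `looks_like_json`, and `guess_ext_by_prefix` are hypothetical names, and the `json.loads` check is an added assumption so that compact JSON such as the example in this issue is caught even when it starts with whitespace or is a top-level array.

```python
import json
from typing import Optional

# Hypothetical prefix table, taken from the snippet above.
# Order matters: '[[' must be tested before '[', and '<?php' before '<'.
PREFIX_EXTENSIONS = (
    ('{', 'json'),
    ('---\n', 'yaml'),
    ('[[', 'toml'),
    ('[', 'ini'),
    ('<?php', 'php'),
    ('<', 'xml'),
    ('-- ', 'lua'),
)


def looks_like_json(code: str) -> bool:
    """Return True if the whole text parses as a JSON object or array."""
    try:
        return isinstance(json.loads(code), (dict, list))
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False


def guess_ext_by_prefix(code: str) -> Optional[str]:
    """Guess a file extension from the start of the text, or return None."""
    if looks_like_json(code):
        return 'json'
    for start, ext in PREFIX_EXTENSIONS:
        if code.startswith(start):
            return ext
    return None


print(guess_ext_by_prefix('{"name":"John", "age":30, "car":null}'))
```

Parsing the whole text is slower than a prefix test on very large inputs, so this check fits best as a fallback that runs only when the model's probability is below the threshold.
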
supersonicclay commented 2 years ago

I can also repro with a very large JSON file.

Could you provide an example file? That may help improve the model.

I actually just tried it with a larger JSON snippet on a single line, and it was recognized as JSON. It just took about a second, and I think I was going too fast before and thought it wasn't being recognized.

yoeo commented 2 years ago

@AndydeCleyre

I wrap it with a very simple fallback guesser based on just the first few characters

That's a very good fallback idea.

Maybe I can try to increase the accuracy of Guesslang's machine learning model by making it pay more attention to patterns like the ones you defined :-)

yoeo commented 2 years ago

@supersonicclay

it was recognized as JSON

Nice.

It just took about a second

Yes, the prediction can take some time, especially when you're using the command line tool. However, you can use the Python API to make faster predictions:

# Set everything up. This can take a few seconds depending on your hardware,
# but you only have to do it once.
from guesslang import Guess
guess = Guess()

# Then run your predictions on the same Guess instance.
# Each prediction is computed quickly once the setup is done.
for code_snippet in my_code_snippets_list:
    result = guess.language_name(code_snippet)
    ...
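
To see whether the delay comes from the setup or from the predictions themselves, each call can be timed separately. The following is a minimal sketch: `time_calls` and `fake_predict` are hypothetical names, and the stand-in predictor is there only so the example runs without guesslang installed.

```python
import time
from typing import Callable, Iterable, List, Tuple


def time_calls(predict: Callable[[str], object],
               snippets: Iterable[str]) -> List[Tuple[object, float]]:
    """Call predict on each snippet; return (result, elapsed seconds) pairs."""
    timings = []
    for snippet in snippets:
        start = time.perf_counter()
        result = predict(snippet)
        timings.append((result, time.perf_counter() - start))
    return timings


# Stand-in predictor (an assumption for illustration, not guesslang's model).
def fake_predict(code: str) -> str:
    return 'JSON' if code.lstrip().startswith('{') else 'Plain text'


timings = time_calls(fake_predict, ['{"a": 1}', 'hello'])
for result, seconds in timings:
    print(f'{result}: {seconds * 1000:.3f} ms')
```

With guesslang installed, passing `guess.language_name` in place of `fake_predict` would time the real predictions, with the one-time `Guess()` setup left out of the measurement.
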