yoeo / guesslang

Detect the programming language of a source code
https://guesslang.readthedocs.io
MIT License

Dip in TS confidence #37

Open TylerLeonhardt opened 3 years ago

TylerLeonhardt commented 3 years ago

I'm not sure how much can be done here but I thought I'd start a discussion.

Here's a TypeScript snippet:

function makeThing(): Thing {
    let size = 0;
    return {
        get size(): number {
            return size;
        },
        set size(value: string | number | boolean) {
            let num = Number(value);
            // Don't allow NaN and stuff.
            if (!Number.isFinite(num)) {
                size = 0;
                return;
            }
            size = num;
        },
    };
}

Which yields the following confidences:

  { languageId: 'ts', confidence: 0.2830791473388672 },
  { languageId: 'rs', confidence: 0.10346205532550812 },
  { languageId: 'js', confidence: 0.09687642008066177 },
  { languageId: 'lua', confidence: 0.060949232429265976 },
  { languageId: 'cs', confidence: 0.052387114614248276 },
  { languageId: 'go', confidence: 0.04457798972725868 }
 // ...

Before #33, the confidence for TS was way over 40%. I'm currently saying "this file is a TS file if the model is at least 20% more confident than the next language" but unfortunately this fails here, since TS leads the runner-up (Rust) by only about 18 points.
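
A minimal sketch of the margin rule described above, assuming "20% more confident" means an absolute gap of 0.20 between the top two scores (the reading consistent with the check failing on this output). The `Prediction` interface and `pickLanguage` function are hypothetical names used for illustration; they are not part of guesslang or of any published wrapper.

```typescript
// Hypothetical shape of one prediction, mirroring the output quoted above.
interface Prediction {
  languageId: string;
  confidence: number;
}

// Returns the top language only if it leads the runner-up by at least
// `margin` (an absolute difference in confidence); otherwise undefined.
function pickLanguage(predictions: Prediction[], margin: number = 0.2): string | undefined {
  // Copy before sorting so the caller's array is left untouched.
  const sorted = predictions.slice().sort((a, b) => b.confidence - a.confidence);
  if (sorted.length === 0) {
    return undefined;
  }
  const runnerUp = sorted.length > 1 ? sorted[1].confidence : 0;
  return sorted[0].confidence - runnerUp >= margin ? sorted[0].languageId : undefined;
}

// Top three entries from the output quoted above.
const results: Prediction[] = [
  { languageId: 'ts', confidence: 0.2830791473388672 },
  { languageId: 'rs', confidence: 0.10346205532550812 },
  { languageId: 'js', confidence: 0.09687642008066177 },
];

// The lead is roughly 0.18, so a 0.20 margin rejects the guess.
console.log(pickLanguage(results));       // prints: undefined
console.log(pickLanguage(results, 0.15)); // prints: ts
```

Lowering the margin accepts this file, at the cost of accepting weaker guesses elsewhere, which is the trade-off in question.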

I'm a bit nervous to drop that 20% down any further...

Also interesting that Rust beat out JS...

yoeo commented 3 years ago

Just before the merge, I updated the model in the PR with a better-trained one (https://github.com/yoeo/guesslang/pull/33/commits/198352a0027199f29c995afe9db5a66dd9403e99). Maybe you're using the model that was pushed just before this change.

If that's the case, you can now find the latest model in the main branch (https://github.com/yoeo/guesslang/tree/master/guesslang/data/model). It is still not as precise as the 30-language model, but it produces better results with your TypeScript example:

 ✓ echo $'function makeThing(): Thing {
        let size = 0;
        return {
                get size(): number {
                    return size;
                },
                set size(value: string | number | boolean) {
                    let num = Number(value);
                    // Don\'t allow NaN and stuff.
                    if (!Number.isFinite(num)) {
                        size = 0;
                        return;
                    }
                    size = num;
                },
        };
}' | guesslang -p
Language name       Probability
 TypeScript           32.21%
 JavaScript            9.24%
 Rust                  7.14%
 C#                    5.80%
 C                     4.71%
 Lua                   4.59%

yoeo commented 3 years ago

@TylerLeonhardt, just out of curiosity, how do you convert the model to TensorFlow.js?

I tried with the current stable version of tensorflowjs, with no special options, and got an error:

tensorflowjs_converter --input_format=tf_saved_model ./guesslang/data/model /tmp/web_model

...
E tensorflow/core/grappler/grappler_item_builder.cc:669] Init node head/predictions/class_string_lookup/table_init/LookupTableImportV2 doesn't exist in graph
...
Instructions for updating:
Use `tf.compat.v1.graph_util.extract_sub_graph`
Traceback (most recent call last):
  File ".../bin/tensorflowjs_converter", line 8, in <module>
    sys.exit(pip_main())
  File ".../lib64/python3.9/site-packages/tensorflowjs/converters/converter.py", line 813, in pip_main
    main([' '.join(sys.argv[1:])])
  File ".../lib64/python3.9/site-packages/tensorflowjs/converters/converter.py", line 817, in main
    convert(argv[0].split(' '))
  File ".../lib64/python3.9/site-packages/tensorflowjs/converters/converter.py", line 803, in convert
    _dispatch_converter(input_format, output_format, args, quantization_dtype_map,
  File ".../lib64/python3.9/site-packages/tensorflowjs/converters/converter.py", line 523, in _dispatch_converter
    tf_saved_model_conversion_v2.convert_tf_saved_model(
  File ".../lib64/python3.9/site-packages/tensorflowjs/converters/tf_saved_model_conversion_v2.py", line 683, in convert_tf_saved_model
    optimize_graph(frozen_graph, signature,
  File ".../lib64/python3.9/site-packages/tensorflowjs/converters/tf_saved_model_conversion_v2.py", line 153, in optimize_graph
    raise ValueError('Unsupported Ops in the model before optimization\n' +
ValueError: Unsupported Ops in the model before optimization
OptionalNone, ReadVariableOp, OptionalFromValue

TylerLeonhardt commented 3 years ago

@pyu10055 gave me this pointer in https://github.com/tensorflow/tfjs/issues/4838#issuecomment-866416464

tensorflowjs_converter --input_format=tf_saved_model --skip_op_check model web_model

TylerLeonhardt commented 3 years ago

> Maybe you're using the model that was pushed just before this change.

Hmm, I grabbed https://github.com/yoeo/guesslang/tree/master/guesslang/data/model this morning, so I'm fairly certain I have the correct one... I wonder if some confidence is lost during the conversion to the tfjs model 🤔
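
One way to put a number on that suspicion is to compare the two sets of scores language by language. The sketch below uses the figures quoted in this thread, rounded to four decimals; `Scores` and `compareScores` are hypothetical names, and the mapping from ids to the CLI's language names (ts = TypeScript, rs = Rust, and so on) is an assumption. The two runs were not fed byte-identical input (the indentation of the snippet differs), so the gaps are only indicative.

```typescript
// Hypothetical container: language id -> confidence in the range 0..1.
type Scores = Record<string, number>;

// Absolute per-language difference between two score sets, largest first.
// Only languages present in both sets are compared.
function compareScores(a: Scores, b: Scores): Array<[string, number]> {
  return Object.keys(a)
    .filter((id) => id in b)
    .map((id): [string, number] => [id, Math.abs(a[id] - b[id])])
    .sort((x, y) => y[1] - x[1]);
}

// Scores reported from the converted (tfjs) model earlier in the thread.
const tfjsScores: Scores = { ts: 0.2831, rs: 0.1035, js: 0.0969, lua: 0.0609, cs: 0.0524 };
// Scores reported by the guesslang CLI (percentages divided by 100).
const cliScores: Scores = { ts: 0.3221, js: 0.0924, rs: 0.0714, cs: 0.0580, lua: 0.0459 };

for (const [id, delta] of compareScores(tfjsScores, cliScores)) {
  console.log(id, delta.toFixed(4));
}
```

Feeding the exact same bytes to both models would be needed before attributing any gap to the conversion itself.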

yoeo commented 3 years ago

Cool, the conversion now works.

I can see that the converter prints messages about various optimisations. I especially suspect that the int64 to int32 conversion has an impact on the model's accuracy.

TylerLeonhardt commented 3 years ago

> I especially suspect that the int64 to int32 conversion has an impact on the model's accuracy.

Maybe @pyu10055 has guidance here? Or perhaps @dynamicwebpaige?

pyu10055 commented 3 years ago

@TylerLeonhardt @yoeo We do convert int64 to int32, but if I understand correctly those values are not weight-related; most of them are ids. The missing-ops errors can be ignored, since they probably come from training functions that are not used in the inference graph.
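
To illustrate why narrowing ids would not change them, here is a small sketch. It does not reproduce what the converter does internally; it only shows that integers inside the signed 32-bit range survive being stored in a 32-bit slot, while larger values do not. `fitsInInt32` and `narrowToInt32` are hypothetical helpers, and the sample ids are made up.

```typescript
// True when `value` is an integer that a signed 32-bit slot can hold exactly.
function fitsInInt32(value: number): boolean {
  return Math.floor(value) === value && value >= -2147483648 && value <= 2147483647;
}

// Store a value in a signed 32-bit slot and read it back.
function narrowToInt32(value: number): number {
  return new Int32Array([value])[0];
}

// Made-up ids of the kind used for class indices or lookup keys.
const smallIds = [0, 1, 53];
console.log(smallIds.every((id) => narrowToInt32(id) === id)); // prints: true

// A value just past the int32 maximum wraps around instead of surviving.
const tooBig = 2147483648;
console.log(fitsInInt32(tooBig), narrowToInt32(tooBig) === tooBig); // prints: false false
```

Floating-point weights are a separate matter and are not touched by this kind of integer narrowing.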

yoeo commented 3 years ago

> The missing-ops errors can be ignored, since they probably come from training functions that are not used in the inference graph.

@pyu10055 OK.

During the training phase I do use I/O functions to read and process the examples and, as you spotted, these functions are not used for inference.