yoeo / guesslang

Detect the programming language of a source code
https://guesslang.readthedocs.io
MIT License
770 stars 110 forks

Feature engineering #48

Open fsx950223 opened 2 years ago

fsx950223 commented 2 years ago

I have a question about feature engineering. Why do you use chars as inputs instead of words? For example,

Hello world!
<tf.Tensor: shape=(12,), dtype=string, numpy=
array([b'H', b'e', b'l', b'l', b'o', b' ', b'w', b'o', b'r', b'l', b'd',
       b'!'], dtype=object)>
ngrams: <tf.Tensor: shape=(11,), dtype=string, numpy=
array([b'H e', b'e l', b'l l', b'l o', b'o  ', b'  w', b'w o', b'o r',
       b'r l', b'l d', b'd !'], dtype=object)>

is better than

<tf.Tensor: shape=(3,), dtype=string, numpy=array([b'Hello', b'world', b'!'], dtype=object)>
ngrams: <tf.Tensor: shape=(2,), dtype=string, numpy=array([b'Hello world', b'world !'], dtype=object)>

?
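To make the comparison concrete, here is a minimal pure-Python sketch that reproduces both tokenizations of `Hello world!` without TensorFlow. The helper name `ngrams` and the regular expression used to split words are assumptions made for illustration; they are not taken from guesslang.

```python
import re
from typing import List


def ngrams(tokens: List[str], width: int = 2, separator: str = " ") -> List[str]:
    """Join each run of `width` consecutive tokens with `separator`."""
    return [
        separator.join(tokens[i:i + width])
        for i in range(len(tokens) - width + 1)
    ]


text = "Hello world!"

# Character-level tokens: every character, including the space, is a token.
chars = list(text)
char_bigrams = ngrams(chars)

# Word-level tokens: runs of word characters, plus each punctuation mark.
# This splitting rule is an assumption chosen to match the output above.
words = re.findall(r"\w+|[^\w\s]", text)
word_bigrams = ngrams(words)

print(len(chars), len(char_bigrams))  # 12 11
print(words, word_bigrams)
```

The same 12-character input yields 11 character bigrams but only 2 word bigrams, so the two choices give the model very different numbers of features per snippet.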

fsx950223 commented 2 years ago

In order to use a TFLite model, you have to convert strings to token IDs, such as 'He' -> 1332. @yoeo

yoeo commented 2 years ago

Hi @fsx950223

Why do you use chars as inputs instead of words?

In fact, I tested both chars and words with various preprocessing tricks and chose the option that gave the best predictions with the current model and training dataset. If I ever switch to a new machine learning model or change how I build the training dataset, I'll have to test the preprocessing options again and pick the best one, and it could be "words" that time.

By the way, if you know any general rule about when to use chars or words for feature engineering, I'll be happy to learn and test it :slightly_smiling_face:

yoeo commented 2 years ago

In order to use a TFLite model, you have to convert strings to token IDs, such as 'He' -> 1332.

In theory yes. You probably could use tflite by:

  1. hacking the trained model to take integer inputs instead of string ones
  2. extracting the string -> integer mappings from the model
  3. converting the hacked trained model (without the mappings) to tflite
  4. using the extracted mappings to convert your input strings into integer inputs
  5. sending the integer inputs to the new tflite model to generate predictions

I don't know if it will actually work, but if you find a way to make it work, please share the details here: https://github.com/yoeo/guesslang/issues/26
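Steps 2 and 4 of the list above could look roughly like the following pure-Python sketch. It assumes the string -> integer mapping is a plain vocabulary dictionary over character bigrams; the real model may store its mapping differently (for example, by hashing), so `build_mapping`, `encode`, `PAD_ID`, and `UNKNOWN_ID` are all hypothetical names used only for illustration.

```python
from typing import Dict, List

PAD_ID = 0      # hypothetical id used to pad short inputs
UNKNOWN_ID = 1  # hypothetical id for bigrams missing from the mapping


def char_bigrams(text: str) -> List[str]:
    """Space-joined pairs of adjacent characters, e.g. 'He' -> 'H e'."""
    return [f"{a} {b}" for a, b in zip(text, text[1:])]


def build_mapping(corpus: List[str]) -> Dict[str, int]:
    """Stand-in for step 2: here the mapping is built from a corpus,
    whereas in practice it would be extracted from the trained model."""
    mapping: Dict[str, int] = {}
    for text in corpus:
        for gram in char_bigrams(text):
            # Ids 0 and 1 are reserved, so real entries start at 2.
            mapping.setdefault(gram, len(mapping) + 2)
    return mapping


def encode(text: str, mapping: Dict[str, int], length: int) -> List[int]:
    """Step 4: turn a string into a fixed-length list of integer ids."""
    ids = [mapping.get(gram, UNKNOWN_ID) for gram in char_bigrams(text)]
    ids = ids[:length]
    return ids + [PAD_ID] * (length - len(ids))


mapping = build_mapping(["Hello"])
encoded = encode("Help", mapping, length=6)
print(encoded)  # [2, 3, 1, 0, 0, 0]
```

The resulting integer list is what would be passed to the converted model in step 5; that part is not shown because it depends on how the model is actually modified and converted.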

fsx950223 commented 2 years ago

To improve model performance, I recommend tf.keras.layers.TextVectorization plus a FastText-style model, which is similar to the current model. For more details, take a look at https://www.tensorflow.org/text/guide/word_embeddings
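As a rough illustration of that suggestion, the sketch below pairs `TextVectorization` with a small FastText-style classifier (embedding, average pooling, dense layer). It is not guesslang's model: the two sample snippets, the labels, the layer sizes, and the choice of character-level splitting are all assumptions, and it requires a recent TensorFlow 2 release that supports `split="character"`.

```python
import tensorflow as tf

# Toy data, assumed for illustration only: label 0 = Python, 1 = JavaScript.
samples = tf.constant(["print('Hello world!')", "console.log('Hello world!');"])
labels = tf.constant([0, 1])

# Map strings to integer ids. With ngrams=2 the layer emits both single
# characters and character bigrams.
vectorizer = tf.keras.layers.TextVectorization(
    standardize=None,          # keep case and punctuation, which matter in code
    split="character",
    ngrams=2,
    output_mode="int",
    output_sequence_length=32,
)
vectorizer.adapt(samples)
ids = vectorizer(samples)      # integer tensor of shape (2, 32)

# FastText-style classifier: average the token embeddings, then classify.
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vectorizer.vocabulary_size(), 16),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(2, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.fit(ids, labels, epochs=5, verbose=0)

probs = model.predict(ids, verbose=0)  # one row of class probabilities per sample
```

Because the vectorizer is applied outside the model here, the model itself only ever sees integer inputs, which is the property the TFLite discussion earlier in this thread is after.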