xigt / lgid

language identification of linguistic examples
MIT License

Getting the best language match from predictions #6

Open goodmami opened 7 years ago

goodmami commented 7 years ago

Currently (when the code works), it only returns the True/False prediction and its score (as model.Distribution objects). It may be the case that more than one of the languages, or none, is chosen as True. The score of the prediction should be used to rank the list of languages for a span, and the top-ranked language should then be used as the final prediction.

MackieBlackburn commented 7 years ago

Should this be done by modifying the test() function in main.py?

goodmami commented 7 years ago

Yeah I suppose. Here's the relevant code block in the test() function:

for dist in model.test(instances):
    print(dir(dist))
    print(dist.classes())

You could write a function to normalize the values (e.g. set the one with the highest confidence of a False value to 0, the highest confidence of a True value to 1, and scale everything else accordingly). Then replace the code block above with something like:

ranked_list = normalize_probabilities(model.test(instances))
if len(ranked_list) != 0:
    top = ranked_list[0]
    ...
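A possible sketch of `normalize_probabilities` along those lines. The pair-based `(language, score)` interface here is an invented simplification of the real Distribution objects, just to show the rescaling and ranking:

```python
def normalize_probabilities(dists):
    """Rescale per-language scores to [0, 1] and rank them.

    `dists` is assumed to be a list of (language, score) pairs, where a
    higher score means more confidence in a True prediction; this shape
    is a stand-in for the actual model.Distribution objects.
    """
    if not dists:
        return []
    scores = [score for _, score in dists]
    lo, hi = min(scores), max(scores)
    span = hi - lo or 1.0  # avoid division by zero when all scores tie
    normalized = [(lang, (score - lo) / span) for lang, score in dists]
    # highest normalized score first, so ranked_list[0] is the top pick
    return sorted(normalized, key=lambda pair: pair[1], reverse=True)
```

With that, `normalize_probabilities([("fra", 0.2), ("deu", 0.9), ("eng", 0.4)])` puts `("deu", 1.0)` first.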
MackieBlackburn commented 7 years ago

Reviewing the code, it looks like the model.test() function returns a Distribution object, which contains a dictionary mapping classes to probabilities. Each Distribution object also has a best_class field, so if I'm not mistaken this issue might be solved by doing

for dist in model.test(instances):
    print(dir(dist))
    top = dist.best_class

I can put some normalization code into the Distribution class to make sure the probabilities are normalized.

goodmami commented 7 years ago

Hmm, possibly. I didn't write models.py, but I thought it returned a distribution for each language, and the classes were True and False, so if something had a high probability for False, best_class would return False for the language that distribution was made for.

I could be wrong though.
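If that reading is right, each distribution is a binary True/False judgment for a single language, so `best_class` alone can't compare languages against each other; the comparison would instead be on each distribution's probability of True. A minimal sketch, assuming the dict-of-probabilities shape described above (the `{language: {True: p, False: 1-p}}` layout is an assumption, not the actual models.py API):

```python
def pick_language(dists_by_language):
    """Pick the language whose distribution gives True the highest
    probability.

    `dists_by_language` is assumed to map each language to a
    {True: p, False: 1 - p} dict, one binary distribution per language.
    """
    return max(dists_by_language,
               key=lambda lang: dists_by_language[lang][True])
```

Here `pick_language({"eng": {True: 0.3, False: 0.7}, "deu": {True: 0.8, False: 0.2}})` would choose `"deu"` even though `best_class` is True for neither or both languages in other cases.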