What is the bug mentioned in the Colab notebook?

Manuel-Medina commented 2 years ago

Hello, First of all, great job with the library! Congratulations! It is really useful and easy to use.

I was checking the Colab Notebook and saw this comment:

# we can loop through and look at each pairwise comparison to see whats going on
# except we can't trust the probabilities order, because of a bug in fitbert

Could you detail what the problem is, please? Does it happen only when using with_prob=True?

sam-writer commented 2 years ago

glad it is useful!

It has been a while, but IIRC the bug is that the probabilities are sorted before being returned

Manuel-Medina commented 2 years ago

Thank you for your reply!

I see. Does this mean that, in the current state, fitbert is not reliable to get the probabilities of candidates?

Is this bug something you plan to address?

sam-writer commented 2 years ago

it's possible the bug isn't there anymore. I looked at the code and don't see anything suspicious, but it has been a while. I would test it and see if they make sense. I do plan on rewriting fitbert, but it likely won't happen for... a while, sorry.

Manuel-Medina commented 2 years ago

I've taken a look at the code and found the places were the words are ranked:

rank_single

            ranked_pairs = (
                seq(words_ids)
                .map(lambda x: float(probs[0][target_idx][x].item()))
                .zip(words)
                .sorted(key=lambda x: x[0], reverse=True)
            )

rank_multi

        ranked_pairs = (
            seq(options)
            .map(lambda x: masked_sent.replace(self.mask_token, x))
            .map(lambda x: self._get_sentence_probability(x))
            .zip(options)
            .sorted(key=lambda x: x[0], reverse=True)
        )

If I'm reading this correctly, you get the probabilities for each word, create a list of tuples (prob, word) and sort it at the end by probability in descending order. This seems to me the correct flow:

Say you have

options = ['a', 'an']
masked_sent = 'I am ***mask*** student'

Then each option is replaced in the masked sentence, so you have a sequence like

['I am a student', 'I am an student']

Probabilities are retrieved for each sentence, say

[0.02343, 0.000000123]

zipping [(0.02343, 'a'), (0.000000123, 'an')]

and sorting by the first element of the tuple, which in this case will give the same sequence as result.

Am I looking at the correct place?

Manuel-Medina commented 2 years ago

it's possible the bug isn't there anymore. I looked at the code and don't see anything suspicious, but it has been a while. I would test it and see if they make sense. I do plan on rewriting fitbert, but it likely won't happen for... a while, sorry.

No problem! Thank you for your efforts! I'm interested in the library, so if I can find the bug and address it, I would like to submit a pull request to fix it 😃

sam-writer commented 2 years ago

yup, you're looking in the right place and are thinking about it right. so... add some print statements and use an option that really doesn't fit and see if anything looks off?

Manuel-Medina commented 2 years ago

Great! I will check it then and report what I could find.

Thank you!

writer / fitbert

What is the bug mentioned in the Colab notebook? #24