This algorithm works by asking BERT for the likelihood of a token, given the surrounding tokens. But suppose the options being compared are `buff`, `handsome`, and `quite strong`.

Then we have an issue: the first two options are one token each, but the third is two. That means we will get one likelihood score for `buff` (call it `L(buff)`), another for `handsome` (`L(handsome)`), but two for `quite strong` (`L(quite)`, `L(strong)`).

While we can directly compare `L(buff)` to `L(handsome)`, what do we do with `L(quite)` and `L(strong)`?

On the lines around this one, we use `max` to make likelihoods comparable across different token lengths. This may be wrong.

For this ticket, try to come up with good test cases to see how `max` performs. Then, if `max` doesn't seem to work well, try other aggregation methods. If there is a theoretical basis for the choice, all the better.
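As a starting point for those test cases, here is a minimal sketch that ranks the three options under a few candidate aggregation methods. The per-token probabilities are made up for illustration and do not come from BERT, and the names `candidates`, `AGGREGATORS`, and `rank` are hypothetical, not taken from the existing code. The product treats the per-token scores as if they multiplied into a joint score, which tends to penalize longer options; the geometric mean is one common way to normalize that by length.

```python
import math
from statistics import mean

# Hypothetical per-token probabilities, invented for illustration only.
# A real test would obtain these from the model for each masked position.
candidates = {
    "buff": [0.020],
    "handsome": [0.050],
    "quite strong": [0.300, 0.004],  # L(quite), L(strong)
}


def geometric_mean(probs):
    """Length-normalized score: exp of the mean log-probability."""
    return math.exp(mean(math.log(p) for p in probs))


# Candidate aggregation methods to compare against the current `max`.
AGGREGATORS = {
    "max": max,
    "min": min,
    "product": math.prod,
    "geometric_mean": geometric_mean,
}


def rank(options, aggregate):
    """Return option names sorted from most to least likely under `aggregate`."""
    return sorted(options, key=lambda name: aggregate(options[name]), reverse=True)


if __name__ == "__main__":
    for label, aggregate in AGGREGATORS.items():
        print(f"{label:>15}: {rank(candidates, aggregate)}")
```

With these invented numbers, `max` ranks `quite strong` first purely because of one high-scoring token, even though its other token is very unlikely. That is the kind of case a test suite for this ticket would want to cover.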