Open LorenzBuehmann opened 1 year ago
I asked in Slack and got a response from @laura-ham:
I think this has to do with entity grouping. There is a parameter introduced to the huggingface model called “grouped_entities”, which can be set to True or False. My guess is that this NER module in Weaviate doesn’t make use of this parameter, which is why the entities are split. See here for more info: https://github.com/huggingface/transformers/pull/3957
I checked the code and @laura-ham is right. One has to explicitly enable `grouped_entities` during pipeline creation or, even better, set `aggregation_strategy` in the latest API.
That alone is not enough: with the latest version, 4.25.1, the response format changes slightly when grouping is enabled: https://github.com/huggingface/transformers/blob/main/src/transformers/pipelines/token_classification.py#L202-L203
So, instead of `entity`, the NER tag will be contained in the `entity_group` key.
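To make the two points above concrete, here is a minimal sketch, not the actual Weaviate module code. `build_ner_pipeline` and `ner_tag` are hypothetical helper names, and the model name is left as a parameter because I'm not assuming which model the module loads. The first helper passes `aggregation_strategy` at pipeline creation; the second reads the tag from whichever key is present, so it works with and without grouping.

```python
from typing import Any, Dict, Optional


def build_ner_pipeline(model_name: str, aggregation_strategy: str = "simple"):
    """Create a token-classification pipeline with entity grouping enabled."""
    # Imported lazily so ner_tag() below can be used without transformers installed.
    from transformers import pipeline

    return pipeline(
        "token-classification",
        model=model_name,
        aggregation_strategy=aggregation_strategy,
    )


def ner_tag(result: Dict[str, Any]) -> Optional[str]:
    """Return the NER tag from one pipeline result, grouped or not."""
    # Grouped results carry "entity_group"; ungrouped ones carry "entity".
    if "entity_group" in result:
        return result["entity_group"]
    return result.get("entity")
```

Reading the tag through a helper like this would keep the module's response handling independent of whether grouping is switched on.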
The question now is which aggregation strategy to use:
The strategy to fuse (or not) tokens based on the model prediction.
- "none" : Will simply not do any aggregation and simply return raw results from the model
- "simple" : Will attempt to group entities following the default schema. (A, B-TAG), (B, I-TAG), (C,
I-TAG), (D, B-TAG2) (E, B-TAG2) will end up being [{"word": ABC, "entity": "TAG"}, {"word": "D",
"entity": "TAG2"}, {"word": "E", "entity": "TAG2"}] Notice that two consecutive B tags will end up as
different entities. On word based languages, we might end up splitting words undesirably : Imagine
Microsoft being tagged as [{"word": "Micro", "entity": "ENTERPRISE"}, {"word": "soft", "entity":
"NAME"}]. Look for FIRST, MAX, AVERAGE for ways to mitigate that and disambiguate words (on languages
that support that meaning, which is basically tokens separated by a space). These mitigations will
only work on real words, "New york" might still be tagged with two different entities.
- "first" : (works only on word based models) Will use the `SIMPLE` strategy except that words, cannot
end up with different tags. Words will simply use the tag of the first token of the word when there
is ambiguity.
- "average" : (works only on word based models) Will use the `SIMPLE` strategy except that words,
cannot end up with different tags. scores will be averaged first across tokens, and then the maximum
label is applied.
- "max" : (works only on word based models) Will use the `SIMPLE` strategy except that words, cannot
end up with different tags. Word entity will simply be the token with the maximum score.
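For illustration, the "simple" strategy described above can be sketched in plain Python. This is a simplified approximation and not the transformers implementation: it works on `(word, label)` pairs, joins words by plain concatenation, ignores scores and offsets, and uses the `entity_group` key for the output. `split_tag` and `group_simple` are hypothetical names.

```python
from typing import Dict, List, Tuple


def split_tag(label: str) -> Tuple[str, str]:
    """Split a BIO label such as 'B-LOC' into ('B', 'LOC')."""
    if label.startswith("B-"):
        return "B", label[2:]
    if label.startswith("I-"):
        return "I", label[2:]
    # Labels without a prefix are treated as a continuation ("I").
    return "I", label


def group_simple(tokens: List[Tuple[str, str]]) -> List[Dict[str, str]]:
    """Group consecutive tokens that share a tag; a B- prefix starts a new group."""
    groups: List[Dict[str, str]] = []
    current_words: List[str] = []
    current_tag = None

    def flush() -> None:
        # Close the group in progress, if any.
        nonlocal current_words, current_tag
        if current_words:
            groups.append(
                {"word": "".join(current_words), "entity_group": current_tag}
            )
        current_words, current_tag = [], None

    for word, label in tokens:
        if label == "O":
            # Non-entity tokens end the current group and are dropped.
            flush()
            continue
        prefix, tag = split_tag(label)
        if current_words and tag == current_tag and prefix != "B":
            current_words.append(word)
        else:
            flush()
            current_words, current_tag = [word], tag
    flush()
    return groups
```

Applied to the example from the description, `group_simple([("A", "B-TAG"), ("B", "I-TAG"), ("C", "I-TAG"), ("D", "B-TAG2"), ("E", "B-TAG2")])` yields one `ABC` group tagged `TAG`, followed by separate `D` and `E` groups tagged `TAG2`, since two consecutive B tags are not merged.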
Hi, I'm using the latest Weaviate 1.16.6 with the NER module set up via Docker.
I'm getting somewhat unexpected results, which I would call "partial" tokens, e.g. for the input document
"Peaceful protest - India - Bengaluru"
it returns (relevant part only)
Putting the same string into the Hugging Face model UI at least renders "Bengaluru" as one single NER-tagged token.