weaviate / ner-transformers-models

The inference container for the Weaviate NER transformers module

NER returns words with padding chars '#' #7

Open LorenzBuehmann opened 1 year ago

LorenzBuehmann commented 1 year ago

Hi, I'm using the latest Weaviate 1.16.6 with the NER module configured via Docker:

ner-transformers:
    image: semitechnologies/ner-transformers:dbmdz-bert-large-cased-finetuned-conll03-english
    environment:
      ENABLE_CUDA: '0'

I'm getting somewhat unexpected results, what I would call "partial" tokens. For example, for the input document

"Peaceful protest - India - Bengaluru"

it returns (relevant part only):

{
                    "_additional": {
                        "tokens": [
                            {
                                "certainty": 0.996863603591919,
                                "endPosition": 24,
                                "entity": "I-LOC",
                                "property": "name",
                                "startPosition": 19,
                                "word": "India"
                            },
                            {
                                "certainty": 0.9994908571243286,
                                "endPosition": 33,
                                "entity": "I-LOC",
                                "property": "name",
                                "startPosition": 27,
                                "word": "Bengal"
                            },
                            {
                                "certainty": 0.9993565678596497,
                                "endPosition": 36,
                                "entity": "I-LOC",
                                "property": "name",
                                "startPosition": 33,
                                "word": "##uru"
                            }
                        ]
                    }
}

Putting the same string into the Hugging Face model UI, at least, renders "Bengaluru" as a single NER-tagged token.
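For reference, the tokens above come from the module's _additional { tokens } GraphQL field. A rough sketch of such a query with the v3 Python client follows; the class name Document and the filter arguments are assumptions based on this example and may differ from the actual schema:

import weaviate

client = weaviate.Client("http://localhost:8080")

# Assumed class/property names, for illustration only.
query = """
{
  Get {
    Document {
      name
      _additional {
        tokens(properties: ["name"], limit: 10) {
          certainty
          endPosition
          entity
          property
          startPosition
          word
        }
      }
    }
  }
}
"""

result = client.query.raw(query)
print(result)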

LorenzBuehmann commented 1 year ago

I asked in Slack and got a response from @laura-ham:

I think this has to do with entity grouping. There is a parameter introduced to the huggingface model called “grouped_entities”, which can be set to True or False. My guess is that this NER module in Weaviate doesn’t make use of this parameter, which is why the entities are split. See here for more info: https://github.com/huggingface/transformers/pull/3957

I checked the code and @laura-ham is right. One has to explicitly enable grouped_entities during pipeline creation, or, better, set the aggregation_strategy in the latest API. That alone is not enough: when going to the latest version 4.25.1, the response format changes slightly once grouping is enabled: https://github.com/huggingface/transformers/blob/main/src/transformers/pipelines/token_classification.py#L202-L203

So, instead of entity, the NER tag will be contained in the entity_group key.
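For illustration, a minimal sketch of the difference (not the module's actual code; the model name matches the image used above):

from transformers import pipeline

MODEL = "dbmdz/bert-large-cased-finetuned-conll03-english"

# Without grouping: raw sub-word tokens, NER tag under the "entity" key.
raw_ner = pipeline("ner", model=MODEL, tokenizer=MODEL)

# With grouping (aggregation_strategy supersedes grouped_entities=True):
# sub-words are merged back into words and the tag moves to "entity_group".
grouped_ner = pipeline("ner", model=MODEL, tokenizer=MODEL,
                       aggregation_strategy="simple")

text = "Peaceful protest - India - Bengaluru"
print(raw_ner(text))      # ..., {'word': 'Bengal', 'entity': 'I-LOC', ...}, {'word': '##uru', ...}
print(grouped_ner(text))  # ..., {'word': 'Bengaluru', 'entity_group': 'LOC', ...}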

LorenzBuehmann commented 1 year ago

The question now is:

  1. Go to the latest 4.25.1.
  2. Should we enable grouping by default? If not, users will have to aggregate the words themselves later on (see the sketch after this list).
  3. Which aggregation strategy to prefer? There are currently five options (from the transformers docstring):
    The strategy to fuse (or not) tokens based on the model prediction.
    - "none": Will simply not do any aggregation and simply return raw results from the model.
    - "simple": Will attempt to group entities following the default schema. (A, B-TAG), (B, I-TAG), (C, I-TAG), (D, B-TAG2), (E, B-TAG2) will end up being [{"word": ABC, "entity": "TAG"}, {"word": "D", "entity": "TAG2"}, {"word": "E", "entity": "TAG2"}]. Notice that two consecutive B tags will end up as different entities. On word-based languages, we might end up splitting words undesirably: imagine Microsoft being tagged as [{"word": "Micro", "entity": "ENTERPRISE"}, {"word": "soft", "entity": "NAME"}]. Look for FIRST, MAX, AVERAGE for ways to mitigate that and disambiguate words (on languages that support that meaning, which is basically tokens separated by a space). These mitigations will only work on real words; "New york" might still be tagged with two different entities.
    - "first": (works only on word-based models) Will use the `SIMPLE` strategy, except that words cannot end up with different tags. Words will simply use the tag of the first token of the word when there is ambiguity.
    - "average": (works only on word-based models) Will use the `SIMPLE` strategy, except that words cannot end up with different tags. Scores will be averaged first across tokens, and then the maximum label is applied.
    - "max": (works only on word-based models) Will use the `SIMPLE` strategy, except that words cannot end up with different tags. The word entity will simply be the token with the maximum score.