weaviate / ner-transformers-models

The inference container for the Weaviate NER transformers module

NER returns words with padding chars '#' #7

Open LorenzBuehmann opened 1 year ago

LorenzBuehmann commented 1 year ago

Hi, I'm using the latest Weaviate 1.16.6 with the NER module configured via Docker:

ner-transformers:
    image: semitechnologies/ner-transformers:dbmdz-bert-large-cased-finetuned-conll03-english
    environment:
      ENABLE_CUDA: '0'

I'm getting somewhat unexpected results, what I would call "partial" tokens. For example, for the input document

"Peaceful protest - India - Bengaluru"

it returns (relevant part only):

{
                    "_additional": {
                        "tokens": [
                            {
                                "certainty": 0.996863603591919,
                                "endPosition": 24,
                                "entity": "I-LOC",
                                "property": "name",
                                "startPosition": 19,
                                "word": "India"
                            },
                            {
                                "certainty": 0.9994908571243286,
                                "endPosition": 33,
                                "entity": "I-LOC",
                                "property": "name",
                                "startPosition": 27,
                                "word": "Bengal"
                            },
                            {
                                "certainty": 0.9993565678596497,
                                "endPosition": 36,
                                "entity": "I-LOC",
                                "property": "name",
                                "startPosition": 33,
                                "word": "##uru"
                            }
                        ]
                    }
}

Putting the same string into the Hugging Face model UI, at least, renders "Bengaluru" as a single NER-tagged token.
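For reference, the tokens above come from the module's _additional { tokens } GraphQL field. A rough sketch of such a query with the v3 Python client follows; the class name Document and the filter arguments are assumptions based on this example and may differ from the actual schema:

import weaviate

client = weaviate.Client("http://localhost:8080")

# Assumed class/property names, for illustration only.
query = """
{
  Get {
    Document {
      name
      _additional {
        tokens(properties: ["name"], limit: 10) {
          certainty
          endPosition
          entity
          property
          startPosition
          word
        }
      }
    }
  }
}
"""

result = client.query.raw(query)
print(result)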

LorenzBuehmann commented 1 year ago

I asked in Slack and got a response from @laura-ham:

I think this has to do with entity grouping. There is a parameter introduced to the huggingface model called “grouped_entities”, which can be set to True or False. My guess is that this NER module in Weaviate doesn’t make use of this parameter, which is why the entities are split. See here for more info: https://github.com/huggingface/transformers/pull/3957

I checked the code and @laura-ham is right. One has to explicitly enable grouped_entities during pipeline creation, or, better, set the aggregation_strategy in the latest API. That alone is not enough: when going to the latest version 4.25.1, the response format changes slightly once grouping is enabled: https://github.com/huggingface/transformers/blob/main/src/transformers/pipelines/token_classification.py#L202-L203

So, instead of entity, the NER tag will be contained in the entity_group key.
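For illustration, a minimal sketch of the difference (not the module's actual code; the model name matches the image used above):

from transformers import pipeline

MODEL = "dbmdz/bert-large-cased-finetuned-conll03-english"

# Without grouping: raw sub-word tokens, NER tag under the "entity" key.
raw_ner = pipeline("ner", model=MODEL, tokenizer=MODEL)

# With grouping (aggregation_strategy supersedes grouped_entities=True):
# sub-words are merged back into words and the tag moves to "entity_group".
grouped_ner = pipeline("ner", model=MODEL, tokenizer=MODEL,
                       aggregation_strategy="simple")

text = "Peaceful protest - India - Bengaluru"
print(raw_ner(text))      # ..., {'word': 'Bengal', 'entity': 'I-LOC', ...}, {'word': '##uru', ...}
print(grouped_ner(text))  # ..., {'word': 'Bengaluru', 'entity_group': 'LOC', ...}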

LorenzBuehmann commented 1 year ago

The question now is:

  1. Go to the latest 4.25.1.
  2. Should we enable grouping by default? If not, users will have to aggregate the words themselves later on (see the sketch after this list).
  3. Which aggregation strategy to prefer? There are currently five options (from the transformers docstring):
    The strategy to fuse (or not) tokens based on the model prediction.
    - "none": Will simply not do any aggregation and simply return raw results from the model.
    - "simple": Will attempt to group entities following the default schema. (A, B-TAG), (B, I-TAG), (C, I-TAG), (D, B-TAG2), (E, B-TAG2) will end up being [{"word": ABC, "entity": "TAG"}, {"word": "D", "entity": "TAG2"}, {"word": "E", "entity": "TAG2"}]. Notice that two consecutive B tags will end up as different entities. On word-based languages, we might end up splitting words undesirably: imagine Microsoft being tagged as [{"word": "Micro", "entity": "ENTERPRISE"}, {"word": "soft", "entity": "NAME"}]. Look for FIRST, MAX, AVERAGE for ways to mitigate that and disambiguate words (on languages that support that meaning, which is basically tokens separated by a space). These mitigations will only work on real words; "New york" might still be tagged with two different entities.
    - "first": (works only on word-based models) Will use the `SIMPLE` strategy, except that words cannot end up with different tags. Words will simply use the tag of the first token of the word when there is ambiguity.
    - "average": (works only on word-based models) Will use the `SIMPLE` strategy, except that words cannot end up with different tags. Scores will be averaged first across tokens, and then the maximum label is applied.
    - "max": (works only on word-based models) Will use the `SIMPLE` strategy, except that words cannot end up with different tags. The word entity will simply be the token with the maximum score.