Questions about the paper

Serega6678 commented 7 months ago

Hey guys, great work! Thank you for publishing the paper. Very impressed with your results, especially for 250M and 780M models - they look super cool!

I've got several questions:

Am I right, that in your opinion, you got better results than GoLLIE because:
- Your prompt is better than GoLLIE and you found out that is's important to generate an entire output (aka full contextual lengths, instead of what they are doing), and you made this entire algorithm actually viable thanks to LCS, etc
- You resolve the problem of Back Tokenization (which also adds some additional performance)
- Did I forget something???
When reading the paper, I got the impression that the Hierarchical Matching algorithm can be replaced with teacher forcing - you just make the model generate the correct word (if it's time to generate the next word of the sentence) or you force the model to make entity prediction aka generate "(" then some entity and then ")". Why did you do the "Hierarchical Matching Algorithm", am I missing something?

Thank you very much for your response

P.S. I am also working on NER-pretraining using artificial data. But I am primarily interested in pre-training BERT-like encoders with great token-level embeddings so mostly Feature Extraction task. Recently we released our models and got the SOTA few-shot results for NER. You can find them and the paper here: https://huggingface.co/collections/numind/paper-65e1f6e14639e2a465af823b

yyDing1 commented 7 months ago

Thank you for your appreciation of our contributions.

Your grasp of our methodology aligns closely with our intentions. To put it more directly, in terms of the final $F_1$ score, the improvements over previous approaches are primarily reflected in:

Recall: The model is guided to make judgments on every token in a sentence (including those in non-entity texts), which helps recall more entities.
Precision: The context of an entity often plays a crucial role in its identification; for example, words like "go to", "travel to" or "run to" are often followed by a location name. Including contextual information into training is also beneficial.

We observed improvements in both recall and precision, which in turn led to an increase in the $F_1$ score.

The teacher-forcing method you mentioned is indeed a good approach. Similarly, during our experiment, we attempted to construct a rule-based constraint decoding method, where only the correct word would be considered during inference. In this process, we made the following observations about the rule-based constraint decoding method:

Its effectiveness is comparable to that of the Hierarchical Matching algorithm, as reflected by the similar $F_1$ score.
The efficiency of inference is lower, and it's challenging to implement an efficient batched inference algorithm.

Given that auto-regressive generation itself is not very efficient, we prefer an offline and faster method, which is why we ultimately chose such a optimized matching algorithm.

I'm quite interested in your work, and gain more insights in dataset creation after reading your paper. NuNER also provides better pre-trained models for NER applications to choose from. Thank you for your contributions in terms of data, methods and pre-trained models!

Serega6678 commented 7 months ago

@yyDing1 Thank you very much for sharing your intuition about your work :)

Usually, papers are so formal and crammed with information, that it's difficult to get the intuition behind some actions. Very happy I was able to understand yours from your paper :)

Good luck with your future research! Looking forward to it

Serega6678 commented 7 months ago

Btw feel free to use the dataset from our paper - it's like Pile but like 15 times bigger and contains slightly other domains and texts (as it's based on C4). Maybe you can even merge it with Pile and train the model on it

https://huggingface.co/datasets/numind/NuNER

yyDing1 commented 7 months ago

It's really rewarding to hear that the intuition behind our research was clear and accessible. That's always a goal we strive for and your feedback means a lot.

We are excited about the prospect of developing a stronger generative Named Entity Recognition (NER) model based on the larger dataset that covers a broader range of domains. This is already under progress, and this model will be released soon.

Best wishes for your ongoing and future projects as well.

yyDing1 commented 6 months ago

Hello, we have trained a more powerful model by mixing the data from Pile-NER and NuNER, achieving an $F_1$ score of 64.12 in a zero-shot setting, an improvement of 0.6 compared to using Pile-NER alone. Moreover, benefiting from the broader coverage of NuNER, we notice that the model can recoginize more entity categories accurately.

GNER-T5-large-v2 are released here. More models will be released soon.

Serega6678 commented 6 months ago

@yyDing1 thank you very much for the update

I am currently also trying to combine NuNER and Pile datasets and I noticed that due to length and domain differences, it yields noticeable gains, like 1-2%. I tried cutting NuNER to the size of PileNER, to the half, and to the double size. So far, 1 to 1 seems to be the best (almost the same as x2. The NuNER[:len(PileNER)//2] performs almost as PileNER so almost no gains).

Did your results change much? What did you try to do? Did you do any additional data processing or only combine the data?

yyDing1 commented 6 months ago

I combined all the data and performed some data processing work, primarily focusing on external entity sampling: I sampled categories of entities that were not present in the sentences as additional queries. Thanks for your advice; I will optimize the data distribution.

Some additional thoughts

Given NuNER's extensive coverage of entity types, a broader benchmark would be more suitable for evaluation, perhaps one that facilitates the open-ended generation of entity types, where the predicted entities are not limited to a pre-defined set.

Furthermore, I noticed the data quality in the NuNER dataset varied, with numerous entities not being extracted in some low-quality samples. I am also interested in how to extract high-quality data from this large-scale dataset.

yyDing1 / GNER

Questions about the paper #2