Hi @Maybewuss, to answer your questions:
(1) Entity masking and overfitting issue: I think an example may better illustrate this:
(Albert Einstein) was born in [Germany] in 1879.
Here let's say that "Albert Einstein" is the subject, "Germany" is the object entity, and there is a "per:country_of_birth" relation between them. A model trained directly on this data without entity masking may learn the correlation (a so-called "dataset bias") between "Einstein" or "Germany" and the output relation "per:country_of_birth", instead of focusing solely on the textual evidence "was born in". A dangerous outcome of learning this correlation is that, at test time, the model may predict "per:country_of_birth" whenever it sees "Einstein" and "Germany" together, which is not what we want. In fact, data biases like this exist in almost any dataset, including TACRED.
In contrast, when entity masking is applied, the above example becomes:
(SUBJ-PERSON) was born in [OBJ-COUNTRY] in 1879.
Now neither the actual subject nor the actual object entity is exposed to the model during training. As a result, the model has to rely on the entity types and the textual evidence "was born in" to make its prediction, resulting in better generalization to unseen examples at test time.
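To make the masking step concrete, here is a minimal sketch in Python. It assumes tokenized sentences with subject/object span indices and NER types, roughly following the field names in the released TACRED JSON (`subj_start`, `subj_end`, `subj_type`, and so on); the actual preprocessing code in this repo may differ in details.

```python
def mask_entities(tokens, subj_start, subj_end, subj_type, obj_start, obj_end, obj_type):
    """Replace the subject and object spans with special SUBJ-<type> / OBJ-<type> tokens.

    Span indices are inclusive, as in the TACRED JSON format (an assumption here).
    """
    masked = list(tokens)
    # Every token inside the subject span becomes the same SUBJ-<type> token.
    for i in range(subj_start, subj_end + 1):
        masked[i] = 'SUBJ-' + subj_type
    # Likewise for the object span.
    for i in range(obj_start, obj_end + 1):
        masked[i] = 'OBJ-' + obj_type
    return masked


tokens = ['Albert', 'Einstein', 'was', 'born', 'in', 'Germany', 'in', '1879', '.']
print(mask_entities(tokens, 0, 1, 'PERSON', 5, 5, 'COUNTRY'))
# ['SUBJ-PERSON', 'SUBJ-PERSON', 'was', 'born', 'in', 'OBJ-COUNTRY', 'in', '1879', '.']
```

After this step the model never sees the surface strings "Albert Einstein" or "Germany", only their types and the surrounding context.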
(2) Data processing pipeline: I think https://github.com/yuhaozhang/tacred-relation/issues/9 may be helpful?
Thanks. But when entity masking is applied, how do you ensure that the model focuses on the textual evidence "was born in" rather than on "(SUBJ-PERSON)" and "[OBJ-COUNTRY]"? Maybe at test time the model always predicts "per:country_of_birth" whenever it sees "(SUBJ-PERSON)" and "[OBJ-COUNTRY]". In other words, if the training data contains many sentences like "(Albert Einstein) was born in [Germany] in 1879.", the model may overfit. But a sentence containing "Einstein" and "Germany" doesn't necessarily entail a "per:country_of_birth" relation between them, so why does the model learn this "dataset bias"?
Imagine after deployment your model sees such an example:
(Krüger) was born in [Germany] in 1980.
Now since "Krüger" was never seen during training, it might be mapped by many neural models to an `<UNK>` token. As a result, models trained without entity masking may fail to generalize to this example. With entity masking, however, this example looks exactly like the previous example seen during training.
In fact we ran an experiment like this with the SemEval dataset in Sec 6.4 of this paper. In short, we trained a relation extraction model without entity masking, and at test time we replaced all subject and object entities with `<UNK>` tokens to simulate the scenario where the entities were never seen during training. The model's test performance drops drastically from 83.6 to 62.4 F1, showing that it has a hard time generalizing to sentences with unseen entities.
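For reference, here is a rough sketch of how such an unseen-entity simulation could be set up at test time. The helper name, span arguments, and `<UNK>` string are illustrative, not taken from the paper's code; the idea is simply to corrupt the entity spans while leaving the rest of the sentence intact.

```python
UNK_TOKEN = '<UNK>'  # assumed out-of-vocabulary token used by the model's vocabulary


def unk_entities(tokens, subj_start, subj_end, obj_start, obj_end):
    """Replace subject and object tokens with <UNK> to mimic entities never seen in training."""
    corrupted = list(tokens)
    for i in range(subj_start, subj_end + 1):
        corrupted[i] = UNK_TOKEN
    for i in range(obj_start, obj_end + 1):
        corrupted[i] = UNK_TOKEN
    return corrupted


tokens = ['Krüger', 'was', 'born', 'in', 'Germany', 'in', '1980', '.']
print(unk_entities(tokens, 0, 0, 4, 4))
# ['<UNK>', 'was', 'born', 'in', '<UNK>', 'in', '1980', '.']
```

Evaluating a model trained without entity masking on inputs corrupted this way is what produced the drop reported above.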
In Section 4.2 of the paper, you say this processing step helps (1) provide a model with entity type information, and (2) prevent a model from overfitting its predictions to specific entities. But I think it may easily lead to overfitting instead of preventing it: the original subject and object entities are quite varied, while after this processing they are masked into a few special tokens, and the number and variety of these special tokens are not as rich as the original tokens. So why does this processing step prevent a model from overfitting? Another question: if I feed new text to this model, I have to recognize the named entities in the text first; how can I do that if I don't use Stanford CoreNLP? By the way, can this model be applied to other datasets or real-life scenarios? Thanks and God bless.