codebase for the Text-based NP Enrichment (TNE) paper
MIT License

Text-based NP Enrichment (TNE)

TNE is an NLU task that focuses on relations between noun phrases (NPs) that can be mediated via prepositions. The dataset contains 5,497 documents, exhaustively annotated with all possible links between the NPs in each document.

For more details, check out our paper, "Text-based NP Enrichment", and the website.

Data

Load from Huggingface's Datasets Library

from datasets import load_dataset

dataset = load_dataset("tne")

Download

Data Format

The dataset is spread across four files, one per split: train, dev, test, and ood. Each file is in JSON Lines (jsonl) format, with one JSON dictionary per line representing a single document.
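Since each line is an independent JSON dictionary, a split file can be read with the standard library alone. A minimal sketch (the per-document field names are not listed here, so this only loads the raw dictionaries):

```python
import json

def read_jsonl(path):
    """Read a TNE split file: one JSON dictionary (document) per line."""
    docs = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip any blank lines
                docs.append(json.loads(line))
    return docs

# Example usage: docs = read_jsonl("data/train.jsonl")
```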

Getting Started

Install dependencies

conda create -n tne python==3.7 anaconda
conda activate tne

pip install -r requirements.txt

Models

We train the models using allennlp.

To run the coupled-large model, run:

allennlp train tne/modeling/configs/coupled_large.jsonnet \
         --include-package tne \
         -s models/coupled_spanbert_large

After training a model (or using the trained one), you can get the predictions file using:

allennlp predict models/coupled_spanbert_large/model.tar.gz data/test.jsonl \
         --output-file coupled_large_predictions.jsonl \
         --include-package tne \
         --use-dataset-reader \
         --predictor tne_predictor

Trained Model

We release the best model we achieved, coupled-large, which can be downloaded here. If you are interested in other models from the paper, please let me know via email or open an issue, and I will upload them as well.

Citation


@article{tne,
    author = {Elazar, Yanai and Basmov, Victoria and Goldberg, Yoav and Tsarfaty, Reut},
    title = "{Text-based NP Enrichment}",
    journal = {Transactions of the Association for Computational Linguistics},
    volume = {10},
    pages = {764-784},
    year = {2022},
    month = {07},
    issn = {2307-387X},
    doi = {10.1162/tacl_a_00488},
    url = {https://doi.org/10.1162/tacl\_a\_00488},
    eprint = {https://direct.mit.edu/tacl/article-pdf/doi/10.1162/tacl\_a\_00488/2037151/tacl\_a\_00488.pdf},
}

Submitting to the Leaderboard

To submit your model's predictions to the leaderboard, you need to create an answer file. You can find details on the submission process here, and the evaluation code and tests here.

Changelog

Q&A

Q: But what about huggingface (dataset, hub, implementation)?

I found it easier to use the allennlp framework, but I might consider using the hf infrastructure as well in the future. Feel free to upload the dataset there, or to suggest an implementation using the hf codebase.

Q: What if I find a bug?

It happens! Please open an issue and I'll do my best to address it.

Q: What about additional trained models?

I uploaded the best model we trained from the paper. If there's interest, I can upload the others as well. Open an issue or email me.

Q: Why are there no labels in the released test-set files?

We decided to keep the labels hidden to avoid overfitting on this dataset. However, once you have a good model, you can upload your predictions to the leaderboard (and the ood leaderboard) and find out your score!