yangheng95 / ABSADatasets

Public & Community-shared datasets for Aspect-based sentiment analysis and Text Classification
MIT License
208 stars 64 forks source link
absa dataset pyabsa

ABSA datasets for PyABSA

total views total views per week total clones total clones per week

Need to Annotate Your Own Dataset?

Auto-annoate your datasets via PyABSA!

There is an experimental feature which allows you to auto-build APC dataset and ATEPC datasets, see the usage here:

from pyabsa import make_ABSA_dataset 
# refer to the comments in this function for detailed usage
make_ABSA_dataset(dataset_name_or_path='integrated_datasets/review', checkpoint='english')

To augment your datasets, please refer to BoostTextAugmentation

Format your dataset to use it in PyABSA

Important: Rename your dataset filename before use it in PyABSA

It is recommended to assign an id for your dataset, which will avoid some potential problem (e.g., dataset mis-loading) while PyABSA detects the dataset. Merging your datasets into ABSADatasets, please keep the id remained.

datasets
├── apc_datasets
│     ├── 101.restaurant
│     │    ├── restaurant.train.dat.apc  # train_dataset
│     │    ├── restaurant.test.dat.apc  # test_dataset
│     │    └── restaurant.valid.dat.apc  # valid_dataset, dev set are not recognized in PyASBA, please rename dev-set to valid-set
│     └── others
├── atepc_datasets
datasets
├── 101.restaurant
│    ├── restaurant.train.dat.atepc  # train_dataset
│    ├── restaurant.test.dat.atepc  # test_dataset
│    └── restaurant.valid.dat.atepc  # valid_dataset, dev set are not recognized in PyASBA, please rename dev-set to valid-set
└── others

I prepare a demo custom APC/ATEPC dataset which is based on third-party annotated Yelp dataset. Iif you got problem in dataset renaming, please put your data into the prepared dataset files. Check datasets/apc_datasets/100.CustomDataset and datasets/atepc_datasets/100.CustomDatasetto view or rewrite the custom dataset.

Then, use the {id}.{dataset name} to locate your dataset, e.g.,

from pyabsa.functional import APCConfigManager
from pyabsa.functional import Trainer
from autocuda import auto_cuda

config = APCConfigManager.get_apc_config_english()  # APC task
dataset = '101.restaurant'
# dataset = '100.CustomDataset'
Trainer(config=config,
        dataset=dataset,  # train set and test set will be automatically detected
        checkpoint_save_mode=1,
        auto_device=auto_cuda()  # automatic choose CUDA or CPU
        )

Notice

All datasets provided are for research only, we do not hold any Copyright of any datasets. These datasets follow their original licenses (if any).

Datasets source:

MAMS https://github.com/siat-nlp/MAMS-for-ABSA

SemEval 2014: https://alt.qcri.org/semeval2014/task4/index.php?id=data-and-tools

SemEval 2015: https://alt.qcri.org/semeval2015/task12/index.php?id=data-and-tools

SemEval 2016: https://alt.qcri.org/semeval2016/task5/index.php?id=data-and-tools

Chinese: https://www.sciencedirect.com/science/article/abs/pii/S0950705118300972?via%3Dihub

Shampoo: brightgems@GitHub

MOOC: jmc-123@GitHub with GPL License

Twitter: https://dl.acm.org/doi/10.5555/2832415.2832437

Television & TShirt: https://github.com/rajdeep345/ABSA-Reproducibility

Yelp: WeiLi9811@GitHub

SemEval2016Task5: YaxinCui@GitHub

English-MOOC github: aparnavalli@GitHub

Contribute (prepare) your dataset to ABSADatasets to use it in PyABSA

We hope you can share your custom dataset or an available public dataset. If you are willing to, please follow the instruction to process your data and open a PR.