
yangheng95 / ABSADatasets

Public & Community-shared datasets for Aspect-based sentiment analysis and Text Classification

MIT License

207 stars, 64 forks

More Datasets #1

Closed janpf closed 3 years ago

janpf commented 3 years ago

Hi, would you mind adding Laptop 15, 16 and Hotel from SemEval? As those are formatted identically to Restaurant 15, 16, I think they should import rather cleanly 😄

Additionally I'd suggest adding the datasets from https://github.com/rajdeep345/ABSA-Reproducibility/tree/main/code/datasets/semeval14, as those are already in identical format (as far as I can tell) and "only" the ATE part is missing.

Thanks for your work!

yangheng95 commented 3 years ago

That is a good suggestion, I will do it later.

yangheng95 commented 3 years ago

I added two datasets from ABSA-Reproducibility a few days ago, but I didn't find easy-to-adapt Laptop 15 and 16 datasets. Can you give me a reference? Thank you for your help!

janpf commented 3 years ago

> I added two datasets from ABSA-Reproducibility a few days ago

Thanks!

> but I didn't find easy-to-adapt Laptop 15 and 16 datasets. Can you give me a reference?

Sure! I found them on the original task pages: https://alt.qcri.org/semeval2015/task12/index.php?id=data-and-tools https://alt.qcri.org/semeval2016/task5/index.php?id=data-and-tools

Where did you find the other datasets? Is there another source?

yangheng95 commented 3 years ago

I added the T-shirt and Television datasets from the ABSA-Reproducibility link.

Thanks! However, these datasets are not processed into the recommended format. Can you share the processed datasets instead? Since I am working on other topics, I may not be able to process the datasets in time.

janpf commented 3 years ago

> I added the T-shirt and Television datasets from the ABSA-Reproducibility link.

Thanks!

> Thanks! However, these datasets are not processed into the recommended format. Can you share the processed datasets instead?

Ah, I thought you had a converter script for semeval-format => your format. Would you mind sharing where you got your datasets from, if not from SemEval? Maybe they have a script? ;)

yangheng95 commented 3 years ago

Unfortunately, we have to reformat the data with our own code. As far as I know, there is no script that can do this for us.
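
For readers looking for a starting point, here is a minimal sketch of such a conversion. It is not a script from this repository. It assumes the SemEval 2015/2016 XML layout (`sentence` elements containing a `text` element and `Opinion` elements with `target`, `polarity`, `from` and `to` attributes) and assumes the target APC format is three lines per record: the sentence with the aspect replaced by `$T$`, the aspect term, and the polarity label. The function name `semeval_to_apc`, the `label_map` parameter and the sample XML are hypothetical and only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in the SemEval 2015/2016 XML layout (not real dataset content).
SAMPLE_XML = """<Reviews>
  <Review rid="1">
    <sentences>
      <sentence id="1:0">
        <text>The pizza was great but the service was slow.</text>
        <Opinions>
          <Opinion target="pizza" category="FOOD#QUALITY" polarity="positive" from="4" to="9"/>
          <Opinion target="service" category="SERVICE#GENERAL" polarity="negative" from="28" to="35"/>
          <Opinion target="NULL" category="RESTAURANT#GENERAL" polarity="neutral" from="0" to="0"/>
        </Opinions>
      </sentence>
    </sentences>
  </Review>
</Reviews>"""


def semeval_to_apc(xml_text, label_map=None):
    """Convert SemEval 2015/2016 style XML into three-line APC records.

    Assumed output format: sentence with $T$ placeholder, aspect term, label.
    label_map optionally renames polarity labels (e.g. "positive" -> "Positive").
    """
    label_map = label_map or {}
    root = ET.fromstring(xml_text)
    lines = []
    for sentence in root.iter("sentence"):
        text = sentence.findtext("text") or ""
        for opinion in sentence.iter("Opinion"):
            target = opinion.get("target")
            # Opinions without an explicit target cannot be expressed as APC records.
            if target is None or target == "NULL":
                continue
            start, end = int(opinion.get("from")), int(opinion.get("to"))
            # Skip records whose character offsets do not match the target string.
            if text[start:end] != target:
                continue
            polarity = opinion.get("polarity")
            lines.append(text[:start] + "$T$" + text[end:])
            lines.append(target)
            lines.append(label_map.get(polarity, polarity))
    return lines


records = semeval_to_apc(SAMPLE_XML)
print("\n".join(records))
```

Opinions whose target is `NULL` or missing are skipped in this sketch, since they carry no aspect term to substitute; whether that is the right choice for a given dataset is a design decision left to the person doing the conversion.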

kunalverma75 commented 3 years ago

@yangheng95 could you check whether my dataset format for SemEval2016 Task5 Subtask1 (APC, Dutch) is correct? I would like to share other multilingual datasets with your repository as well. I am attaching the file for your reference: SemEval.Dutch.train.apc.txt

yangheng95 commented 3 years ago

> @yangheng95 could you check whether my dataset format for SemEval2016 Task5 Subtask1 (APC, Dutch) is correct? I would like to share other multilingual datasets with your repository as well. I am attaching the file for your reference: SemEval.Dutch.train.apc.txt

Hello, thanks for sharing. The format is correct, and the polarity labels are valid. You can open a PR for your datasets with copyright information, e.g., the source and who processed them. I will merge it and register the dataset in PyABSA after the necessary tests and conversion to ATEPC format. Thanks again.
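
A format check like the one described above can be approximated with a few lines of code before opening a PR. The sketch below is an assumption about what such a check might cover, not the procedure used for this repository: it assumes three lines per APC record (sentence with a `$T$` placeholder, aspect term, polarity label) and takes the set of accepted labels as a parameter, since the thread does not list them. The function name `check_apc_lines` is hypothetical.

```python
def check_apc_lines(lines, allowed_labels):
    """Return a list of problems found in three-line APC records.

    Assumed record layout: sentence containing $T$, aspect term, polarity label.
    """
    problems = []
    if len(lines) % 3 != 0:
        problems.append(f"line count {len(lines)} is not a multiple of 3")
    # Only inspect complete records; a trailing partial record is reported above.
    complete = len(lines) - len(lines) % 3
    for i in range(0, complete, 3):
        sentence, aspect, label = (s.strip() for s in lines[i:i + 3])
        record = i // 3 + 1
        if "$T$" not in sentence:
            problems.append(f"record {record}: missing $T$ placeholder")
        if not aspect:
            problems.append(f"record {record}: empty aspect term")
        if label not in allowed_labels:
            problems.append(f"record {record}: unexpected label {label!r}")
    return problems


# Hypothetical records for illustration only.
example = [
    "The $T$ was great.", "pizza", "Positive",
    "The service was slow.", "service", "Bad",
]
for problem in check_apc_lines(example, {"Positive", "Negative", "Neutral"}):
    print(problem)
```

To check a file, read it with `open(path, encoding="utf-8").read().splitlines()` and pass the resulting list together with the label set the dataset is expected to use.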

yangheng95 commented 3 years ago

I will close this issue because it has been inactive for 3 weeks.