yangheng95 / ABSADatasets

Public & Community-shared datasets for Aspect-based sentiment analysis and Text Classification
MIT License

how to use my customized data? #10

Closed WeiLi9811 closed 3 years ago

WeiLi9811 commented 3 years ago

Hi, yangheng! Thanks so much for your excellent work. I was wondering how I can use my customized data (JSON/CSV file). Is there any solution I could follow in your package?

yangheng95 commented 3 years ago

Hello, there is currently no method to read CSV or JSON datasets; it may be a future feature. But you can reformat your dataset according to the provided APC dataset and get the ATEPC datasets and inference sets through the PyABSA entries (see the README in ABSADatasets).
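As a rough illustration of that reformatting, here is a minimal sketch that turns CSV rows into the three-line APC layout used by the provided datasets (sentence with the aspect replaced by $T$, then the aspect, then the polarity). The column names text, aspect and polarity and the helper csv_to_apc_lines are hypothetical and not part of PyABSA; adjust them to your own file.

```python
import csv
import io


def csv_to_apc_lines(csv_text):
    """Convert CSV rows (text, aspect, polarity) into APC-style lines.

    Each sample becomes three lines: the sentence with the aspect
    replaced by $T$, the aspect term, and the polarity label.
    The column names are assumptions made for this sketch.
    """
    lines = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        text, aspect = row["text"], row["aspect"]
        if aspect not in text:
            # Skip rows whose aspect term cannot be located in the sentence.
            continue
        # Mask only the first occurrence of the aspect term.
        lines.append(text.replace(aspect, "$T$", 1))
        lines.append(aspect)
        lines.append(row["polarity"])
    return lines


sample = "text,aspect,polarity\nThe food was tasty,food,Positive\n"
print("\n".join(csv_to_apc_lines(sample)))
```

A sentence with several aspects would need one CSV row (and so one three-line sample) per aspect.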

WeiLi9811 commented 3 years ago

thanks a lot!

feemthan commented 2 years ago

@yangheng95 Excellent work on PyABSA and this git repo!

I am facing some issues while creating my custom dataset. I first used my own dataset, prepared with the tools provided in this repo in the recommended {id}.{name} format, and it fails with a "Target 2 out of bounds" error. When I use the yelp dataset in the {id}.{name} format, the training completes but ends with a size mismatch error between tensor [3, 768] and tensor [3, 3].

How should I resolve these?

yangheng95 commented 2 years ago

Hi, @feemthan Can you show me some examples of your dataset?

feemthan commented 2 years ago

This is on APC FAST but could occur on other models as well.

The configs I used for both datasets were:

config = APCConfigManager.get_apc_config_english()
config.model = APCModelList.FAST_LSA_T
config.cache_dataset = False
config.num_epoch = 1
config.seed = 0

For the case that finishes the training but gives the tensor [3, 768] vs. [3, 3] error: I reused yelp as a custom dataset by renaming it to 661/661.yelp.train.apc.txt, test and valid similarly.

"""
Been wanting to check this place out for awhile, so I came in on a whim after lunch. The counter girl gave me the impression she wasn' t happy to be there since she just stared at me without smiling when I walked in. You can' t substitute veggies for rice, it' s an add on, you can only get the spring rolls fried( gross), I ordered the chicken curry with veggies to go, the place is small, clean and quiet. The dish was bland, the" $T$" were 3 pieces of carrot and 2 pieces of potato. It' s inexpensive, and close to my house, but I will not visit again.
veggies
Negative
Wanted to check out this establishment last Friday afternoon, as I live fairly close in Medford and it had descent reviews. Well I was quite pleased! Loved the variations on menu, $T$ were reasonable, food was tasty! They were very kid friendly as I was with my active 2 year old it helps they have TVS by each booth. The ambiance very cool and sophisticated yet comfy. Bar scene was seemed fun professional and locals that were friendly having great time. We truly enjoyed our food. Loved the pulled pork sandwich and the nachos were fabulous! Will definitely come back! Our server Mark and the rest of the staff worked well together as a team! Thank you for a great time!
prices
Neutral
Wanted to check out this establishment last Friday afternoon, as I live fairly close in Medford and it had descent reviews. Well I was quite pleased! Loved the variations on menu, prices were reasonable, $T$ was tasty! They were very kid friendly as I was with my active 2 year old it helps they have TVS by each booth. The ambiance very cool and sophisticated yet comfy. Bar scene was seemed fun professional and locals that were friendly having great time. We truly enjoyed our food. Loved the pulled pork sandwich and the nachos were fabulous! Will definitely come back! Our server Mark and the rest of the staff worked well together as a team! Thank you for a great time!
food
Positive
"""

For the "Target 2 out of bounds" error:

"""
I am listening to my mums conversation on the phone she is now sobbing after your advisor has been so $T$ and obstructive She will be leaving them today I am disgusted
rude
Negative
I am listening to my mums conversation on the phone she is now sobbing after your advisor has been so rude and $T$ She will be leaving them today I am disgusted
obstructive
Negative
Usual shambles trying to contact about a complaint there is a minute delay due to the $T$ minutes later Im still waiting Seriously the service has fallen from poor to poor amp misleading can anyone suggest an alternative business
coronavirus
Negative
"""

Thank you for your work. I am happy to help contribute to this repo or to PyABSA anytime.

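The samples quoted above use three lines per sample: the sentence with a $T$ placeholder, the aspect term, and the polarity label. In PyTorch, a "Target N is out of bounds" error is raised by a classification loss when a label index is not below the number of output classes, so counting the labels that actually occur in a file is a cheap first check. Whether that is the cause in this thread is not confirmed; the sketch below only validates the layout, and check_apc_lines is a hypothetical helper, not a PyABSA function.

```python
from collections import Counter


def check_apc_lines(lines):
    """Return (label_counts, problems) for APC-style three-line samples."""
    # Ignore blank lines so trailing newlines do not shift the grouping.
    lines = [ln.strip() for ln in lines if ln.strip()]
    problems = []
    if len(lines) % 3 != 0:
        problems.append(f"{len(lines)} non-empty lines is not a multiple of 3")
    labels = Counter()
    # Walk over complete groups of three lines only.
    for i in range(0, len(lines) - len(lines) % 3, 3):
        sentence, _aspect, label = lines[i:i + 3]
        if "$T$" not in sentence:
            problems.append(f"sample at non-empty line {i + 1}: no $T$ placeholder")
        labels[label] += 1
    return labels, problems


demo = [
    "the service has been so $T$ and obstructive",
    "rude",
    "Negative",
]
counts, issues = check_apc_lines(demo)
print(dict(counts), issues)
```

Running it over the train, test and valid files separately shows whether one split contains a label, or a misaligned line read as a label, that the others lack.
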
yangheng95 commented 2 years ago

Hello, I see. Can you show me your whole console output so I can locate the issue?

yangheng95 commented 2 years ago

I appended these examples to the Yelp dataset and it works fine.

See the script:

from pyabsa.functional import Trainer
from pyabsa.functional import APCConfigManager
from pyabsa.functional import ABSADatasetList
from pyabsa.functional import APCModelList

apc_config_english = APCConfigManager.get_apc_config_english()
apc_config_english.model = APCModelList.FAST_LSA_T
apc_config_english.num_epoch = 10
apc_config_english.evaluate_begin = 2
apc_config_english.pretrained_bert = 'microsoft/deberta-v3-base'
apc_config_english.similarity_threshold = 1
apc_config_english.max_seq_len = 80
apc_config_english.dropout = 0.5
apc_config_english.seed = 2672
apc_config_english.log_step = 50
apc_config_english.l2reg = 1e-8
apc_config_english.dynamic_truncate = True
apc_config_english.srd_alignment = True

Dataset = ABSADatasetList.Yelp
sent_classifier = Trainer(config=apc_config_english,
                          dataset=Dataset,
                          checkpoint_save_mode=0,
                          auto_device=True
                          ).load_trained_model()

examples = [
    'Strong build though which really adds to its [ASP]durability[ASP] .',  # !sent! Positive
    'Strong [ASP]build[ASP] though which really adds to its durability . !sent! Positive',
    'The [ASP]battery life[ASP] is excellent - 6-7 hours without charging . !sent! Positive',
    'I have had my computer for 2 weeks already and it [ASP]works[ASP] perfectly . !sent! Positive',
    'And I may be the only one but I am really liking [ASP]Windows 8[ASP] . !sent! Positive',
]

inference_sets = examples

for ex in examples:
    result = sent_classifier.infer(ex, print_result=True)

feemthan commented 2 years ago

I have run a similar script:

from pyabsa.functional import ABSADatasetList
from pyabsa.functional import APCModelList
from pyabsa.functional import APCConfigManager
from pyabsa.functional import Trainer

from autocuda import auto_cuda
import os
import warnings
warnings.filterwarnings("ignore")

config = APCConfigManager.get_apc_config_english()  # APC task
config.model = APCModelList.FAST_LSA_T
config.cache_dataset = False
config.num_epoch = 1
config.seed = 0
classifier = Trainer(
    config=config,
    dataset='661.yelp',  # train set and test set will be automatically detected
    checkpoint_save_mode=1,
    auto_device='cpu'  # automatic choose CUDA or CPU; if I enable CUDA, it throws a device-side assert error
)

But my yelp is custom. I want to build custom datasets for training following your instructions. If you need the logs, I can put them on Drive, as they are huge.

yangheng95 commented 2 years ago

I updated the README just now. You can try removing the cloned datasets and putting your data into the files in the 100.CustomDataset folder. If the issue continues, please feel free to report and discuss it. Hopefully every issue can be resolved in time.
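Before retraining, it may help to confirm that every split file is present in the dataset folder. The sketch below assumes the naming seen earlier in this thread (for example 661.yelp.train.apc.txt, with test and valid named similarly); the actual file names inside 100.CustomDataset may differ, and missing_split_files is a hypothetical helper, not part of PyABSA.

```python
import tempfile
from pathlib import Path


def missing_split_files(folder, prefix, splits=("train", "test", "valid")):
    """List expected '<prefix>.<split>.apc.txt' files absent from folder."""
    folder = Path(folder)
    expected = [f"{prefix}.{split}.apc.txt" for split in splits]
    return [name for name in expected if not (folder / name).is_file()]


# Demo in a throwaway directory that only contains a train file.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "661.yelp.train.apc.txt").write_text("", encoding="utf-8")
    print(missing_split_files(tmp, "661.yelp"))
```

An empty list means all three split files were found under the assumed naming.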

yangheng95 commented 2 years ago

And make sure your data is not corrupted. You can share a slice of your dataset (one that triggers this error) with me for debugging, so it can be solved soon.

yangheng95 commented 2 years ago

Hi, @feemthan Did you solve this issue?

feemthan commented 2 years ago

I am actively working on solving this issue. But it's not optimized for WSL, which means that I need to run it on Ubuntu. Halting my other workspaces.

feemthan commented 2 years ago

I've tried this on the yelp dataset posing as a custom dataset. The issue persists. Just to reiterate: training runs successfully, but a size mismatch error ([3, 768] from the checkpoint vs. the current shape [3, 3]) occurs at self.model.load_state_dict(torch.load(find_file(save_path, '.state_dict'))) in apc_trainer.py.
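A size mismatch of this kind means that a parameter stored in the checkpoint has a different shape from the same-named parameter in the model being loaded. The sketch below compares two name-to-shape mappings in plain Python; with real PyTorch state dicts the shapes would come from the .shape of each entry. The parameter names and the helper shape_mismatches are hypothetical.

```python
def shape_mismatches(checkpoint_shapes, model_shapes):
    """Return {name: (checkpoint_shape, model_shape)} where shapes differ."""
    shared = checkpoint_shapes.keys() & model_shapes.keys()
    return {
        name: (checkpoint_shapes[name], model_shapes[name])
        for name in shared
        if tuple(checkpoint_shapes[name]) != tuple(model_shapes[name])
    }


# Hypothetical parameter names; only the shapes mirror the reported error.
checkpoint = {"dense.weight": (3, 768), "dense.bias": (3,)}
current = {"dense.weight": (3, 3), "dense.bias": (3,)}
print(shape_mismatches(checkpoint, current))
```

Listing the mismatching names can hint at whether the checkpoint on disk was produced by a different model or configuration than the one being restored.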

yangheng95 commented 2 years ago

Please paste your console output here for analysis. BTW, please remove conflicting checkpoints (i.e., the same model trained on a different dataset), so that findfile does not pick up the wrong checkpoint.
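To see which checkpoints could conflict, one option is to list every state dict file under the working directory before deciding what to remove. This is a minimal sketch that only assumes the files carry .state_dict in their names, as in the find_file call quoted in this thread; the directory layout in the demo and the helper find_state_dicts are hypothetical. It lists files and deletes nothing.

```python
import tempfile
from pathlib import Path


def find_state_dicts(root):
    """Return all files below root whose name contains '.state_dict'."""
    return sorted(p for p in Path(root).rglob("*.state_dict*") if p.is_file())


# Demo with a throwaway directory standing in for a real working directory.
with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp) / "checkpoints" / "run_a"
    run_dir.mkdir(parents=True)
    (run_dir / "model.state_dict").write_bytes(b"")
    for path in find_state_dicts(tmp):
        print(path.relative_to(tmp))
```

Reviewing the list by hand before deleting anything avoids removing a checkpoint that is still needed.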

feemthan commented 2 years ago

Yes, I noticed that findfile was throwing warnings. I removed all the checkpoints before posting the previous comment. Will post the console output here in a sec.

feemthan commented 2 years ago

I found that the model's state dict has to be saved, as it is when cross_validate_fold > 0, for training on yelp to finish. Maybe optimizing the if self.opt.cross_validate_fold > 0 check would help, mainly because we don't know at which epoch the model will be saved. This might slow down training, so I will let you decide. This resolved my issue with yelp as a custom dataset. I will have to look at my own dataset later today. Your help was much appreciated, respect++.

yangheng95 commented 2 years ago

Thanks for your feedback. Please feel free to make suggestions or open a PR.