yangheng95 / PyABSA

Sentiment Analysis, Text Classification, Text Augmentation, Text Adversarial Defense, etc.
https://pyabsa.readthedocs.io
MIT License
949 stars, 160 forks

Question about how the library works under the hood #309

Open torivor opened 1 year ago

torivor commented 1 year ago

Thank you to everyone who has contributed to building such a helpful and comprehensive package for implementing ABSA🙏

I would like to learn how the package's Aspect Polarity Classification (APC) module works. I understand the general idea: it uses a pretrained model as its baseline, which can then be fine-tuned to better fit the user's dataset. However, I'm confused about how the training or fine-tuning works under the hood. Specifically, I'd like to ask the following questions:

1. What kind of input data is required? Assuming that each line of the input data is a document consisting of multiple sentences, can I use a regular Pandas DataFrame as input, or should I use a CSV file? Sorry if this has already been explained somewhere, but so far I haven't found an explicit example of the format in the documentation.

2.a. How does each model's tokenizer work? I'd like to see exactly what text pre-processing is applied to the input data.

2.b. How can I create my own tokenizer? Can I still use the pretrained model from the module with my custom tokenizer? If that's not possible, can I change how the existing tokenizer works?

yangheng95 commented 1 year ago

1: You can refer to the dataset files to see what the input should look like. Note that the input formats for training and inference are different. For example: https://github.com/yangheng95/ABSADatasets/tree/v2.0/datasets/apc_datasets/100.CustomDataset
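For readers starting from tabular data (a DataFrame or CSV), here is a minimal sketch of turning annotated rows into text lines. It is not taken from PyABSA itself. It assumes the training files use three lines per sample (the sentence with the aspect replaced by `$T$`, then the aspect, then the polarity) and that inference lines wrap the aspect in `[B-ASP]...[E-ASP]` with an optional `$LABEL$` suffix. Both layouts are assumptions and should be checked against the dataset files linked above. The helper names `to_training_block` and `to_inference_line` are hypothetical.

```python
from typing import Optional


def _require_aspect(sentence: str, aspect: str) -> None:
    """Fail early if the aspect term does not literally occur in the sentence."""
    if not aspect or aspect not in sentence:
        raise ValueError(f"aspect {aspect!r} not found in sentence")


def to_training_block(sentence: str, aspect: str, polarity: str) -> str:
    """Render one sample as three lines (ASSUMED training layout):
    masked sentence, aspect term, polarity label."""
    _require_aspect(sentence, aspect)
    # Only the first occurrence of the aspect is replaced by the $T$ placeholder.
    masked = sentence.replace(aspect, "$T$", 1)
    return "\n".join([masked, aspect, polarity])


def to_inference_line(sentence: str, aspect: str,
                      polarity: Optional[str] = None) -> str:
    """Render one sample as a single line (ASSUMED inference layout):
    the aspect wrapped in [B-ASP]...[E-ASP], plus an optional reference label."""
    _require_aspect(sentence, aspect)
    marked = sentence.replace(aspect, f"[B-ASP]{aspect}[E-ASP]", 1)
    if polarity is None:
        return marked
    return f"{marked} $LABEL$ {polarity}"


if __name__ == "__main__":
    # Hypothetical rows; in practice they could come from csv.DictReader
    # or from the rows of a DataFrame.
    rows = [
        ("The food was great but the service was slow.", "food", "Positive"),
        ("The food was great but the service was slow.", "service", "Negative"),
    ]
    for sentence, aspect, polarity in rows:
        print(to_training_block(sentence, aspect, polarity))
        print(to_inference_line(sentence, aspect, polarity))
```

Under these assumptions, a sentence with several aspects yields one sample per aspect, so the same sentence appears once for each (aspect, polarity) pair.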

2: The tokenizer of each pretrained model is based on the implementation in the Hugging Face transformers library.

3: Generally, you don't need to implement your own tokenizer; you can just load a trained model, e.g., from the provided checkpoints. Doing so also loads the tokenizer, and the tokenization is then handled by the loaded model. Don't worry, it works fine.

All examples for APC are available at: https://github.com/yangheng95/PyABSA/tree/v2/examples-v2/aspect_polarity_classification