yangheng95 / PyABSA

Sentiment Analysis, Text Classification, Text Augmentation, Text Adversarial defense, etc.;
https://pyabsa.readthedocs.io

Different performance between model saved as fine-tuned PLM and state_dict #389

Open zedavid opened 8 months ago

zedavid commented 8 months ago

Version PyABSA = 2.3.4rc0, Torch = 2.1.1, Transformers = 4.35.2

Describe the bug I've fine-tuned a model with the FAST_LSA_S_V2 config on the same dataset using the APCTrainer, in two separate runs. In one run I saved the checkpoint as a state_dict file and in the other as a fine-tuned PLM. I've then run the model on sample data using both APC.SentimentClassifier and the HF text-classification pipeline, but I get different results despite the model being trained the same way on the same data.
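
For context, the two runs differed only in the checkpoint save mode, roughly along these lines; the ModelSaveOption flags and the dataset name below are an approximation from memory, not the exact code I ran:

from pyabsa import AspectPolarityClassification as APC
from pyabsa import ModelSaveOption

config = APC.APCConfigManager.get_apc_config_english()
config.model = APC.APCModelList.FAST_LSA_S_V2

# run 1: save only the model state_dict
APC.APCTrainer(
    config=config,
    dataset=uber_dataset,  # placeholder for the actual dataset
    checkpoint_save_mode=ModelSaveOption.SAVE_MODEL_STATE_DICT,
)

# run 2: save the fine-tuned PLM in huggingface format
APC.APCTrainer(
    config=config,
    dataset=uber_dataset,
    checkpoint_save_mode=ModelSaveOption.SAVE_FINE_TUNED_PLM,
)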

Code To Reproduce

Loading and testing the state_dict version:

from pyabsa import AspectPolarityClassification as APC

sentiment_model = APC.SentimentClassifier('checkpoints/fast_lsa_s_v2_uber_acc_98.2_f1_98.14/')
examples = [
    "ty images city officials are standing their ground in defense of the law delivery workers like all workers deserve fair pay for their labor and we are disappointed that [B-ASP]Uber[E-ASP] doordash grubhub and relay disagree vilda vera mayuga head of the city's department of consumer and worker protection said in",
    "as prepared just a mile away for it to be delivered to me the food arrived stone cold i complained to [B-ASP]Uber[E-ASP] and [B-ASP]Uber[E-ASP] told me to go pound sand readers purporting to work for [B-ASP]Uber[E-ASP] left dozens of comments castigating me and others for not preemptively tipping arguing that such deliveries are not worth"
]

sentiment_model.predict(
    text=examples,
    eval_batch_size=32,
)

output:

[{'text': "ty images city officials are standing their ground in defense of the law delivery workers like all workers deserve fair pay for their labor and we are disappointed that Uber doordash grubhub and relay disagree vilda vera mayuga head of the city's department of consumer and worker protection said in",
  'aspect': ['Uber'],
  'sentiment': ['Negative'],
  'confidence': [0.9339152574539185],
  'probs': [array([0.93391526, 0.05876274, 0.007322  ], dtype=float32)],
  'ref_sentiment': ['-100'],
  'ref_check': [''],
  'perplexity': 'N.A.'},
 {'text': 'as prepared just a mile away for it to be delivered to me the food arrived stone cold i complained to Uber and Uber told me to go pound sand readers purporting to work for Uber left dozens of comments castigating me and others for not preemptively tipping arguing that such deliveries are not worthwh',
  'aspect': ['Uber', 'Uber', 'Uber'],
  'sentiment': ['Negative', 'Negative', 'Negative'],
  'confidence': [0.9557020664215088, 0.9557020664215088, 0.9557020664215088],
  'probs': [array([0.95570207, 0.03284235, 0.01145565], dtype=float32),
   array([0.95570207, 0.03284235, 0.01145565], dtype=float32),
   array([0.95570207, 0.03284235, 0.01145565], dtype=float32)],
  'ref_sentiment': ['-100', '-100', '-100'],
  'ref_check': ['', '', ''],
  'perplexity': 'N.A.'}]

With the HF text-classification pipeline:

import re
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

model_tokenizer = AutoTokenizer.from_pretrained('checkpoints/fast_lsa_s_v2_uber_acc_98.2_f1_98.14/fine-tuned-pretrained-model/')
model = AutoModelForSequenceClassification.from_pretrained('checkpoints/fast_lsa_s_v2_uber_acc_98.2_f1_98.14/fine-tuned-pretrained-model/')
sentiment_pipeline = pipeline('text-classification', model=model, tokenizer=model_tokenizer, device=1)
examples_no_tag = [{'text':re.sub(r"\[B-ASP\](.+?)\[E-ASP\]", r"\1", ex), 'text_pair': 'Uber'} for ex in examples]
sentiment_pipeline(examples_no_tag, top_k = 3)

Output:

[[{'label': 'Neutral', 'score': 0.38777175545692444},
  {'label': 'Positive', 'score': 0.3418353199958801},
  {'label': 'Negative', 'score': 0.27039292454719543}],
 [{'label': 'Neutral', 'score': 0.3863997459411621},
  {'label': 'Positive', 'score': 0.3450266420841217},
  {'label': 'Negative', 'score': 0.2685735821723938}]]

Expected behavior I would expect some correspondence between the output probabilities of the two versions of the model.

Thanks!

yangheng95 commented 8 months ago

The model saved in the huggingface format is not intended for instant inference but for further fine-tuning; the state_dict is the recommended save mode. If you want to run a model with the pipeline, one has been released at: https://huggingface.co/yangheng/deberta-v3-base-absa-v1.1
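
For example, something like this should work with the released model, passing the aspect as text_pair just as in your snippet (a minimal sketch, not tested here):

from transformers import pipeline

absa_classifier = pipeline('text-classification', model='yangheng/deberta-v3-base-absa-v1.1')

# the aspect term goes in text_pair, the sentence in text
absa_classifier({'text': 'the food arrived stone cold i complained to Uber', 'text_pair': 'Uber'}, top_k=3)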

zedavid commented 8 months ago

I see. What is required to make a model runnable with the huggingface pipeline? Also, is there a PyABSA checkpoint for that huggingface model? I would like to replicate the results I get with the pipeline in PyABSA.

yangheng95 commented 8 months ago

I am sorry, but it is tricky to train models compatible with the huggingface pipeline, and I have since cleaned up the original materials such as the code, so I am afraid I cannot provide detailed help with that.