xyq7 / GradSafe

Official Code for ACL 2024 paper "GradSafe: Detecting Unsafe Prompts for LLMs via Safety-Critical Gradient Analysis"
Apache License 2.0

Precision is about 0.444? #1

Closed shanpoyang654 closed 4 months ago

shanpoyang654 commented 5 months ago

Thanks for your code!

I am reaching out to discuss some observations I've made while using your codebase. I've run a series of tests on the XSTest dataset located at /data/xstest/xstest_v2_prompts.csv and have not made any alterations to the original code.

Upon running the tests, I've encountered the following results:

Precision: 0.4444444444444444 Recall: 1.0 F1 Score: 0.6153846153846153 AUPRC: 0.2706038220880872

I've noticed that the Precision appears to be quite low, which has a knock-on effect on the overall F1 Score and AUPRC. I am wondering if this could be related to the classification threshold, which is currently set to 0.25 in your code repository.

Would you recommend adjusting this threshold to potentially improve the precision? Or do I need to make other adjustments?
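For reference, this is my current understanding of the classification step (a toy sketch with made-up feature values; only the list comprehension mirrors the repository, the sklearn call is just for illustration):

```python
from sklearn.metrics import precision_score

# Made-up cosine-similarity features (one per prompt) and ground-truth labels,
# only to illustrate how a 0.25 threshold behaves on values in [-1, 1].
cos_all = [0.12, 0.31, 0.84, 0.05, 0.67]
true_labels = [0, 1, 1, 0, 0]  # 1 = unsafe, 0 = safe

threshold = 0.25
predicted_labels = [1 if feature >= threshold else 0 for feature in cos_all]

print(predicted_labels)                                # [0, 1, 1, 0, 1]
print(precision_score(true_labels, predicted_labels))  # 2 TP / 3 flagged ≈ 0.67
```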

I truly appreciate any insights or recommendations you might have on this matter and look forward to your advice.

Thank you for your time and consideration.

shanpoyang654 commented 5 months ago

I've noticed that the Precision appears to be quite low, which has a knock-on effect on the overall F1 Score and AUPRC. I am wondering if this could be related to the classification threshold, which is currently set to 0.25 in your code repository.

I also adjusted the threshold from 0.25 to 0.999 in predicted_labels = [1 if feature >= 0.25 else 0 for feature in cos_all], but the precision is still 0.444:

Precision: 0.4444444444444444 Recall: 1.0 F1 Score: 0.6153846153846153 AUPRC: 0.2706038220880872

It also seems that row_cos = torch.nan_to_num(F.cosine_similarity(grad_norm, (gradient_norms_compare[name]), dim=1)) # row sim returns cosine-similarity values outside [-1, 1] when F.cosine_similarity produces inf / -inf.

xyq7 commented 5 months ago

Hi, thanks for your interest! Could you please let me know which base model you are using for the evaluation?

xyq7 commented 5 months ago

We suggest using the Llama-2 chat model "https://huggingface.co/meta-llama/Llama-2-7b-chat-hf" to reproduce the results. Thanks!

shanpoyang654 commented 5 months ago

We suggest using the Llama-2 chat model "https://huggingface.co/meta-llama/Llama-2-7b-chat-hf" to reproduce the results. Thanks!

Thank you for your reply! I used 'Llama-2-7b-chat-hf' and got this precision on XSTest:

Precision: 0.4444444444444444 Recall: 1.0 F1 Score: 0.6153846153846153 AUPRC: 0.2706038220880872

I did not change any code except for some handling of the tensors' device:

for model_id in ['./model/Llama-2-7b-chat-hf']:
    gradient_norms_compare, minus_row_cos, minus_col_cos = find_critical_para(model_id, device)
    df = pd.read_csv('./data/xstest/xstest_v2_prompts.csv')

I also noticed that when I run the program, the features are far outside [0, 1] (almost 200), so the final predicted_labels are all 1:

predicted_labels = [1 if feature >= 0.25 else 0 for feature in cos_all]
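If it helps, a quick sanity check shows that flagging every prompt as unsafe reproduces these exact numbers, assuming XSTest v2's usual composition of 250 safe and 200 unsafe prompts (please double-check this split against the CSV):

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# All-positive predictions on an assumed 250 safe / 200 unsafe split.
true_labels = [0] * 250 + [1] * 200
predicted_labels = [1] * 450

print(precision_score(true_labels, predicted_labels))  # 0.4444444444444444
print(recall_score(true_labels, predicted_labels))     # 1.0
print(f1_score(true_labels, predicted_labels))         # 0.6153846153846153
```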

Hope for your kind reply! Thanks for your effort and your open-sourced code!

xyq7 commented 5 months ago

Thanks for your reply! We have reproduced the results on several different devices and haven't encountered this problem before, so we may need more information to debug it with you. Since the feature is computed with a cosine-similarity function, do you have any idea why it would be around 200? Thanks!

shanpoyang654 commented 5 months ago

Thanks for your reply! We have reproduced the results on several different devices and haven't encountered this problem before, so we may need more information to debug it with you. Since the feature is computed with a cosine-similarity function, do you have any idea why it would be around 200? Thanks!

In the find_critical_para(model_id, device) function:

row_cos = torch.nan_to_num(F.cosine_similarity(grad_norm, (gradient_norms_compare[name]), dim=1)) # row sim
col_cos = torch.nan_to_num(F.cosine_similarity(grad_norm, (gradient_norms_compare[name]), dim=0)) # col sim

When I run this code, row_cos contains 65504. I think this is because F.cosine_similarity()'s result contains inf/nan/-inf, and torch.nan_to_num() then converts those values to the largest finite value of the tensor's dtype (65504 for float16).

Have you ever met this problem? Could you give me some suggestions on how to solve it? Hope for your kind reply, and thank you!
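Here is a minimal reproduction of the saturation I mean (the fp16 overflow scenario is my guess at the cause, not something I have traced through the repository end to end):

```python
import torch
import torch.nn.functional as F

# With default arguments, torch.nan_to_num replaces +/-inf with the largest
# finite value of the tensor's dtype; for float16 that is +/-65504.
x = torch.tensor([float("inf"), float("nan"), float("-inf")], dtype=torch.float16)
print(torch.nan_to_num(x))  # tensor([ 65504., 0., -65504.], dtype=torch.float16)

# Large gradient values can already overflow inside the cosine-similarity
# computation in half precision, whereas float32 stays well-behaved.
a = torch.full((1, 4), 60000.0, dtype=torch.float16)
print(F.cosine_similarity(a, a, dim=1))                  # nan in fp16 (intermediate norms overflow)
print(F.cosine_similarity(a.float(), a.float(), dim=1))  # ~1.0 in fp32
```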

xyq7 commented 5 months ago

Thanks for the additional information! We haven't encountered this problem before.

Could you please check the intermediate results, such as grad_norm and gradient_norms_compare[name], for possible inf/nan/-inf values?

Since these are gradients, perhaps we can also check whether there is a problem in the model's backward pass (e.g., the loss).
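Something like this should work for the check (a sketch; the commented usage lines reference the existing grad_norm / gradient_norms_compare variables in find_critical_para and are only meant as an example):

```python
import torch

def report_nonfinite(tag, tensor):
    """Print how many entries of a tensor are +inf, -inf, or nan."""
    pos_inf = (tensor == float("inf")).sum().item()
    neg_inf = (tensor == float("-inf")).sum().item()
    nans = torch.isnan(tensor).sum().item()
    if pos_inf or neg_inf or nans:
        print(f"{tag}: dtype={tensor.dtype}, +inf={pos_inf}, -inf={neg_inf}, nan={nans}")

# Example usage inside find_critical_para, right before the similarities are computed:
#     report_nonfinite(f"grad_norm ({name})", grad_norm)
#     report_nonfinite(f"compare ({name})", gradient_norms_compare[name])
```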

shanpoyang654 commented 5 months ago

Thanks for the additional information! We haven't encountered this problem before.

Could you please check the intermediate results, such as grad_norm and gradient_norms_compare[name], for possible inf/nan/-inf values?

Since these are gradients, perhaps we can also check whether there is a problem in the model's backward pass (e.g., the loss).

I'm sorry, but I still find that in

row_cos = torch.nan_to_num(F.cosine_similarity(grad_norm, (gradient_norms_compare[name]), dim=1), nan=0.0, posinf=1.0, neginf=-1.0) # row sim

the output of F.cosine_similarity(grad_norm, (gradient_norms_compare[name]), dim=1) contains inf. I re-downloaded the code and re-ran it, but I still get the following result on the XSTest dataset:

Precision: 0.4444444444444444 Recall: 1.0 F1 Score: 0.6153846153846153 AUPRC: 0.2706038220880872

The ToxicChat dataset result is as follows:

Precision: 0.07200472162108991 Recall: 1.0 F1 Score: 0.13433657551844375 AUPRC: 0.04106359311386625

I think this is because the model predicts all samples as '1'.

Did you ever get results like mine? I'm confused about this.

Hope for your kind reply!

xyq7 commented 4 months ago

Thanks for the additional information. We haven't encountered this problem before. The main question seems to be why the cosine_similarity calculation produces Inf. In your run, could you please check: are there any abnormal gradients? Is the loss normal? It would also be helpful if you could describe exactly what you modified in the code.
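For concreteness, this is the kind of check I have in mind, plus one possible workaround (casting to float32 before the cosine similarity); this is a rough sketch and a suggestion, not the code as released:

```python
import torch
import torch.nn.functional as F

def check_backward(model, loss):
    """Report a non-finite loss and any non-finite gradients after backward()."""
    if not torch.isfinite(loss).all():
        print(f"non-finite loss: {loss}")
    for name, param in model.named_parameters():
        if param.grad is not None and not torch.isfinite(param.grad).all():
            print(f"non-finite gradient in {name} (dtype={param.grad.dtype})")

# e.g. in the per-prompt loop:
#     loss.backward()
#     check_backward(model, loss)
#     row_cos = torch.nan_to_num(F.cosine_similarity(
#         grad_norm.float(), gradient_norms_compare[name].float(), dim=1))
```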

Looking forward to your further information. You can also email me directly at yxieay@connect.ust.hk if I don't check the GitHub issues in time.