xyq7 / GradSafe

Official Code for ACL 2024 paper "GradSafe: Detecting Unsafe Prompts for LLMs via Safety-Critical Gradient Analysis"
Apache License 2.0

a question about the code #2

Closed huangkaipeng4399 closed 2 months ago

huangkaipeng4399 commented 2 months ago

Hello! Sorry to bother you, but I have a question about the code.

While reading your code, I noticed that in code/find_critical_parameters.py, when specifying the labels for the input text, you use -100 to mask the tokens preceding "sure". Why not just set the label to "sure" directly? What is the advantage of the approach your code takes?

Thanks!!

xyq7 commented 2 months ago

Thank you for your question! The masking-based implementation is just one way to compute gradients based solely on the output “Sure” within the current framework. Other implementations that follow the same logic may also work. Thanks!
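For readers following the thread, here is a minimal pure-Python sketch of the masking idea discussed above. It is not the repository's code. The helper names (`build_labels`, `masked_cross_entropy`), the toy token ids, and the assumption that the logits are already aligned with the labels are all hypothetical and for illustration only. What it relies on is that -100 is the default `ignore_index` of PyTorch's `CrossEntropyLoss`, and that causal LM training code in the Hugging Face style expects labels to have the same length as the input ids. Under that convention, masking the prompt positions is the usual way to make the loss, and therefore the gradients, depend only on the target token.

```python
import math

# -100 is the default ignore_index of torch.nn.CrossEntropyLoss; this sketch
# reproduces that behaviour without depending on PyTorch.
IGNORE_INDEX = -100


def build_labels(prompt_ids, target_ids):
    """Concatenate prompt and target ids; mask every prompt position in the labels.

    Hypothetical helper: the labels keep the same length as the inputs, but only
    the target positions (e.g. the token for "Sure") carry a real label.
    """
    input_ids = list(prompt_ids) + list(target_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(target_ids)
    return input_ids, labels


def masked_cross_entropy(logits, labels):
    """Mean negative log-likelihood over positions whose label is not IGNORE_INDEX.

    Assumes logits[i] already holds the scores that predict labels[i]
    (real causal LM code shifts logits and labels by one position first).
    """
    total, count = 0.0, 0
    for scores, label in zip(logits, labels):
        if label == IGNORE_INDEX:
            continue  # masked position: contributes no loss and no gradient
        peak = max(scores)
        log_z = peak + math.log(sum(math.exp(s - peak) for s in scores))
        total += log_z - scores[label]
        count += 1
    return total / count if count else 0.0


# Toy example: three prompt tokens followed by one target token (id 3 stands in for "Sure").
input_ids, labels = build_labels([1, 2, 0], [3])

# Two sets of logits that differ only at the masked prompt positions.
logits_a = [[0.0, 0.0, 0.0, 0.0] for _ in range(4)]
logits_b = [[9.0, 0.0, 0.0, 0.0], [0.0, 7.0, 0.0, 0.0], [1.0, 2.0, 3.0, 4.0], [0.0, 0.0, 0.0, 0.0]]

loss_a = masked_cross_entropy(logits_a, labels)
loss_b = masked_cross_entropy(logits_b, labels)
```

In this toy setup `loss_a` and `loss_b` are equal, because the two sets of logits differ only at masked positions. The loss is a function of the target position alone, which matches the stated goal of computing gradients based solely on the output “Sure”.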

huangkaipeng4399 commented 2 months ago

Got it. Thanks again!