Closed — wsynuiag closed this issue 6 days ago
Hi,
Similar to BTI-DBF, we decide whether a model is backdoored based on whether BAN can correctly identify the target class.
For example, the prediction results of a WaNet model are:
Prediction distribution: [5000. 0. 0. 0. 0. 0. 0. 0. 0. 0.] Prediction targets to: 0
For a benign model:
Prediction distribution: [631. 279. 542. 438. 691. 735. 494. 544. 459. 187.] Prediction targets to: 5
Thanks for your quick reply. From the example above, could a benign model be misclassified as a backdoored model?
It's possible, but unlikely. For benign models, the predictions are distributed relatively evenly across classes, whereas for backdoored models they are concentrated in the target class.
Thank you. So do I need to judge this manually, or is there an automatic way (e.g., a threshold or metric) to flag a model as backdoored?
It is easy to script this check. In our case, we flag a model as backdoored if the number of samples predicted as the target class exceeds 2500. That said, we also manually verified all detection results, including those of the other baselines.
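The check described above can be sketched as a few lines of Python. Note this is a minimal illustration based on the reply, not the actual script from the repository: the function name, return format, and the default threshold of 2500 (half of the 5000 test samples) are assumptions drawn from this thread.

```python
def is_backdoored(pred_counts, threshold=2500):
    """Flag a model as backdoored if its most-predicted class receives
    more than `threshold` of the test samples.

    pred_counts: per-class prediction counts, e.g. the distributions
    printed by the detection script. Returns (flag, target_class).
    """
    # The candidate target class is the one with the most predictions.
    target = max(range(len(pred_counts)), key=lambda c: pred_counts[c])
    return pred_counts[target] > threshold, target

# The two example distributions from this thread:
wanet  = [5000, 0, 0, 0, 0, 0, 0, 0, 0, 0]                    # backdoored
benign = [631, 279, 542, 438, 691, 735, 494, 544, 459, 187]   # benign

print(is_backdoored(wanet))   # concentrated on class 0 -> flagged
print(is_backdoored(benign))  # relatively even -> not flagged
```

With the thread's examples, the WaNet model is flagged with target class 0, while the benign model is not flagged (its most-predicted class, 5, receives only 735 of 5000 samples).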
I understand. Thanks for your time.
I understand the mechanism in ban_detection.py. However, from results such as acc and reg, how can I judge whether a model is backdoored? Specifically, what thresholds were used in the paper for the different model architectures and datasets?