Closed Gavinwxy closed 1 year ago
Hi @Gavinwxy
Sorry for the late reply; I was rushing to prepare a rebuttal. I hope you are still interested in this topic.
In short, rather than using a purely random vector, we infer the VAT perturbation noise, as shown in the code below.
pred_hat = (decoder1(x_detached + xi * d) + decoder2(x_detached + xi * d))/2
The perturbation is based on virtual adversarial training (VAT). The idea is to maximize the distance between the perturbed prediction and the original one: with the network held fixed, we compute gradient-guided noise that maximizes this distance, so the VAT noise confuses the network as much as possible. To achieve the desired perturbation effect, the quality of the pseudo labels matters, because they determine the gradient direction.
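To make the idea above concrete, here is a minimal numpy sketch of one VAT power-iteration step. It is not the repository's code: `softmax(W @ x)` is a toy stand-in for the fixed decoder, the gradient is estimated by finite differences instead of autograd, and `xi`/`eps` values are illustrative. Note the `- 0.5` that centers the initial random direction around zero.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def kl(p, q):
    # KL divergence between two discrete distributions (smoothed for stability)
    return float(np.sum(p * (np.log(p + 1e-8) - np.log(q + 1e-8))))

def vat_noise(x, W, xi=0.1, eps=1.0, seed=0):
    """One power-iteration step of VAT on a toy linear-softmax model.

    f(x) = softmax(W @ x) stands in for the fixed network; the gradient
    of the divergence w.r.t. the direction d is estimated numerically.
    """
    rng = np.random.default_rng(seed)
    p = softmax(W @ x)                   # fixed prediction (the "pseudo label")
    d = rng.random(x.shape) - 0.5        # centered random starting direction
    d /= np.linalg.norm(d) + 1e-8
    g = np.zeros_like(d)
    h = 1e-4
    for i in range(d.size):
        dp, dm = d.copy(), d.copy()
        dp[i] += h
        dm[i] -= h
        # central difference of KL(p, f(x + xi*d)) w.r.t. d[i]
        g[i] = (kl(p, softmax(W @ (x + xi * dp)))
                - kl(p, softmax(W @ (x + xi * dm)))) / (2 * h)
    # adversarial perturbation: step of size eps along the steepest direction
    return eps * g / (np.linalg.norm(g) + 1e-8)
```

The returned noise is what would be added to the detached features before re-running the decoders, as in the `x_detached + xi * d` expression quoted above.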
In our work, we found that the teachers/students stay in the same domain (feel free to mean-average the parameters of both teachers/students and check the results), and the ensemble of their predictions provides more reliable pseudo labels during training. Hence, we employ the teacher ensemble to calculate the VAT perturbation in this line.
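The ensembling step is simply an average of the two teachers' probability maps, as in the quoted `(decoder1(...) + decoder2(...))/2` expression. A toy numpy illustration (not the repository's code; `p1`, `p2` stand for the two softmax outputs):

```python
import numpy as np

def ensemble_pseudo_label(p1, p2):
    """Average two teachers' class-probability vectors; the mean of two
    valid distributions is itself a valid distribution."""
    return (p1 + p2) / 2
```

The argmax of the averaged distribution then serves as the pseudo label that guides the VAT gradient direction.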
Cheers, Yuyuan
Thanks for your reply. I will check the original VAT paper for more details.
You are very welcome.
Hi,
Thanks for sharing the code. I am trying to figure out the design of the feature perturbation. One question: in this line of code, what is the purpose of subtracting 0.5 from the random vector? Is it simply an engineering trick, or is it related to some theoretical consideration?