anonymous-for-ICML opened this issue 4 years ago
Interesting work! Another related question for the authors (@xujinfan @quanmingyao ) here: how is the C_2 constraint applied in the proximal algorithm (PA), as suggested in step 4 of Alg. 2 in the paper? It seems that the code uses a clip function, but not at the position proposed in the paper.
I found that in your 'proximal_step' you directly apply the constraints to A and obtain a discrete architecture with 0/1 coding, which is equivalent to an argmax. I cannot find any code showing how you use the PA to solve this problem. If we directly use argmax to sample architectures during supernet training, the gradient cannot be computed directly (as pointed out by Stochastic NAS). I am really interested in your work on incorporating PA into NAS; could you help me resolve this concern?
Thanks for your attention! What we do here is borrowed from "BinaryConnect"; you can refer to that work. In the standard proximal step, the maximum value is retained and the others are set to 0. To make the forward pass of the architecture behave better, we further set the maximum value to 1, which means that this operation is fully selected.
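For readers who run into the same question, here is a minimal sketch of this BinaryConnect-style discretization (assuming PyTorch; the function name and signature are illustrative, not the repository's exact `proximal_step`):

```python
import torch

def discretize_architecture(A):
    # Proximal step onto the one-hot set: keep the row-wise max, zero the rest,
    # and (as described above) set the retained entry to 1 instead of its value.
    with torch.no_grad():
        idx = A.argmax(dim=-1, keepdim=True)
        A_bar = torch.zeros_like(A).scatter_(-1, idx, 1.0)
    # BinaryConnect-style trick: the forward pass sees the discrete 0/1 tensor,
    # while the gradient flows through to the continuous parameters A.
    return A_bar + A - A.detach()
```

The gradient computed with the discrete architecture is then applied to the continuous A, much as BinaryConnect accumulates gradients from the binarized weights into the full-precision ones.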
Yes, thanks for your advice. Regarding the position of the clip: in practice, we found that clipping A at step 4 or at step 6 makes little difference.
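The C_2 (box) constraint itself amounts to an element-wise clip of the continuous parameters to [0, 1]; a sketch under the same assumptions as above (the function name is illustrative, not the repository's code):

```python
import torch

def clip_to_box(A, lo=0.0, hi=1.0):
    # C_2 constraint: keep the continuous architecture parameters inside [lo, hi].
    # Per the reply above, applying this right before the discretization (step 4)
    # or right after the gradient update (step 6) makes little practical difference.
    with torch.no_grad():
        A.clamp_(min=lo, max=hi)
    return A
```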