December-boy opened this issue 3 years ago
Thanks for your issue. To implement the training and search of the supernet, we need to set the head number for each batch. Therefore, the initial "head_dim" is only used to compute the scale value (self.scale), while "self.heads = None" at construction time; the heads are assigned during the forward process (please refer to core/model/net.py). According to the selected architecture, the value of "self.heads" changes for each batch of data.
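To illustrate the idea, here is a minimal sketch (not the actual core/model/net.py code) of an attention module where head_dim is only used for the scale and the head count is supplied per batch; the names, shapes, and dimensions below are illustrative assumptions, not the ViTAS implementation.

```python
import torch
import torch.nn as nn


class SupernetAttention(nn.Module):
    def __init__(self, dim, head_dim):
        super().__init__()
        # head_dim is only needed to compute the scaling factor;
        # the number of heads is decided later, per batch.
        self.scale = head_dim ** -0.5
        self.heads = None  # set per batch according to the sampled architecture
        self.qkv = nn.Linear(dim, dim * 3, bias=False)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, heads):
        # The sampled architecture fixes the head count for this batch.
        self.heads = heads
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, heads, C // heads).permute(2, 0, 3, 1, 4)
        q, k, v = qkv[0], qkv[1], qkv[2]
        attn = (q @ k.transpose(-2, -1)) * self.scale
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)


# Example: the same module run with different head counts on different batches.
attn = SupernetAttention(dim=192, head_dim=64)
x = torch.randn(2, 16, 192)
for h in (3, 6):  # sampled head numbers, e.g. drawn from {3, 6, 12, 16}
    print(h, attn(x, heads=h).shape)
```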
Thanks for your reply.
One more thing: I wonder how to set the selected architecture, since the initial supernet looks like a block-level search space where every block is identical, with 4 heads and a 1440 output dimension. Oh, I have figured it out.
Thanks for your question. To retrain a searched architecture, you can refer to config/retrain/ViTAS_1G_retrain.yaml. As in lines 82 and 122 of that file, "net_id" defines the retrained architecture within the pre-set search space (lines 78-81). Alternatively, you can use a pre-defined model as the retrained architecture, as in config/retrain/ViTAS_1.3G_retrain.yaml (lines 80-83); with this setting, you directly train your defined architecture and do not need "net_id" in your yaml.
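As a purely hypothetical sketch of what a "net_id" conveys: it selects one candidate per searched choice from the pre-set search space, which can then be decoded into per-block settings. The id format and the candidate values below are assumptions for illustration; the real search space and net_id live in the yaml files mentioned above.

```python
# Assumed candidate values (the paper's head choices plus made-up output dims);
# the actual search space is defined in the retrain yaml, not here.
head_space = [3, 6, 12, 16]
dim_space = [320, 640, 1080, 1440]


def decode_net_id(net_id):
    """Turn a string like '1-3,0-2,3-3' into a list of (heads, out_dim) per block."""
    blocks = []
    for token in net_id.split(","):
        head_idx, dim_idx = (int(i) for i in token.split("-"))
        blocks.append((head_space[head_idx], dim_space[dim_idx]))
    return blocks


print(decode_net_id("1-3,0-2,3-3"))  # [(6, 1440), (3, 1080), (16, 1440)]
```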
Thanks! Can you tell me the cost of the search, or of the whole process? What type of GPU was used, how many GPUs, and how many days did it take?
Thanks for your question. I used 32 V100 cards with 32 GB of GPU RAM each for the search.
It takes about 2-3 days to search a ViT architecture.
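For reference, 32 GPUs running for 2-3 days works out to roughly 1,500-2,300 GPU-hours per search.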
I've trained the supernet, and the sampled results look strange. As shown in the figure below, the test accuracy is very low. Is that a normal situation?
Yes, during sampling, the accuracy of a ViT architecture evaluated within the supernet is relatively low; that is normal.
Thanks for your nice work. I notice that the heads in the Attention module are set to None; does this mean the heads are set to 4 in the supernet? As listed in the paper, the heads are selected from {3, 6, 12, 16}.