Open Vickeyhw opened 9 months ago
If so, in full-stage knowledge distillation, where the image encoder is randomly initialized, is the mask decoder fine-tuned with a smaller learning rate than the lightweight image encoder? Is this consistent with your implementation?
Yes, the mask decoder weights are inherited from the teacher, and we use a smaller learning rate for the mask decoder than for the image encoder.
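For readers who want to reproduce this setup, here is a minimal sketch of how the two learning rates could be separated into optimizer parameter groups. It is not taken from the authors' code: the helper name `build_param_groups`, the `image_encoder.` and `mask_decoder.` name prefixes, and the learning-rate values are all assumptions chosen for illustration.

```python
from typing import Any, Dict, Iterable, List, Tuple


def build_param_groups(
    named_params: Iterable[Tuple[str, Any]],
    encoder_lr: float = 1e-3,   # hypothetical value for the randomly initialized encoder
    decoder_lr: float = 1e-4,   # hypothetical smaller value for the inherited decoder
    encoder_prefix: str = "image_encoder.",  # assumed module name
    decoder_prefix: str = "mask_decoder.",   # assumed module name
) -> List[Dict[str, Any]]:
    """Split (name, parameter) pairs into two groups with separate learning rates.

    Parameters whose names match neither prefix are left out of both groups,
    so they would not be updated by an optimizer built from the result.
    """
    encoder_params: List[Any] = []
    decoder_params: List[Any] = []
    for name, param in named_params:
        if name.startswith(encoder_prefix):
            encoder_params.append(param)
        elif name.startswith(decoder_prefix):
            decoder_params.append(param)
    return [
        {"params": encoder_params, "lr": encoder_lr},
        {"params": decoder_params, "lr": decoder_lr},
    ]


# Toy stand-ins for model.named_parameters(); strings replace real tensors here.
toy_named_params = [
    ("image_encoder.patch_embed.weight", "w0"),
    ("image_encoder.blocks.0.weight", "w1"),
    ("mask_decoder.transformer.weight", "w2"),
    ("prompt_encoder.pe_layer.weight", "w3"),
]
groups = build_param_groups(toy_named_params)
```

With a real PyTorch model, passing `model.named_parameters()` to this helper should produce a list that `torch.optim.AdamW` accepts directly as its first argument, since PyTorch optimizers take per-group options such as `lr` in this dictionary form.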