Thanks for your great work, and I really appreciate that the code is released! However, I have a few questions while trying to run the distillation experiments:
How is the process actually carried out? From the source code, I gather that, given a teacher model (e.g. a fine-tuned DeBERTa), the student model (e.g. a pretrained DeBERTa, without fine-tuning) is first pruned directly by calling `utils.prune`, with no gradual pruning/scheduling, and then trained on the target dataset with an added distillation loss. If that is correct, what is `stored_model_path` used for?
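To make sure I understand the pipeline, here is a minimal sketch of the distillation objective I am assuming (the function names, `alpha`, and `T` are my own guesses, not values from the repo):

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax over a list of logits."""
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distill_loss(student_logits, teacher_logits, label, alpha=0.5, T=2.0):
    """Hypothetical combined objective:
    alpha * CE(student, label) + (1 - alpha) * T^2 * KL(teacher || student).
    """
    # Hard-label cross-entropy on the student's (T=1) distribution
    ce = -math.log(softmax(student_logits)[label])
    # Soft-label KL between temperature-softened teacher and student
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_t, p_s))
    return alpha * ce + (1 - alpha) * T * T * kl
```

If the actual training loop differs from this (e.g. distilling intermediate representations rather than logits), a pointer to the relevant code would be much appreciated.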
When running the script with DeBERTa, there seems to be an issue during the forward pass. With `self.sparse_weight_pruned`, `self.SX`, and `self.SX_deberta` all non-`None`, after the first forward pass execution reaches line 116 of `utils.py`. However, `SX` is computed from the current `x`, while `self.SX` comes from previous samples with a different batch size `B` or sequence length `L`, which leads to a size-mismatch error. May I know how this can be resolved? I also wonder why `self.SX` and `self.SX_deberta` need to be cached at all.
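For context, the kind of guard I would expect around the cached tensor looks like the sketch below. This is my own illustration, not the repo's code; the string tuple just stands in for the real cached product:

```python
# Hypothetical sketch of a shape-aware cache guard for the forward pass.
# The cached self.SX is only valid while the input keeps the same
# (batch, seq_len) shape; otherwise it must be recomputed.
class CachedProduct:
    def __init__(self):
        self.SX = None            # cached result (stand-in for the real tensor)
        self.cached_shape = None  # (B, L) of the input that produced it

    def forward(self, x_shape):
        # Recompute when there is no cache yet, or when B or L changed
        if self.SX is None or self.cached_shape != x_shape:
            self.SX = ("SX for", x_shape)  # placeholder for the real S @ x
            self.cached_shape = x_shape
        return self.SX
```

With a check like this, a batch with a different `B` or `L` would trigger recomputation instead of a size mismatch; without it, the stale `self.SX` from the previous batch is reused.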
Also, may I know the learning-rate schedule used in the experiments? According to `train_glue.sh`, 480 LR warmup steps are used for MNLI, but the warmup settings for the other experiments are not mentioned in the paper.

Thank you!
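For reference, the schedule I am assuming from the script is the common linear warmup followed by linear decay; the base LR and total steps below are placeholder values, not the paper's settings:

```python
def lr_at_step(step, base_lr=2e-5, warmup_steps=480, total_steps=10000):
    """Linear warmup to base_lr over warmup_steps, then linear decay to 0.

    A sketch of the standard 'linear' schedule; base_lr and total_steps
    here are illustrative, not the values used in the experiments.
    """
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))
```

If a different schedule (e.g. constant after warmup, or cosine decay) was used for the non-MNLI tasks, it would be great to have that documented.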