Thanks for your great work, and I really appreciate that the code is released! However, I have a few questions while trying to run the distillation experiments:
How is the process actually carried out? From the source code, I gather that, given a teacher model (e.g. a fine-tuned DeBERTa), the student model (e.g. a pretrained DeBERTa, without fine-tuning) is first pruned directly by calling `utils.prune`, with no gradual pruning/scheduling, and then trained on the target dataset with an added distillation loss. If that is correct, what is `stored_model_path` used for?
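To make sure I understand the pipeline, here is a minimal sketch of the distillation objective I am assuming (the function names, `alpha`, and `T` are my own guesses, not values from the repo):

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax over a list of logits."""
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distill_loss(student_logits, teacher_logits, label, alpha=0.5, T=2.0):
    """Hypothetical combined objective:
    alpha * CE(student, label) + (1 - alpha) * T^2 * KL(teacher || student).
    """
    # Hard-label cross-entropy on the student's (T=1) distribution
    ce = -math.log(softmax(student_logits)[label])
    # Soft-label KL between temperature-softened teacher and student
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_t, p_s))
    return alpha * ce + (1 - alpha) * T * T * kl
```

If the actual training loop differs from this (e.g. distilling intermediate representations rather than logits), a pointer to the relevant code would be much appreciated.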
When running the script with DeBERTa, there seems to be an issue during the forward pass. With `self.sparse_weight_pruned`, `self.SX`, and `self.SX_deberta` all non-`None`, after the first forward pass execution reaches line 116 of `utils.py`. However, `SX` is computed from the current `x`, while `self.SX` comes from previous samples with a different batch size `B` or sequence length `L`, which leads to a size-mismatch error. May I know how this can be resolved? I also wonder why `self.SX` and `self.SX_deberta` need to be cached at all.
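For context, the kind of guard I would expect around the cached tensor looks like the sketch below. This is my own illustration, not the repo's code; the string tuple just stands in for the real cached product:

```python
# Hypothetical sketch of a shape-aware cache guard for the forward pass.
# The cached self.SX is only valid while the input keeps the same
# (batch, seq_len) shape; otherwise it must be recomputed.
class CachedProduct:
    def __init__(self):
        self.SX = None            # cached result (stand-in for the real tensor)
        self.cached_shape = None  # (B, L) of the input that produced it

    def forward(self, x_shape):
        # Recompute when there is no cache yet, or when B or L changed
        if self.SX is None or self.cached_shape != x_shape:
            self.SX = ("SX for", x_shape)  # placeholder for the real S @ x
            self.cached_shape = x_shape
        return self.SX
```

With a check like this, a batch with a different `B` or `L` would trigger recomputation instead of a size mismatch; without it, the stale `self.SX` from the previous batch is reused.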
Also, may I know the learning-rate schedule used in the experiments? According to `train_glue.sh`, 480 LR warmup steps are used for MNLI, but the warmup settings for the other experiments are not mentioned in the paper.

Thank you!
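For reference, the schedule I am assuming from the script is the common linear warmup followed by linear decay; the base LR and total steps below are placeholder values, not the paper's settings:

```python
def lr_at_step(step, base_lr=2e-5, warmup_steps=480, total_steps=10000):
    """Linear warmup to base_lr over warmup_steps, then linear decay to 0.

    A sketch of the standard 'linear' schedule; base_lr and total_steps
    here are illustrative, not the values used in the experiments.
    """
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))
```

If a different schedule (e.g. constant after warmup, or cosine decay) was used for the non-MNLI tasks, it would be great to have that documented.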