yxli2123 / LoSparse


Questions regarding distillation #4

Open mutiann opened 1 month ago

mutiann commented 1 month ago

Thanks for your great work, and I really appreciate that the code has been released! However, I have some questions while trying to carry out the experiments with distillation:

  1. How is the process actually carried out? From the source code, I assume that, given a teacher model (e.g. a fine-tuned DeBERTa), the student model (e.g. a pretrained DeBERTa without fine-tuning) is first pruned directly by calling utils.prune, with no gradual pruning/scheduling, and is then trained on the target dataset with a distillation loss added (see my sketch of this assumed workflow below). If so, what is stored_model_path for?

  2. When running the script with DeBERTa, there seems to be an issue during the forward pass. With self.sparse_weight_pruned, self.SX, and self.SX_deberta all not None, execution enters L116 of utils.py after the first forward pass. However, SX is computed from the current x, while self.SX comes from previous samples with a different B or L, which leads to a size-mismatch error (a toy illustration of the mismatch is below). Can this be resolved? I also wonder why self.SX and self.SX_deberta need to be stored at all.
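
To make question 1 concrete, here is a minimal sketch of the workflow I currently assume, in plain PyTorch with toy linear layers and a hypothetical one-shot `prune` stand-in for utils.prune; it is not the actual LoSparse code, just my reading of it:

```python
import torch
import torch.nn.functional as F

# Toy stand-ins: in the real setup these would be a fine-tuned DeBERTa (teacher)
# and a pretrained, not-yet-fine-tuned DeBERTa (student).
teacher = torch.nn.Linear(16, 3)
student = torch.nn.Linear(16, 3)

def prune(model, ratio=0.5):
    """Hypothetical stand-in for utils.prune: one-shot magnitude pruning,
    applied once before training (no gradual schedule)."""
    for p in model.parameters():
        if p.dim() > 1:
            k = max(1, int(p.numel() * ratio))
            threshold = p.abs().flatten().kthvalue(k).values
            p.data[p.abs() <= threshold] = 0.0
    return model

# Step 1: prune the student once, directly.
student = prune(student)

# Step 2: train on the target task with a distillation term added.
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-4)
x, y = torch.randn(8, 16), torch.randint(0, 3, (8,))

for _ in range(10):
    with torch.no_grad():
        t_logits = teacher(x)
    s_logits = student(x)
    task_loss = F.cross_entropy(s_logits, y)
    kd_loss = F.kl_div(F.log_softmax(s_logits, dim=-1),
                       F.softmax(t_logits, dim=-1),
                       reduction="batchmean")
    loss = task_loss + kd_loss  # the weighting of the two terms is a guess
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```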
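And for question 2, a generic toy example (again not the actual utils.py code) of why caching an input-dependent product like self.SX from one batch and reusing it on a batch with a different B or L cannot line up:

```python
import torch

d, r = 16, 4
S = torch.randn(d, r)          # stand-in for a sparse/low-rank factor
cached_SX = None               # stand-in for self.SX

def forward(x):
    global cached_SX
    SX = x @ S                 # shape depends on the current batch: (B, L, r)
    if cached_SX is None:
        cached_SX = SX         # first forward: cache it
        return SX
    # later forward with different B or L: shapes no longer match
    return SX + cached_SX

forward(torch.randn(2, 10, d))      # B=2, L=10 -> cached
try:
    forward(torch.randn(3, 7, d))   # B=3, L=7
except RuntimeError as e:
    print("size mismatch:", e)
```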

Thank you!

mutiann commented 1 month ago

Also, may I know the learning-rate schedule used in the experiments? According to train_glue.sh, 480 LR warmup steps are used for MNLI, but the LR warmup for the other experiments is not mentioned in the paper.
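
For reference, this is the schedule I assume from train_glue.sh: linear warmup over 480 steps for MNLI followed by linear decay, via Hugging Face's get_linear_schedule_with_warmup. The peak LR and total step count below are placeholders, not values from the paper; please correct me if the other tasks use something different.

```python
import torch
from transformers import get_linear_schedule_with_warmup

model = torch.nn.Linear(16, 3)                       # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)  # peak LR is a guess

num_warmup_steps = 480        # value from train_glue.sh for MNLI
num_training_steps = 30000    # hypothetical total; depends on epochs/batch size

scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=num_warmup_steps,
    num_training_steps=num_training_steps,
)

# LR ramps linearly from 0 to the peak over the first 480 steps,
# then decays linearly to 0 by the end of training.
for step in range(5):
    optimizer.step()
    scheduler.step()
```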