Difference between moderate and larger models, with and without the TSA module

Hello, first I would like to thank you for sharing your great work! I have some questions regarding the number of channels used in your models.

My questions are:

1) Regardless of whether it is the moderate (M) or larger (L) model, what is the purpose of having versions both without and with the TSA module?

In your code, you provide training options for the moderate model (64 channels) both without and with the TSA module, and I saw that the moderate SR model with the TSA module is trained starting from the pretrained weights of the model without TSA. Also, in Section 4.1 of the paper (Training Datasets and Details), there is the sentence "We initialize deeper networks by parameters from shallower ones for faster convergence."

Is the purpose of the model without TSA to initialize the weights of the complete model with TSA, or is it only for ablation studies that confirm the effect of the TSA module?
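
To make the first interpretation concrete, here is a minimal sketch of what I understand by "initializing the model with TSA from the model without TSA". It uses plain Python dicts as stand-ins for state dicts, and the parameter names (`pcd_align.weight`, `tsa_fusion.weight`, `recon.weight`) and the helper `init_from_pretrained` are hypothetical, not taken from your repository. In PyTorch I assume this would correspond to something like `load_state_dict(..., strict=False)`.

```python
def init_from_pretrained(target_state, pretrained_state):
    """Copy every pretrained entry whose name and length match the target.

    Returns the merged state and the list of target keys that were not
    found in the pretrained state (they keep their fresh initialization).
    """
    merged = dict(target_state)
    missing = []
    for name, value in target_state.items():
        src = pretrained_state.get(name)
        if src is not None and len(src) == len(value):
            merged[name] = list(src)  # reuse the pretrained parameters
        else:
            missing.append(name)      # e.g. the newly added TSA parameters
    return merged, missing


# Hypothetical parameter names; real checkpoints hold tensors, not lists.
without_tsa = {
    "pcd_align.weight": [0.5, -0.2],
    "recon.weight": [0.1, 0.3],
}
with_tsa = {
    "pcd_align.weight": [0.0, 0.0],
    "tsa_fusion.weight": [0.0, 0.0],
    "recon.weight": [0.0, 0.0],
}

merged, missing = init_from_pretrained(with_tsa, without_tsa)
print(missing)
```

If this is the intended use, only the TSA parameters would start from scratch, while the alignment and reconstruction parameters would start from the weights of the model without TSA.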

2) What are the different purposes of the moderate (M) and larger (L) models?

In Section 3.4 of the paper (Two-Stage Restoration), there is the sentence "Specifically, a similar but shallower EDVR network is cascaded to refine the output frames of the first stage".

Is the larger model (L, deeper) used for the first-stage training and the moderate model (M, shallower) used for the second-stage training?

Or is the moderate model only a simpler version for demonstration, while the larger model is what you actually used for stage 1 training? In that case, is the difference between the stage 1 and stage 2 models that the number of RBs in the reconstruction module is 40 and 20 respectively, as mentioned in Section 4.1 of the paper (Training Details)? And the number of RBs in the PCD alignment module is 5 in both the stage 1 and stage 2 models, right?

Thank you!
