aaronkl opened 2 weeks ago
During the fix, it would be good to move the `sub_network_head_size` and `sub_network_query_groups` fields from the `CausalSelfAttention` blocks to `GPT`. `n_query_groups` and `head_size` have different values for subnetworks, because they are computed in `CausalSelfAttention` when calling `set_sub_network` (based on `n_head` etc.). Then, when extracting the subnet, you have to recompute them, as they differ from the supernet config values. Setting the fields in `GPT` would avoid the duplicated computation.
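A minimal sketch of what this refactor could look like, assuming simplified derivation rules and stand-in class shapes (the attribute names `sub_network_head_size` / `sub_network_query_groups` come from the description above; everything else is illustrative, not the actual whittle/LitGPT code):

```python
from dataclasses import dataclass


@dataclass
class Config:  # stand-in for the LitGPT config
    n_embd: int = 768
    n_head: int = 12
    n_query_groups: int = 4


class CausalSelfAttention:
    def set_sub_network(self, head_size: int, n_query_groups: int) -> None:
        # Values arrive precomputed from GPT; no duplicated derivation here.
        self.sub_network_head_size = head_size
        self.sub_network_query_groups = n_query_groups


class GPT:
    def __init__(self, config: Config, n_layer: int = 2) -> None:
        self.config = config
        self.blocks = [CausalSelfAttention() for _ in range(n_layer)]

    def set_sub_network(self, sub_network_n_head: int) -> None:
        # Assumed derivation rules, for illustration only: compute the
        # subnetwork's attention geometry once, at the GPT level.
        self.sub_network_head_size = self.config.n_embd // sub_network_n_head
        self.sub_network_query_groups = min(self.config.n_query_groups, sub_network_n_head)
        for attn in self.blocks:
            attn.set_sub_network(self.sub_network_head_size, self.sub_network_query_groups)


gpt = GPT(Config())
gpt.set_sub_network(sub_network_n_head=6)
# Extraction can now read gpt.sub_network_head_size and
# gpt.sub_network_query_groups directly, instead of recomputing them
# from the supernet config or the attention blocks.
```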
Makes sense to me. I will only get to this issue on Wednesday. Let me know if anyone wants to take this up before then.
**Is your feature request related to a problem? Please describe.**
Currently, `GPT` has two functions: `set_sub_network`, which expects a list with the number of heads and intermediate sizes per layer, and `select_sub_network`, which expects a configuration with a fixed number of heads and intermediate sizes for all layers. Extracting sub-networks with a flexible number of heads or intermediate sizes per layer is not supported by LitGPT, so we cannot evaluate these models on downstream tasks with LM-Eval. As a result, we may drop support for this feature and simplify the code instead.
**Describe the solution you'd like**
Remove `set_sub_network` and only support a fixed number of heads / intermediate sizes for all layers.

**Additional context**
Related to #137