aaronkl opened 2 weeks ago
During the fix, it would be good to move the `sub_network_head_size` and `sub_network_query_groups` fields from the `CausalSelfAttention` blocks to `GPT`. `n_query_groups` and `head_size` have different values for subnetworks, because they are computed in `CausalSelfAttention` when calling `set_sub_network` (based on `n_head` etc.). Then, when extracting the subnet, you have to recompute them, as they differ from the supernet config values. Setting the fields in `GPT` would avoid the duplicated computation.
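A minimal sketch of what this refactor could look like, assuming simplified derivation rules and stand-in class shapes (the attribute names `sub_network_head_size` / `sub_network_query_groups` come from the description above; everything else is illustrative, not the actual whittle/LitGPT code):

```python
from dataclasses import dataclass


@dataclass
class Config:  # stand-in for the LitGPT config
    n_embd: int = 768
    n_head: int = 12
    n_query_groups: int = 4


class CausalSelfAttention:
    def set_sub_network(self, head_size: int, n_query_groups: int) -> None:
        # Values arrive precomputed from GPT; no duplicated derivation here.
        self.sub_network_head_size = head_size
        self.sub_network_query_groups = n_query_groups


class GPT:
    def __init__(self, config: Config, n_layer: int = 2) -> None:
        self.config = config
        self.blocks = [CausalSelfAttention() for _ in range(n_layer)]

    def set_sub_network(self, sub_network_n_head: int) -> None:
        # Assumed derivation rules, for illustration only: compute the
        # subnetwork's attention geometry once, at the GPT level.
        self.sub_network_head_size = self.config.n_embd // sub_network_n_head
        self.sub_network_query_groups = min(self.config.n_query_groups, sub_network_n_head)
        for attn in self.blocks:
            attn.set_sub_network(self.sub_network_head_size, self.sub_network_query_groups)


gpt = GPT(Config())
gpt.set_sub_network(sub_network_n_head=6)
# Extraction can now read gpt.sub_network_head_size and
# gpt.sub_network_query_groups directly, instead of recomputing them
# from the supernet config or the attention blocks.
```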
Makes sense to me. I will only get to this issue on Wednesday. Let me know if anyone wants to take this up before then.
**Is your feature request related to a problem? Please describe.**
Currently, `GPT` has two functions: `set_sub_network`, which expects a list with the number of heads and intermediate sizes per layer, and `select_sub_network`, which expects a configuration with a fixed number of heads and intermediate sizes for all layers. Extracting sub-networks with a flexible number of heads or intermediate sizes per layer is not supported by LitGPT, so we cannot evaluate these models on downstream tasks with LM-Eval. As a result, we may drop support for this feature and simplify the code instead.
**Describe the solution you'd like**
Remove `set_sub_network` and only support a fixed number of heads / intermediate sizes for all layers.

**Additional context**
Related to #137