Propose adding an assert statement in the `MaskedSelfAttention` class to verify that the embedding dimension is compatible with the number of attention heads. If `self.dim` is not divisible by `self.num_heads` without a remainder, subtle and hard-to-detect errors can surface inside the `nn.Module` later, when the `TransformerLayer` class is called.
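A minimal sketch of where the check could live, assuming a typical constructor signature of `(dim, num_heads)`; the rest of the class body is elided, and the exact attribute names are assumptions:

```python
import torch.nn as nn


class MaskedSelfAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        # Proposed check: fail fast at construction time instead of
        # surfacing as a shape mismatch deep inside TransformerLayer.
        assert dim % num_heads == 0, (
            f"dim ({dim}) must be divisible by num_heads ({num_heads})"
        )
        self.dim = dim
        self.num_heads = num_heads
        self.head_dim = dim // num_heads  # per-head dim, now guaranteed integral
        # ... remaining attention setup (projections, masking, etc.)
```

With this guard, an incompatible configuration raises immediately with a clear message, rather than producing silently truncated head dimensions from the integer division.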