Open xiningnlp opened 2 months ago
Hi @xiningnlp. Thanks for the question!
The reason for the error is that the current implementation of Adam-mini only supports the case where num_attention_heads / num_gpu is an integer. In your case with Qwen 0.5B, num_attention_heads / num_gpu = 14 / 4 = 3.5 is not an integer, which causes the error.
Thanks for mentioning this. We will try to support more flexible choices of num_attention_heads in the future. For now, you can try the following simple tweaks.
- Try num_gpu = 2 or 7. In these cases, num_attention_heads / num_gpu will be an integer.
- Still use num_gpu = 4, but train another model size, such as 1.5B, 7B, etc. All other Qwen models have a num_attention_heads that is a multiple of 4, so they will not raise this error. Qwen 0.5B is actually the only exception whose num_attention_heads is not a multiple of 4.
- If you do not intend to change num_gpu or the architecture, you can add the following line after creating the optimizer.
optimizer.wqk_names = {}
This will force Adam-mini to treat Q and K similarly to regular MLP layers, so no head-related partition operations are involved. For SFT, this change usually does not cause performance degradation; still, please tell us if you observe any.
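For illustration, here is a minimal sketch of where the tweak would go. The constructor arguments are assumptions based on the Adam-mini README, and the hyperparameter values are placeholders; check the signature of the version you have installed.

```python
# Sketch only: the Adam_mini constructor arguments below are assumptions taken
# from the Adam-mini README; verify them against your installed version.
from adam_mini import Adam_mini
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B")

optimizer = Adam_mini(
    named_parameters=model.named_parameters(),
    lr=1e-5,                                      # example hyperparameters
    betas=(0.9, 0.999),
    weight_decay=0.1,
    dim=model.config.hidden_size,                 # 896 for Qwen2-0.5B
    n_heads=model.config.num_attention_heads,     # 14 for Qwen2-0.5B
    n_kv_heads=model.config.num_key_value_heads,
)

# Workaround: treat Q/K like regular MLP layers so no head-wise partitioning
# is performed (avoids the num_attention_heads / num_gpu divisibility issue).
optimizer.wqk_names = {}
```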
@zyushun Thanks for your prompt reply. On the other hand, theoretically there is no need to use Adam-mini for a 0.5B SLM, since full-parameter SFT of a 0.5B model does not consume too much GPU memory, right? And practically, I observed no gain in my experiment; was this observation expected?
Hi @xiningnlp. Yes, you are right. For 0.5B models, optimizer memory is not a heavy overhead.
Adam-mini does not save much memory over Adam for 0.5B models. This is because Adam-mini still uses AdamW for the embedding layer, and the embedding layer takes a large proportion of the total parameters in 0.5B models. So it is expected if you did not observe much memory reduction from Adam-mini on a 0.5B model.
Nevertheless, things will be different when the model size increases to >1B. In those cases, the proportion of the embedding layer shrinks to <10%, and the memory gain of Adam-mini becomes significant (you will see a ~50% reduction over Adam).
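As a rough back-of-the-envelope illustration, assuming Qwen2-0.5B's public config values (vocab_size = 151936, hidden_size = 896) and an approximate total of 0.49B parameters:

```python
# Rough estimate of the embedding layer's share of Qwen2-0.5B's parameters.
# The values below are assumptions from the public Qwen2-0.5B config.
vocab_size = 151_936
hidden_size = 896
total_params = 0.49e9                          # ~0.5B parameters overall

embed_params = vocab_size * hidden_size        # ~136M embedding parameters
embed_fraction = embed_params / total_params   # roughly 28% of the model

# Adam-mini keeps full AdamW state for the embedding, so only the remaining
# ~72% of parameters benefit from the reduced optimizer state at this scale.
print(f"embedding: {embed_params / 1e6:.0f}M params, {embed_fraction:.0%} of total")
```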
Hi all,
I found that Adam-mini 1.0.1 cannot run with 4 shards; it throws an exception related to tensor reshaping:
Having debugged the issue, I found that Adam-mini fails to calculate the "m" value for model.layers.0.self_attn.q_proj.weight.
To understand the above exception, I pasted the config of Qwen2-0.5B as follows:
which is expected!
which is also expected under DeepSpeed ZeRO-3 with 4 cards
Does the above observation indicate that whether Adam-mini can be used in a DeepSpeed environment depends on the number of shards and the hidden-state dimension?
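(For reference, the divisibility constraint described earlier in the thread can be checked up front; the helper below is hypothetical, with head counts taken from this discussion.)

```python
# Hypothetical pre-flight check for the constraint discussed above:
# Adam-mini's head-wise partitioning assumes num_attention_heads % num_gpu == 0.
def head_partition_ok(num_attention_heads: int, num_gpu: int) -> bool:
    return num_attention_heads % num_gpu == 0

print(head_partition_ok(14, 4))  # False -> Qwen2-0.5B (14 heads) on 4 shards fails
print(head_partition_ok(14, 2))  # True  -> 2 shards works
print(head_partition_ok(28, 4))  # True  -> head counts divisible by 4 are fine
```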