LGTM. One possible optimization would be to merge the Megatron broadcast and the pipeline broadcast.
We separate them by purpose: the Megatron broadcast normally stays inside a single machine connected by NVLink, while the pipeline broadcast goes over Ethernet between machines. I'm not sure whether NCCL is smart enough to use the right link for each hop if the two are merged into one broadcast.
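For reference, here is a minimal sketch of how the two sets of ranks could be laid out. It is not the code in this PR. It assumes ranks are assigned node by node and that the Megatron (tensor-parallel) size equals the number of GPUs per node; `build_broadcast_groups` and its parameters are hypothetical names used only for illustration.

```python
def build_broadcast_groups(world_size, gpus_per_node):
    """Return (megatron_groups, pipeline_groups) as lists of rank lists.

    Assumption: ranks are numbered node by node, so ranks
    [n * gpus_per_node, (n + 1) * gpus_per_node) all live on node n.
    """
    if gpus_per_node <= 0 or world_size % gpus_per_node != 0:
        raise ValueError("world_size must be a positive multiple of gpus_per_node")

    num_nodes = world_size // gpus_per_node

    # Megatron groups: consecutive ranks on the same node (intra-node, NVLink).
    megatron_groups = [
        list(range(node * gpus_per_node, (node + 1) * gpus_per_node))
        for node in range(num_nodes)
    ]

    # Pipeline groups: the same local GPU index on every node (inter-node, Ethernet).
    pipeline_groups = [
        list(range(local, world_size, gpus_per_node))
        for local in range(gpus_per_node)
    ]

    return megatron_groups, pipeline_groups


if __name__ == "__main__":
    megatron, pipeline = build_broadcast_groups(world_size=8, gpus_per_node=4)
    print("megatron:", megatron)
    print("pipeline:", pipeline)
```

In a real setup each rank list would typically be passed to `torch.distributed.new_group(ranks=...)`, with every process creating every group in the same order. Keeping the two lists separate means each broadcast only ever spans one kind of link.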