Open zhangjun opened 1 year ago
https://colossalai.org/docs/concepts/paradigms_of_parallelism/ https://www.zhihu.com/question/508671222/answer/2290801813 https://www.cnblogs.com/marsggbo/p/16871789.html
all-reduce
all-gather
Pipeline parallelism: the model is split by layer into several chunks, and each chunk is assigned to a device.
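A minimal sketch of the layer-wise split described above, assuming the model can be treated as an ordered list of layers. The helper name `split_into_stages` and the placeholder layer names are hypothetical and are not taken from Colossal-AI or any other library; real frameworks also schedule micro-batches across the chunks, which is not shown here.

```python
def split_into_stages(layers, num_devices):
    """Split an ordered list of layers into contiguous chunks, one per device."""
    base, extra = divmod(len(layers), num_devices)
    stages, start = [], 0
    for rank in range(num_devices):
        # The first `extra` devices take one additional layer each.
        size = base + (1 if rank < extra else 0)
        stages.append(layers[start:start + size])
        start += size
    return stages


# Placeholder layer names standing in for real modules (assumption).
layers = [f"layer_{i}" for i in range(10)]
stages = split_into_stages(layers, num_devices=4)
for rank, chunk in enumerate(stages):
    print(f"device {rank}: {chunk}")
```

Keeping the chunks contiguous preserves layer order, so activations only need to flow from one device to the next.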
Normalization methods
Batch Norm
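Batch Norm normalizes per channel, so it produces C sets of statistics (μ, σ). For an input feature map of shape [N, H, W, C], the mean and variance are computed over [N, H, W] for each index along C and are then used to normalize that channel.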
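The per-channel statistics can be sketched in NumPy as follows. This is only an illustration under stated assumptions: the function name `batch_norm` and the `eps` value are chosen here, the learnable scale and shift and the running averages used at inference time are omitted, and the input is random data.

```python
import numpy as np


def batch_norm(x, eps=1e-5):
    """Normalize x of shape [N, H, W, C] with one mean/variance per channel."""
    # Reduce over N, H, W; keepdims gives shape [1, 1, 1, C] for broadcasting.
    mean = x.mean(axis=(0, 1, 2), keepdims=True)
    var = x.var(axis=(0, 1, 2), keepdims=True)
    return (x - mean) / np.sqrt(var + eps), mean, var


rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(4, 5, 5, 3))  # N=4, H=W=5, C=3
y, mean, var = batch_norm(x)
print(mean.shape)  # one statistic per channel
```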
Layer Norm
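Layer Norm computes its statistics per sample, so it produces N sets of statistics (μ, σ). For an input feature map of shape [N, H, W, C], the mean and variance are computed over [H, W, C] for each index along N and are then used to normalize that sample.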
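The per-sample statistics can be sketched in NumPy in the same way. As before, this is an illustration under assumptions: the function name `layer_norm` and the `eps` value are chosen here, the learnable scale and shift are omitted, and the input is random data.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    """Normalize x of shape [N, H, W, C] with one mean/variance per sample."""
    # Reduce over H, W, C; keepdims gives shape [N, 1, 1, 1] for broadcasting.
    mean = x.mean(axis=(1, 2, 3), keepdims=True)
    var = x.var(axis=(1, 2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps), mean, var


rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(4, 5, 5, 3))  # N=4, H=W=5, C=3
y, mean, var = layer_norm(x)
print(mean.shape)  # one statistic per sample
```

Because the reduction never crosses the N axis, the result for one sample does not depend on the other samples in the batch.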
Instance Norm
Group Norm