xrsrke / pipegoose

Large scale 4D parallelism pre-training for 🤗 transformers in Mixture of Experts *(still work in progress)*

Sequence Parallelism #22

Open · xrsrke opened 9 months ago

xrsrke commented 9 months ago

Implement distributed attention following the approach in LightSeq, Colossal-AI, or DeepSpeed's sequence parallelism; we have not decided which one yet. The target API:

import torch
from pipegoose.nn.sequence_parallel.attention import DistributedAttention

# embed_dim, num_heads, parallel_context, q, k and v are placeholders
local_attention = torch.nn.MultiheadAttention(embed_dim, num_heads)
attention = DistributedAttention(local_attention, parallel_context)
outputs = attention(q, k, v)

# the sequence-parallel result should match the single-device attention output
local_outputs, _ = local_attention(q, k, v)
assert torch.allclose(outputs, local_outputs)
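
For context, the distributed attention used in LightSeq / DeepSpeed-Ulysses style sequence parallelism keeps each rank's activations sharded along the sequence dimension and uses all-to-all collectives to temporarily re-shard by attention heads, so every rank attends over the full sequence with a subset of heads. Below is a minimal sketch of that pattern, not the pipegoose or DeepSpeed implementation; the names SequenceParallelAttention and _all_to_all_4d are hypothetical, and the wrapped local_attention is assumed to take and return [seq, batch, heads, head_dim] tensors (unlike torch.nn.MultiheadAttention's 3-D interface).

import torch
import torch.distributed as dist

def _all_to_all_4d(x: torch.Tensor, scatter_dim: int, gather_dim: int, group=None) -> torch.Tensor:
    # split `x` along `scatter_dim`, exchange the chunks across ranks,
    # and concatenate the received chunks along `gather_dim`
    world_size = dist.get_world_size(group)
    inputs = [chunk.contiguous() for chunk in x.chunk(world_size, dim=scatter_dim)]
    outputs = [torch.empty_like(chunk) for chunk in inputs]
    dist.all_to_all(outputs, inputs, group=group)
    return torch.cat(outputs, dim=gather_dim)

class SequenceParallelAttention(torch.nn.Module):
    # hypothetical sketch: wrap a local attention module so it runs on sequence-sharded inputs
    def __init__(self, local_attention: torch.nn.Module, group=None):
        super().__init__()
        self.local_attention = local_attention  # assumed to handle [seq, batch, heads, head_dim]
        self.group = group

    def forward(self, q, k, v):
        # each of the P ranks holds tensors of shape [seq_len / P, batch, num_heads, head_dim]
        # 1. all-to-all: gather the full sequence (dim 0), scatter the heads (dim 2)
        q, k, v = (_all_to_all_4d(t, scatter_dim=2, gather_dim=0, group=self.group) for t in (q, k, v))
        # 2. attend over the full sequence with num_heads / P heads per rank
        out = self.local_attention(q, k, v)
        # 3. reverse all-to-all: re-shard the sequence, gather the heads back
        return _all_to_all_4d(out, scatter_dim=0, gather_dim=2, group=self.group)

Whichever implementation we end up following, the wrapper only changes how tensors are laid out across ranks; the attention math stays in the wrapped local module, which is why the output should match the single-device result up to floating-point error.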

TODOs

Reading

3outeille commented 8 months ago

on it