xrsrke / pipegoose

Large scale 4D parallelism pre-training for 🤗 transformers in Mixture of Experts *(still work in progress)*

Sequence Parallelism #22

Open · xrsrke opened 9 months ago

xrsrke commented 9 months ago

Implement distributed attention following the approach in LightSeq, Colossal-AI, or DeepSpeed's sequence parallelism; we have not decided which one yet. The target API:

import torch
from pipegoose.nn.sequence_parallel.attention import DistributedAttention

# embed_dim, num_heads, parallel_context, q, k and v are placeholders
local_attention = torch.nn.MultiheadAttention(embed_dim, num_heads)
attention = DistributedAttention(local_attention, parallel_context)
outputs = attention(q, k, v)

# the sequence-parallel result should match the single-device attention output
local_outputs, _ = local_attention(q, k, v)
assert torch.allclose(outputs, local_outputs)
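
For context, the distributed attention used in LightSeq / DeepSpeed-Ulysses style sequence parallelism keeps each rank's activations sharded along the sequence dimension and uses all-to-all collectives to temporarily re-shard by attention heads, so every rank attends over the full sequence with a subset of heads. Below is a minimal sketch of that pattern, not the pipegoose or DeepSpeed implementation; the names SequenceParallelAttention and _all_to_all_4d are hypothetical, and the wrapped local_attention is assumed to take and return [seq, batch, heads, head_dim] tensors (unlike torch.nn.MultiheadAttention's 3-D interface).

import torch
import torch.distributed as dist

def _all_to_all_4d(x: torch.Tensor, scatter_dim: int, gather_dim: int, group=None) -> torch.Tensor:
    # split `x` along `scatter_dim`, exchange the chunks across ranks,
    # and concatenate the received chunks along `gather_dim`
    world_size = dist.get_world_size(group)
    inputs = [chunk.contiguous() for chunk in x.chunk(world_size, dim=scatter_dim)]
    outputs = [torch.empty_like(chunk) for chunk in inputs]
    dist.all_to_all(outputs, inputs, group=group)
    return torch.cat(outputs, dim=gather_dim)

class SequenceParallelAttention(torch.nn.Module):
    # hypothetical sketch: wrap a local attention module so it runs on sequence-sharded inputs
    def __init__(self, local_attention: torch.nn.Module, group=None):
        super().__init__()
        self.local_attention = local_attention  # assumed to handle [seq, batch, heads, head_dim]
        self.group = group

    def forward(self, q, k, v):
        # each of the P ranks holds tensors of shape [seq_len / P, batch, num_heads, head_dim]
        # 1. all-to-all: gather the full sequence (dim 0), scatter the heads (dim 2)
        q, k, v = (_all_to_all_4d(t, scatter_dim=2, gather_dim=0, group=self.group) for t in (q, k, v))
        # 2. attend over the full sequence with num_heads / P heads per rank
        out = self.local_attention(q, k, v)
        # 3. reverse all-to-all: re-shard the sequence, gather the heads back
        return _all_to_all_4d(out, scatter_dim=0, gather_dim=2, group=self.group)

Whichever implementation we end up following, the wrapper only changes how tensors are laid out across ranks; the attention math stays in the wrapped local module, which is why the output should match the single-device result up to floating-point error.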

TODOs

Reading

3outeille commented 8 months ago

on it