yandex / YaFSDP

YaFSDP: Yet another Fully Sharded Data Parallel
Apache License 2.0

Speed benchmarks vs FSDP2 #3

Open 152334H opened 3 months ago

152334H commented 3 months ago

Are the benchmarks conducted against FSDP or FSDP2?

see speed/memory differences

antony-frolov commented 3 months ago

The benchmarks were conducted against FSDP1. We used an early version of PyTorch 2.3.0 from November 2023 in our experiments.

awgu commented 3 months ago

@antony-frolov Exciting project! Maybe it would help if you published some absolute performance numbers like tokens per second? I think right now I see % speedups only.

(Also, FSDP2 does extra device copies compared to FSDP1/YaFSDP, so we would not really expect FSDP2 to be faster.)

antony-frolov commented 3 months ago

@awgu Thanks! I just added absolute iteration time numbers for all the runs; I hope that helps. Note that the measurements were done in a fairly vanilla distributed training setup (mostly for ease of reproducibility), so the absolute numbers might not look too convincing when compared to frameworks more optimized for LLM pre-training.
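For readers who want tokens per second rather than iteration time, the conversion is simple arithmetic. The following is a minimal sketch, not code from the repository: the function names and the per-device batch size parameter are assumptions made for illustration, and the speedup definition used here (ratio of iteration times minus one) may differ from how the README computes its percentages.

```python
def tokens_per_second(num_devices: int, per_device_batch: int,
                      seq_len: int, iter_time_s: float) -> float:
    """Convert an iteration time into global training throughput.

    Assumes every device processes `per_device_batch` sequences of
    `seq_len` tokens in each iteration (hypothetical parameter names).
    """
    if iter_time_s <= 0:
        raise ValueError("iter_time_s must be positive")
    tokens_per_iter = num_devices * per_device_batch * seq_len
    return tokens_per_iter / iter_time_s


def speedup_percent(baseline_iter_time_s: float, new_iter_time_s: float) -> float:
    """Relative speedup of the new run over the baseline, in percent.

    Defined here as (baseline / new - 1) * 100; this definition is an
    assumption and may not match the one used for the published numbers.
    """
    if baseline_iter_time_s <= 0 or new_iter_time_s <= 0:
        raise ValueError("iteration times must be positive")
    return (baseline_iter_time_s / new_iter_time_s - 1.0) * 100.0
```

For example, `tokens_per_second(256, 1, 2048, iter_time_s)` gives the global throughput for a 256-device run at sequence length 2048, assuming one sequence per device per iteration; substitute the actual batch size and the iteration time reported for the run.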

antony-frolov commented 3 months ago

@awgu Here are the traces of the Llama 2 34B runs on 256 devices with a sequence length of 2048 for both FSDP and YaFSDP (these are the runs we compare in the "Advantages over FSDP" section of the README).

llama-2-34b_256_2048_ya-fsdp.json llama-2-34b_256_2048_fsdp.json
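The traces can be opened in a trace viewer, but a quick textual summary is sometimes handy when comparing two runs. The sketch below assumes the files follow the Chrome Trace Event format (a JSON object with a `traceEvents` list, or a bare list of events, where complete events have `"ph": "X"` and a `dur` field in microseconds). That format is an assumption about these attachments, and the helper names are hypothetical.

```python
import json
from collections import defaultdict


def load_trace(path):
    """Load a trace file from disk (assumed to be Chrome Trace Event JSON)."""
    with open(path) as f:
        return json.load(f)


def total_duration_by_name(trace):
    """Sum the duration (in microseconds) of complete events per event name."""
    # The format allows either {"traceEvents": [...]} or a bare list of events.
    events = trace["traceEvents"] if isinstance(trace, dict) else trace
    totals = defaultdict(float)
    for ev in events:
        # Only complete events ("ph" == "X") carry a "dur" field.
        if ev.get("ph") == "X" and "dur" in ev:
            totals[ev.get("name", "<unnamed>")] += ev["dur"]
    return dict(totals)


def top_events(trace, n=10):
    """Return the n event names with the largest total duration, descending."""
    totals = total_duration_by_name(trace)
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]
```

Running `top_events(load_trace("llama-2-34b_256_2048_fsdp.json"))` and the same call on the YaFSDP trace would give two lists to compare side by side. Note that summed durations across streams and threads can overlap in wall-clock time, so they indicate where time is spent rather than the iteration time itself.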