Are the benchmarks conducted against FSDP or FSDP2?
Open · 152334H opened 3 months ago

152334H: Are the benchmarks conducted against FSDP or FSDP2? I'd like to see the speed/memory differences.
antony-frolov: The benchmarks were conducted against FSDP1; we used an early November 2023 build of PyTorch 2.3.0 in our experiments.
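For reference, FSDP1 here means the original `FullyShardedDataParallel` wrapper class, while FSDP2 is the newer composable `fully_shard` API. A minimal sketch of how each is applied, assuming a `torchrun` launch with an NCCL backend; the toy model and per-layer sharding granularity are illustrative, and the `fully_shard` import path has moved between PyTorch releases:

```python
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group("nccl")  # assumes env vars set by torchrun

def make_model() -> nn.Module:
    # Placeholder model for illustration only.
    return nn.Transformer(d_model=512, num_encoder_layers=4, num_decoder_layers=4)

# FSDP1: the original class-based wrapper.
fsdp1_model = FSDP(make_model(), use_orig_params=True)

# FSDP2: the composable fully_shard API. Prototype import path in the
# 2.3/2.4 era; later releases promote it to torch.distributed.fsdp.fully_shard.
from torch.distributed._composable.fsdp import fully_shard

fsdp2_model = make_model()
for layer in fsdp2_model.encoder.layers:
    fully_shard(layer)    # shard each block so parameters gather per layer
fully_shard(fsdp2_model)  # shard the remaining root parameters
```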
awgu: @antony-frolov Exciting project! Maybe it would help if you published some absolute performance numbers, like tokens per second? Right now I only see percentage speedups.

(Also, FSDP2 does extra device copies compared to FSDP1/YaFSDP, so we would not really expect FSDP2 to be faster.)
antony-frolov: @awgu Thanks! Just added absolute iteration time numbers for all the runs; hope that helps. Though, since the measurements were done in a fairly vanilla distributed training setup (mostly for ease of reproducibility), the absolute numbers might not look too convincing compared to frameworks more heavily optimized for LLM pre-training.
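Converting the published iteration times into the tokens-per-second figure requested above is simple arithmetic. A minimal sketch; the iteration time, batch size, and device count below are hypothetical placeholders, not the benchmark values:

```python
def tokens_per_second(iter_time_s: float, global_batch: int, seq_len: int,
                      num_devices: int) -> tuple[float, float]:
    """Convert one training-iteration time into throughput numbers."""
    total = global_batch * seq_len / iter_time_s   # tokens/s across the cluster
    per_device = total / num_devices               # tokens/s per GPU
    return total, per_device

# Hypothetical example: 256 GPUs, sequence length 2048, a global batch of
# 256 sequences, and 1.2 s per iteration (chosen for illustration only).
total, per_device = tokens_per_second(1.2, global_batch=256, seq_len=2048,
                                      num_devices=256)
print(f"{total:.0f} tokens/s total, {per_device:.0f} tokens/s per device")
```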
antony-frolov: @awgu Here are the traces of the Llama 2 34B runs on 256 devices with a sequence length of 2048, for both FSDP and YaFSDP (these are the runs compared in the "Advantages over FSDP" section of the README):

llama-2-34b_256_2048_ya-fsdp.json
llama-2-34b_256_2048_fsdp.json
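Traces in this Chrome-trace JSON format can be captured with PyTorch's built-in profiler and viewed in chrome://tracing or ui.perfetto.dev. A minimal sketch, assuming a CUDA device and a hypothetical `train_step()`; the output filename is arbitrary:

```python
import torch
from torch.profiler import profile, ProfilerActivity

def train_step():
    # Hypothetical placeholder for one forward/backward/optimizer step.
    a = torch.randn(1024, 1024, device="cuda")
    (a @ a).sum().backward() if a.requires_grad_(True) is not None else None

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(3):
        train_step()

# Write a Chrome-trace JSON in the same format as the attachments above.
prof.export_chrome_trace("llama-2-34b_256_2048_trace.json")
```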