yandex-research / swarm

Official code for "SWARM Parallelism: Training Large Models Can Be Surprisingly Communication-Efficient"

Amazing work! #1

Open tchaton opened 1 year ago

tchaton commented 1 year ago

Dear Swarm Team,

Very interesting work! I will dive into it over the weekend.

BTW, I wondered if you knew about this paper: https://arxiv.org/pdf/2205.05040.pdf

Best, T.C

mryab commented 1 year ago

Hi, thanks for the kind words! Note that the codebase for SWARM in this repo is not fully complete, though I can show you the missing bits in a development branch over a quick call if you want to.

As for the paper you mentioned, is it correct to say that it mainly concerns the DP setting? Based on my limited understanding, it appears to be similar to Local SGD with local clipping for each step without averaging, but I probably need to study this a little bit more. Thanks for bringing this up!

tchaton commented 1 year ago

Hey @mryab,

Yes, I would love to have a call to learn more. My email address is thomas.chaton.ai@gmail.com. Send me an email so we can organise this ;) I would probably need a bit of time to read/understand as much as I can before then.

> As for the paper you mentioned

Yes, this paper is similar to Local SGD with local clipping at each step and no per-step averaging. This kind of technique could reduce communication in the DP dimension by averaging every N steps instead of every step, allowing the models to diverge in the meantime, but it won't improve the PP dimension.
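
For concreteness, here is a minimal sketch of the kind of scheme I have in mind. This is not code from the paper or from this repo; it assumes a PyTorch data-parallel setup with `torch.distributed` already initialized, and the names `local_sgd_step`, `sync_every_n`, and `clip_norm` are purely illustrative:

```python
import torch
import torch.distributed as dist


def local_sgd_step(model, optimizer, loss, step, sync_every_n=8, clip_norm=1.0):
    # Each worker clips and applies its own gradients locally, with no communication.
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), clip_norm)
    optimizer.step()

    # Average parameters across the data-parallel group only every N steps,
    # letting replicas diverge in between to save DP communication.
    if (step + 1) % sync_every_n == 0:
        world_size = dist.get_world_size()
        for param in model.parameters():
            dist.all_reduce(param.data, op=dist.ReduceOp.SUM)
            param.data /= world_size
```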

Best regards, Thomas Chaton.