Open tchaton opened 1 year ago
Hi, thanks for the kind words! Note that the codebase for SWARM in this repo is not fully complete, though I can show you the missing bits in a development branch over a quick call if you want to.
As for the paper you mentioned, is it correct to say that it mainly concerns the DP setting? Based on my limited understanding, it appears to be similar to Local SGD with local clipping for each step without averaging, but I probably need to study this a little bit more. Thanks for bringing this up!
Hey @mryab,
Yes, I would love to have a call to learn more. My email address is thomas.chaton.ai@gmail.com. Send me an email so we can organise this ;) I would probably need a bit of time to read/understand as much as I can before then.
> As for the paper you mentioned
Yes, this paper is similar to Local SGD with local clipping at each step and no averaging. This kind of technique could reduce communication in the DP dimension by averaging every N steps instead of every step, letting the models diverge in the meantime, but it won't help in the PP dimension.
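To make the communication pattern concrete, here is a minimal toy sketch of Local SGD with local clipping and averaging every N steps. It is not taken from the paper or from the SWARM codebase: the names `local_sgd`, `clip`, and `sync_every`, the scalar quadratic objective, and the use of scalar clamping in place of gradient-norm clipping are all assumptions made for illustration.

```python
def clip(g, max_norm):
    """Clamp a scalar gradient to [-max_norm, max_norm] (stand-in for norm clipping)."""
    return max(-max_norm, min(max_norm, g))


def local_sgd(targets, steps, sync_every, lr=0.1, max_norm=1.0):
    """Toy Local SGD on scalar quadratics f_k(w) = 0.5 * (w - targets[k]) ** 2.

    Each entry of `targets` defines one hypothetical data-parallel worker.
    Returns the per-worker weights and the number of averaging rounds.
    """
    weights = [0.0 for _ in targets]  # one model replica per worker
    sync_rounds = 0
    for step in range(1, steps + 1):
        # Local step: each worker uses only its own clipped gradient.
        weights = [w - lr * clip(w - t, max_norm) for w, t in zip(weights, targets)]
        # Replicas may diverge between syncs; communicate only every `sync_every` steps.
        if step % sync_every == 0:
            mean = sum(weights) / len(weights)
            weights = [mean] * len(weights)
            sync_rounds += 1
    return weights, sync_rounds


if __name__ == "__main__":
    for n in (1, 10):
        weights, rounds = local_sgd([1.0, 3.0], steps=100, sync_every=n)
        print(f"sync_every={n}: averaging rounds={rounds}, weights={weights}")
```

Over 100 steps, `sync_every=10` performs 10 averaging rounds instead of 100. That is the data-parallel saving; nothing in this sketch touches the traffic between pipeline stages.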
Best regards, Thomas Chaton.
Dear Swarm Team,
Very interesting work! I will dive into it over the weekend.
BTW, I wondered if you knew about this paper: https://arxiv.org/pdf/2205.05040.pdf
Best, T.C