woodenchild95 / FL-Simulator

PyTorch implementations of some general federated optimization methods.

Problem of landscape and hessian #9

Open harrylee999 opened 8 months ago

harrylee999 commented 8 months ago

Hey, sorry to bother you again (`◕‸◕´+). I have some questions about reproducing the loss landscape and Hessian results in the FedSMOO paper. I use the code from "Visualizing the Loss Landscape of Neural Nets" (NeurIPS 2018) to plot the loss landscape, the code from "Local Learning Matters: Rethinking Data Heterogeneity in Federated Learning" (CVPR 2022) to compute the Hessian top eigenvalue and trace, and this repository to run FedAvg, FedSAM, FedDyn, FedSpeed, and FedSMOO.
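For reference, the top-eigenvalue computation is the usual power iteration on Hessian-vector products, which is what such tools do under the hood. Here is a minimal self-contained sketch (illustrative names, not the exact code I run):

```python
import torch

def top_hessian_eigenvalue(model, criterion, inputs, targets, n_iter=100, tol=1e-4):
    """Estimate the largest Hessian eigenvalue of the loss by power
    iteration on Hessian-vector products (Pearlmutter's trick)."""
    params = [p for p in model.parameters() if p.requires_grad]
    loss = criterion(model(inputs), targets)
    # keep the graph so the gradient can be differentiated again
    grads = torch.autograd.grad(loss, params, create_graph=True)

    # random unit vector with the same shapes as the parameters
    v = [torch.randn_like(p) for p in params]
    norm = torch.sqrt(sum((u * u).sum() for u in v))
    v = [u / norm for u in v]

    eigval = None
    for _ in range(n_iter):
        # Hessian-vector product: Hv = d(g . v)/dp
        gv = sum((g * u).sum() for g, u in zip(grads, v))
        hv = torch.autograd.grad(gv, params, retain_graph=True)
        # Rayleigh quotient v^T H v (v has unit norm)
        new_eigval = sum((h * u).sum() for h, u in zip(hv, v)).item()
        norm = torch.sqrt(sum((h * h).sum() for h in hv))
        v = [h / (norm + 1e-12) for h in hv]
        if eigval is not None and abs(new_eigval - eigval) < tol * (abs(eigval) + 1e-12):
            return new_eigval
        eigval = new_eigval
    return eigval
```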

I find that FedSAM has a lower Hessian value and a flatter loss landscape than FedAvg.

However, I also find that FedDyn produces higher Hessian values than FedAvg, and as a result FedSpeed and FedSMOO also have higher Hessian values than FedSAM.

This is also reflected in the loss landscape: FedSpeed and FedSMOO are both sharper than FedSAM, although they reach a smaller loss at coordinate (0, 0).
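For reference, the landscape is plotted in the Li et al. (2018) style: the loss is evaluated on a grid around the trained weights along two random filter-normalized directions, so coordinate (0, 0) is the trained model. A minimal sketch (illustrative names; single-batch loss for brevity, where the paper averages over the dataset):

```python
import torch

def filter_normalized_direction(model):
    """One random direction, rescaled filter-wise to match the weight norms."""
    d = [torch.randn_like(p) for p in model.parameters()]
    for u, p in zip(d, model.parameters()):
        if p.dim() > 1:
            # rescale each filter (row) to match that filter's weight norm
            for j in range(p.size(0)):
                u[j].mul_(p[j].norm() / (u[j].norm() + 1e-10))
        else:
            u.zero_()  # common choice: ignore bias/BN parameters
    return d

@torch.no_grad()
def loss_surface(model, criterion, inputs, targets, alphas, betas):
    base = [p.detach().clone() for p in model.parameters()]
    d1 = filter_normalized_direction(model)
    d2 = filter_normalized_direction(model)
    surface = torch.zeros(len(alphas), len(betas))
    for i, a in enumerate(alphas):
        for j, b in enumerate(betas):
            for p, p0, u, v in zip(model.parameters(), base, d1, d2):
                p.copy_(p0 + a * u + b * v)
            surface[i, j] = criterion(model(inputs), targets).item()
    for p, p0 in zip(model.parameters(), base):  # restore trained weights
        p.copy_(p0)
    return surface
```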

This seems very anomalous (I also can't sleep well because of this Ծ‸Ծ)

Am I missing something, or is something wrong? Could you provide some details on how to reproduce the results in the FedSMOO paper?

woodenchild95 commented 8 months ago

@harrylee999 That sounds strange. The improvement from plain FedSAM is modest, which has also been verified in the FedSAM work itself; only when the local updates are aligned with each other by its momentum variant MoFedSAM does it achieve nearly SOTA performance. So I can also observe a similar loss landscape between FedSAM and FedAvg. Generally speaking, SAM does not reduce the loss; in fact, the training loss with SAM is usually larger, because a smaller training loss leads to overfitting, while good generalization avoids it.

Another indicator is the Hessian spectrum. The top-1 eigenvalue of the Hessian may sometimes work, but the more practical measure is the distribution of all eigenvalues, because there may still be invalid (inactive) parameters. Smoothness means the loss is smooth on average along each direction of the gradient. You'd better provide the full distribution of the Hessian eigenvalues, or some inexact alternative, for analysis.
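If it helps, here is a minimal sketch of estimating the full eigenvalue spectral density, assuming the open-source PyHessian package (which, as far as I know, is what the CVPR 2022 code builds on; treat the exact calls as an assumption). `model`, `criterion`, `inputs`, and `targets` are assumed to exist already:

```python
# Sketch (assumed API): Hessian spectrum via PyHessian
# (https://github.com/amirgholami/PyHessian)
import numpy as np
from pyhessian import hessian

model.eval()
hessian_comp = hessian(model, criterion, data=(inputs, targets), cuda=True)

# top eigenvalue and Hutchinson trace, as reported in the papers
top_eigenvalues, _ = hessian_comp.eigenvalues(top_n=1)
trace = np.mean(hessian_comp.trace())

# full empirical spectral density via stochastic Lanczos quadrature:
# a grid of eigenvalues plus the density weight at each point
density_eigen, density_weight = hessian_comp.density()
```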

Also, could you please provide the details of your implementation? From the description above, I cannot judge which step might be causing the problem. The phenomena on both the loss and the Hessian are strange.