Closed · Lain810 closed this issue 2 years ago
The performance of FedAvg is worse than FedGen simply because the train loader does not shuffle. After fixing this bug, FedGen is no more effective than FedAvg.
Thanks for your interest in this repo. Could you let me know which dataset and which data loader you were referring to?
`userbase.py` line 32, `self.trainloader = DataLoader(train_data, self.batch_size, drop_last=False)`, is wrong and should be corrected to `self.trainloader = DataLoader(train_data, self.batch_size, drop_last=False, shuffle=True)`. This bug reduces the performance of FedAvg.
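To illustrate the difference, here is a standalone PyTorch sketch with a toy dataset (not the repo's data): without `shuffle=True` the loader yields the identical batch sequence every local epoch, while `shuffle=True` reshuffles at the start of each epoch.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-in for a client's train_data.
data = torch.arange(8).float().unsqueeze(1)  # shape (8, 1)
labels = torch.arange(8)
train_data = TensorDataset(data, labels)

# Without shuffle=True the loader visits samples in a fixed order,
# so every local epoch sees the same batch sequence.
loader_fixed = DataLoader(train_data, batch_size=4, drop_last=False)

# The proposed fix: reshuffle the samples at the start of every epoch.
loader_shuffled = DataLoader(train_data, batch_size=4, drop_last=False, shuffle=True)
```

With `loader_fixed`, the first batch is always the first four samples in storage order, which is especially harmful when the data happens to be sorted by label.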
Thanks. I was looking at the same place and re-running the experiments.
After specifically adding shuffle=True for the train dataloader:
https://github.com/zhuangdizhu/FedGen/blob/7c6441b95b8e49443e7a4c597bd8982d1e3ee233/FLAlgorithms/users/userbase.py#L32
I got the updated accuracy for both algorithms
Algorithm: FedAvg, Accuracy = 90.71 %, deviation = 1.17
Algorithm: FedGen, Accuracy = 95.68 %, deviation = 0.20
I am using the following script to generate the results:
python main.py --dataset Mnist-alpha0.1-ratio0.5 --algorithm FedAvg --batch_size 32 --num_glob_iters 200 --local_epochs 20 --num_users 10 --lamda 1 --learning_rate 0.01 --model cnn --personal_learning_rate 0.01 --times 3
python main.py --dataset Mnist-alpha0.1-ratio0.5 --algorithm FedGen --batch_size 32 --num_glob_iters 200 --local_epochs 20 --num_users 10 --lamda 1 --learning_rate 0.01 --model cnn --personal_learning_rate 0.01 --times 3
And for plotting the results:
python main_plot.py --dataset Mnist-alpha0.1-ratio0.5 --algorithms FedAvg,FedGen --batch_size 32 --local_epochs 20 --num_users 10 --num_glob_iters 200 --plot_legend 1
The current results indicate that FedGen still performs clearly better after setting shuffle=True, although both algorithms improved.
I might need to update the results for the other dataset settings with shuffle=True as well, but based on the current results I expect the new experiments will not change the main conclusion of the paper.
Not handling the BN layer parameters correctly can also have a significant impact on FedAvg. You can try LeNet without BN layers, the most common MNIST baseline, which reaches an accuracy of at least 95%.
The testing part of the code is also strange: why is it evaluated on "selected users"? Shouldn't global accuracy be measured on the global test set? The code also seems to run only on the CPU.
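The BN-free LeNet baseline mentioned above could be sketched as follows in PyTorch. Layer sizes here follow the classic LeNet-5 recipe for 28x28 MNIST inputs and may not match this repo's `cnn` model:

```python
import torch
import torch.nn as nn

# LeNet-style MNIST model with no BatchNorm layers, so FedAvg has no
# running statistics to aggregate across clients.
class LeNet(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5, padding=2),  # 28x28 -> 28x28
            nn.ReLU(),
            nn.MaxPool2d(2),                            # -> 14x14
            nn.Conv2d(6, 16, kernel_size=5),            # -> 10x10
            nn.ReLU(),
            nn.MaxPool2d(2),                            # -> 5x5
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 5 * 5, 120),
            nn.ReLU(),
            nn.Linear(120, 84),
            nn.ReLU(),
            nn.Linear(84, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

out = LeNet()(torch.zeros(2, 1, 28, 28))  # one logit vector per image
```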
> Not handling the BN layer parameters correctly can also have a significant impact on FedAvg. You can try LeNet without BN layers, the most common MNIST baseline, which reaches an accuracy of at least 95%.
Thanks for the pointer to the benchmark model. Handling BN layers is another intriguing topic in FL. In this specific project, we were mainly focusing on data heterogeneity and hence applied the same architecture to all algorithms. FL without BN layers may require changes to the FedGen generator configuration as well, which is something I would explore once I have more bandwidth.
> The testing part of the code is also strange: why is it evaluated on "selected users"? Shouldn't global accuracy be measured on the global test set? The code also seems to run only on the CPU.
To make it clear, we tested on the ENTIRE test dataset with selected = False, https://github.com/zhuangdizhu/FedGen/blob/0bfd4e1209e23ec843b40f56272d61a16ea86246/FLAlgorithms/servers/serverbase.py#L224
which means all users are selected for evaluation. We UNIFORMLY and EVENLY distribute the TEST dataset across all users, so it is equivalent to evaluating on the global test set. This is the same strategy adopted by other FL implementations, e.g. https://github.com/CharlieDinh/pFedMe.
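The equivalence can be checked with a quick sketch (the names below are illustrative, not from the repo): when the test set is partitioned across users, summing per-user correct counts and dividing by the total sample count is exactly the accuracy on the full test set.

```python
# Per-user evaluation: each user holds a disjoint slice of the test set.
def global_accuracy(per_user_correct, per_user_total):
    # Aggregating correct counts over all users recovers global accuracy,
    # regardless of how the test set was partitioned.
    return sum(per_user_correct) / sum(per_user_total)

correct = [95, 90, 85]    # correct predictions per user
totals = [100, 100, 100]  # even shares of the test set
acc = global_accuracy(correct, totals)  # -> 0.9, same as (95+90+85)/300
```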