zhuangdizhu / FedGen

Code and data accompanying the FedGen paper

Trainloader is not shuffled #7

Closed Lain810 closed 2 years ago

Lain810 commented 2 years ago

FedAvg underperforms FedGen simply because the trainloader is not shuffled. After fixing this bug, FedGen is no more effective than FedAvg.

zhuangdizhu commented 2 years ago

FedAvg underperforms FedGen simply because the trainloader is not shuffled. After fixing this bug, FedGen is no more effective than FedAvg.

Thanks for your interest in this repo. Could you let me know which dataset and which data loader you were referring to?

Lain810 commented 2 years ago

userbase.py line 32, 'self.trainloader = DataLoader(train_data, self.batch_size, drop_last=False)', is wrong and should be corrected to 'self.trainloader = DataLoader(train_data, self.batch_size, drop_last=False, shuffle=True)'. This bug reduces the performance of FedAvg.
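For reference, the behavioral difference that one-line fix addresses can be sketched without PyTorch. `make_loader` below is a toy stand-in for `DataLoader`, not the repo's code: with shuffle=False, every epoch visits the samples in the same fixed order, so each client's gradient steps see identical batches round after round.

```python
import random

def make_loader(data, batch_size, shuffle=False, seed=None):
    """Toy stand-in for a DataLoader: yields batches once per epoch call,
    drawing a fresh permutation of the sample order when shuffle=True."""
    rng = random.Random(seed)
    def epoch():
        order = list(range(len(data)))
        if shuffle:
            rng.shuffle(order)  # new order every epoch
        for i in range(0, len(order), batch_size):
            yield [data[j] for j in order[i:i + batch_size]]
    return epoch

data = list(range(8))
fixed = make_loader(data, batch_size=4)                    # mimics the buggy loader
shuffled = make_loader(data, batch_size=4, shuffle=True, seed=0)

print(list(fixed()))  # [[0, 1, 2, 3], [4, 5, 6, 7]] -- identical every epoch
```

Both loaders still cover every sample each epoch; only the visiting order differs, which is what the shuffle=True fix restores.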

zhuangdizhu commented 2 years ago

Thanks. I was looking at the same place and re-running the experiments.

After explicitly adding shuffle=True to the train dataloader here: https://github.com/zhuangdizhu/FedGen/blob/7c6441b95b8e49443e7a4c597bd8982d1e3ee233/FLAlgorithms/users/userbase.py#L32 I got the updated accuracy for both algorithms:

Algorithm: FedAvg, Accuracy = 90.71 %, deviation = 1.17
Algorithm: FedGen, Accuracy = 95.68 %, deviation = 0.20

I am using the following script to generate the results:

python main.py --dataset Mnist-alpha0.1-ratio0.5 --algorithm FedAvg --batch_size 32 --num_glob_iters 200 --local_epochs 20 --num_users 10 --lamda 1 --learning_rate 0.01 --model cnn --personal_learning_rate 0.01 --times 3

python main.py --dataset Mnist-alpha0.1-ratio0.5 --algorithm FedGen --batch_size 32 --num_glob_iters 200 --local_epochs 20 --num_users 10 --lamda 1 --learning_rate 0.01 --model cnn --personal_learning_rate 0.01 --times 3

And for plotting the results:

python main_plot.py --dataset Mnist-alpha0.1-ratio0.5 --algorithms FedAvg,FedGen --batch_size 32 --local_epochs 20 --num_users 10 --num_glob_iters 200 --plot_legend 1

The current results indicate that FedGen still performs clearly better after setting shuffle=True, although both algorithms improved.

I might need to update the results for the other dataset settings with shuffle=True as well, but based on the current result, I expect the new experiments will not change the main conclusion of the paper.

Lain810 commented 2 years ago

Not handling the BN layer parameters correctly can also significantly hurt FedAvg. You can try LeNet without BN layers, the most common MNIST baseline, which reaches an accuracy of at least 95%.

Lain810 commented 2 years ago

The testing part of the code is also strange: why is it evaluated on the "selected users"? Shouldn't global accuracy be measured on the global test set? Also, the code seems to run only on the CPU.

zhuangdizhu commented 2 years ago

Not handling the BN layer parameters correctly can also significantly hurt FedAvg. You can try LeNet without BN layers, the most common MNIST baseline, which reaches an accuracy of at least 95%.

Thanks for the pointer to the benchmark model. Handling BN layers is another intriguing topic in FL. In this specific project, we were mainly focusing on data heterogeneity and hence applied the same architecture to all algorithms. FL without BN layers may require changes to the FedGen generator configuration as well, which is something I would explore once I have more bandwidth.
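Why BN layers are a point of contention in FedAvg can be seen in a minimal averaging sketch. The `fedavg` helper and the client dicts below are illustrative, not the repo's implementation: plain FedAvg averages every entry of the state dict, including BN running statistics, and averaging per-client running means/variances computed on heterogeneous data is exactly the questionable step that a BN-free model such as plain LeNet avoids.

```python
def fedavg(states, weights):
    """Weighted average of per-client parameter dicts (plain FedAvg).
    Note that BN buffers like 'bn.running_mean' get averaged just like
    trainable weights, which can be problematic under non-IID data."""
    total = sum(weights)
    keys = states[0].keys()
    return {
        k: sum(w * s[k] for s, w in zip(states, weights)) / total
        for k in keys
    }

# Two hypothetical clients: one conv weight plus one BN running mean,
# where the running means differ sharply due to heterogeneous local data.
client_a = {"conv.weight": 1.0, "bn.running_mean": 0.0}
client_b = {"conv.weight": 3.0, "bn.running_mean": 10.0}

avg = fedavg([client_a, client_b], weights=[1, 1])
print(avg)  # {'conv.weight': 2.0, 'bn.running_mean': 5.0}
```

The averaged running mean of 5.0 matches neither client's local statistics, which is one reason BN handling gets special treatment in some FL methods.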

zhuangdizhu commented 2 years ago

The testing part of the code is also strange: why is it evaluated on the "selected users"? Shouldn't global accuracy be measured on the global test set? Also, the code seems to run only on the CPU.

To be clear, we evaluate on the ENTIRE test dataset with selected = False: https://github.com/zhuangdizhu/FedGen/blob/0bfd4e1209e23ec843b40f56272d61a16ea86246/FLAlgorithms/servers/serverbase.py#L224

which means all users are selected for evaluation. We UNIFORMLY and EVENLY distributed the TEST dataset across all users, so this is equivalent to evaluating on the global test dataset. The same strategy is adopted by other FL implementations, e.g. https://github.com/CharlieDinh/pFedMe.
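That equivalence can be verified with a small sketch (the test data and helper names below are illustrative): as long as the test set is distributed uniformly over users and the per-user correct counts and sample counts are summed before dividing, the aggregate equals the accuracy computed on the undivided global test set.

```python
def split_evenly(items, n_users):
    """Deal items round-robin so every user gets a uniform share."""
    return [items[u::n_users] for u in range(n_users)]

def count_correct(pairs):
    """Return (num correct, num samples) for (prediction, label) pairs."""
    return sum(1 for pred, label in pairs if pred == label), len(pairs)

# Hypothetical global test set of (prediction, label) pairs.
global_test = [(i % 3, i % 2) for i in range(100)]
g_correct, g_total = count_correct(global_test)

# Distribute the test set uniformly over 10 users, evaluate per user,
# then aggregate counts across users before computing the ratio.
shards = split_evenly(global_test, 10)
u_correct = sum(count_correct(s)[0] for s in shards)
u_total = sum(count_correct(s)[1] for s in shards)

print(u_correct / u_total == g_correct / g_total)  # True
```

The key point is aggregating counts rather than averaging per-user accuracies; the latter would only coincide with global accuracy when every shard has exactly the same size.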