Closed creaitr closed 4 years ago
Hi,
The difference comes from the hyper-parameters. The clipping thresholds for weights and activations actually need a different learning rate and weight decay from the network weights. We tuned these hyper-parameters a lot to find a better result. In this repo, we just use the same hyper-parameters for both alpha and the network weights.
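For reference, a minimal PyTorch sketch of putting the clipping thresholds into their own optimizer group with a separate learning rate and weight decay. The layer, the parameter name `alpha`, and all hyper-parameter values here are illustrative assumptions, not the repo's or the paper's actual settings:

```python
import torch
import torch.nn as nn

class QuantConv(nn.Module):
    """Toy quantized conv layer: ordinary weights plus a learnable
    clipping threshold `alpha` (PACT-style; the name is an assumption)."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, 3)
        self.alpha = nn.Parameter(torch.tensor(6.0))  # clipping threshold

model = QuantConv()

# Split parameters: clipping thresholds vs. ordinary network weights.
alpha_params = [p for n, p in model.named_parameters() if "alpha" in n]
weight_params = [p for n, p in model.named_parameters() if "alpha" not in n]

# Give the thresholds their own learning rate and (here) no weight decay;
# the exact values are placeholders for the tuned hyper-parameters.
optimizer = torch.optim.SGD(
    [
        {"params": weight_params, "lr": 0.1, "weight_decay": 1e-4},
        {"params": alpha_params, "lr": 0.01, "weight_decay": 0.0},
    ],
    momentum=0.9,
)
```

Using identical settings for both groups (as this repo does) just means collapsing the two dicts into one parameter list.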
Thank you for your clear answer!
Hi, I have a question about the reported accuracy.
For example, you got 70.75% and 66.46% with 5 bits and 2 bits for ResNet-18 on ImageNet, respectively.
In the paper, however, 70.9% and 67.3% are reported for 5 bits and 2 bits.
Can you explain what causes these differences?