Closed yayapa closed 3 years ago
Hi,
thanks for your question. Most PyTorch implementations indeed only perform fake quantization, which cannot reduce memory usage in a pure PyTorch setting. To actually reduce the model size, you need more advanced tools such as TensorRT, a platform for model deployment.
Regarding your second question, the weights stored in the state_dict()
are full-precision. They are quantized during the forward pass.
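To illustrate the point above, here is a minimal sketch of what fake quantization typically looks like (the class name `FakeQuantConv2d`, the bit width, and the symmetric quantization scheme are illustrative assumptions, not the repository's actual QuantConv2d): the parameters held in `state_dict()` remain float32, and the rounding happens only inside `forward()`.

```python
import torch
import torch.nn as nn

class FakeQuantConv2d(nn.Conv2d):
    """Illustrative fake-quantized conv (assumed scheme, not the repo's
    actual QuantConv2d): weights are stored in full precision and are
    quantized on the fly inside forward()."""

    def __init__(self, *args, n_bits=8, **kwargs):
        super().__init__(*args, **kwargs)
        self.n_bits = n_bits

    def forward(self, x):
        # Symmetric uniform quantization, applied per forward pass.
        qmax = 2 ** (self.n_bits - 1) - 1
        scale = (self.weight.detach().abs().max() / qmax).clamp(min=1e-8)
        w_q = torch.clamp((self.weight / scale).round(), -qmax, qmax) * scale
        return self._conv_forward(x, w_q, self.bias)

conv = FakeQuantConv2d(3, 16, 3, n_bits=4)
# The checkpoint still holds float32 weights -- no memory saving on disk.
print(conv.state_dict()["weight"].dtype)  # torch.float32
```

Since the stored tensor is still float32, saving this module's state_dict gives the same file size as the unquantized layer, which is exactly why tools like TensorRT are needed for real compression.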
Thank you for the answer!
Thank you for the great contribution! We are currently experimenting with your implementation of QuantConv2d and trying to integrate it into an object detector, namely RetinaNet. I would therefore like to ask you some questions to validate my assumptions.
Please feel free to point out anything wrong in my assumptions.
Best regards,