Hi, our current implementation is fake quantization: the quantized weights are still stored in FP32 format on the GPU, but their values are restricted to a small set of fixed numbers. Enabling real quantization on the GPU would require a dedicated CUDA implementation for low-bit computation, which is not necessary for researchers who just want to test their algorithms.
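To illustrate the idea, here is a minimal sketch of fake quantization in pure Python. It is not the toolkit's actual code; the function name and the uniform min/max quantization scheme are assumptions for illustration. The point is that the output values remain floats, but are snapped to at most `2**num_bits` evenly spaced levels:

```python
def fake_quantize(weights, num_bits=8):
    """Simulate low-bit quantization: outputs stay floating point,
    but are restricted to 2**num_bits evenly spaced values
    between the min and max of the input (a common uniform scheme).
    Hypothetical helper, not the toolkit's real API."""
    qmin, qmax = 0, 2 ** num_bits - 1
    w_min, w_max = min(weights), max(weights)
    # Guard against a constant tensor, where the range is zero.
    scale = (w_max - w_min) / (qmax - qmin) or 1.0
    # Round onto the integer grid, then map back to float ("dequantize").
    return [round((w - w_min) / scale) * scale + w_min for w in weights]


# Example: with 2 bits, any input collapses onto at most 4 distinct floats.
out = fake_quantize([0.0, 0.3, 0.7, 1.0], num_bits=2)
```

A real quantized kernel would instead keep the integer grid indices and run low-bit arithmetic directly, which is where the custom CUDA work comes in.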