zhutmost / lsq-net

Unofficial implementation of LSQ-Net, a neural network quantization framework
MIT License

How to fix this bug when I try to train resnet20 for cifar-10 #29

Open BodongDu opened 6 months ago

BodongDu commented 6 months ago

```
(flashatt) yangyk@yyk-s1:~/yangyk/NN_CUDA/lsq/lsq-net$ python main.py ./examples/lsq/resnet20_a2w2_cifar10.yaml
/home/yangyk/yangyk/NN_CUDA/lsq/lsq-net <class 'pathlib.PosixPath'>
INFO - Log file for this run: /home/yangyk/yangyk/NN_CUDA/lsq/lsq-net/out/resnet20_a2w2_cifar10_20240531-171941/resnet20_a2w2_cifar10_20240531-171941.log
INFO - TensorBoard data directory: /home/yangyk/yangyk/NN_CUDA/lsq/lsq-net/out/resnet20_a2w2_cifar10_20240531-171941/tb_runs
Files already downloaded and verified
Files already downloaded and verified
/home/yangyk/anaconda3/envs/flashatt/lib/python3.8/site-packages/torch/utils/data/dataloader.py:560: UserWarning: This DataLoader will create 32 worker processes in total. Our suggested max number of worker in current system is 16, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
  warnings.warn(_create_warning_msg(
INFO - Dataset cifar10 size:
          Training Set = 50000 (196)
        Validation Set = 10000 (40)
              Test Set = 10000 (40)
INFO - Created resnet20 model for cifar10 dataset
          Use pre-trained model = True
tensor(8)
Traceback (most recent call last):
  File "main.py", line 120, in <module>
    main()
  File "main.py", line 59, in main
    tbmonitor.writer.add_graph(model, input_to_model=train_loader.dataset[0][0].unsqueeze(0))
  File "/home/yangyk/anaconda3/envs/flashatt/lib/python3.8/site-packages/torch/utils/tensorboard/writer.py", line 841, in add_graph
    graph(model, input_to_model, verbose, use_strict_trace)
  File "/home/yangyk/anaconda3/envs/flashatt/lib/python3.8/site-packages/torch/utils/tensorboard/_pytorch_graph.py", line 337, in graph
    trace = torch.jit.trace(model, args, strict=use_strict_trace)
  File "/home/yangyk/anaconda3/envs/flashatt/lib/python3.8/site-packages/torch/jit/_trace.py", line 794, in trace
    return trace_module(
  File "/home/yangyk/anaconda3/envs/flashatt/lib/python3.8/site-packages/torch/jit/_trace.py", line 1056, in trace_module
    module._c._create_method_from_trace(
  File "/home/yangyk/anaconda3/envs/flashatt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/yangyk/anaconda3/envs/flashatt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1488, in _slow_forward
    result = self.forward(*input, **kwargs)
  File "/home/yangyk/yangyk/NN_CUDA/lsq/lsq-net/model/resnet_cifar.py", line 125, in forward
    out = F.avg_pool2d(out, kernel_size=out.size()[3])
TypeError: avg_pool2d(): argument 'kernel_size' must be tuple of ints, not Tensor
```

How can I fix this bug? It happens when I try to train resnet20 on CIFAR-10.
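A likely cause, going by the traceback: `add_graph` (main.py line 59) runs `torch.jit.trace`, and while tracing, `out.size()[3]` comes back as a 0-dim tensor rather than an `int` (the `tensor(8)` line in the log looks like that value being printed), which `F.avg_pool2d` refuses as `kernel_size`. The sketch below is not lsq-net's code: `PoolHead` is a hypothetical module standing in for the tail of ResNet-20's `forward()`, and it swaps the size-based pooling for `F.adaptive_avg_pool2d(out, 1)`, which averages each channel down to 1x1 without reading the spatial size at all.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PoolHead(nn.Module):
    """Hypothetical stand-in for the end of ResNet-20's forward():
    global average pooling followed by the final linear layer."""

    def __init__(self, channels: int = 64, num_classes: int = 10):
        super().__init__()
        self.fc = nn.Linear(channels, num_classes)

    def forward(self, out: torch.Tensor) -> torch.Tensor:
        # Original line: F.avg_pool2d(out, kernel_size=out.size()[3]).
        # Under torch.jit.trace that size lookup can yield a Tensor, which
        # avg_pool2d rejects. Adaptive pooling to 1x1 needs no size lookup.
        out = F.adaptive_avg_pool2d(out, 1)  # (N, C, H, W) -> (N, C, 1, 1)
        out = torch.flatten(out, 1)          # (N, C, 1, 1) -> (N, C)
        return self.fc(out)


torch.manual_seed(0)
# ResNet-20's last stage yields 64 channels of 8x8 maps for 32x32 CIFAR-10 input.
feats = torch.randn(2, 64, 8, 8)
head = PoolHead().eval()

# In eager mode this matches the original pooling with an integer kernel size.
assert torch.allclose(F.adaptive_avg_pool2d(feats, 1),
                      F.avg_pool2d(feats, kernel_size=8), atol=1e-6)

# torch.jit.trace is what add_graph calls internally (see the traceback).
traced = torch.jit.trace(head, feats)
print(traced(feats).shape)  # torch.Size([2, 10])
```

In lsq-net, the corresponding change would go on line 125 of model/resnet_cifar.py. Casting instead, as in `kernel_size=int(out.size(3))`, should also work, though the tracer may warn about converting a tensor to a Python integer. If the model graph in TensorBoard is not needed, skipping the `add_graph` call on main.py line 59 avoids tracing altogether.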

BodongDu commented 6 months ago

```
INFO - >>>>>>>> Epoch -1 (pre-trained model evaluation)
INFO - Validation: 10000 samples (256 per mini-batch)
8
torch.Size([256, 64])
torch.Size([256, 10])
Traceback (most recent call last):
  File "main.py", line 120, in <module>
    main()
  File "main.py", line 94, in main
    top1, top5, _ = process.validate(val_loader, model, criterion,
  File "/home/yangyk/yangyk/NN_CUDA/lsq/lsq-net/process.py", line 104, in validate
    acc1, acc5 = accuracy(outputs.data, targets.data, topk=(1, 5))
  File "/home/yangyk/yangyk/NN_CUDA/lsq/lsq-net/process.py", line 27, in accuracy
    correct_k = correct[:k].view(-1).float().sum(0, keepdim=True)
RuntimeError: view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(...) instead.
```

I am also getting this second bug.
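The second error is what its message says: `correct[:k]` is not laid out contiguously in memory, so `.view(-1)` cannot flatten it, while `.reshape(-1)` copies when it has to. The sketch below assumes that lsq-net's `accuracy()` in process.py follows the common top-k pattern from the PyTorch ImageNet example (transpose the `topk` indices, then compare them against the targets); only line 27 is confirmed by the traceback, so the rest of the function is an assumption.

```python
import torch


def accuracy(output: torch.Tensor, target: torch.Tensor, topk=(1,)):
    """Top-k accuracy in percent, one 1-element tensor per k.

    Sketch of the common pattern (as in the PyTorch ImageNet example);
    lsq-net's own process.accuracy() may differ in details.
    """
    with torch.no_grad():
        maxk = max(topk)
        batch_size = target.size(0)
        # pred: (N, maxk) indices of the highest scores, transposed to (maxk, N).
        _, pred = output.topk(maxk, dim=1, largest=True, sorted=True)
        pred = pred.t()
        # correct[i, j] is True when the i-th guess for sample j is right.
        # It is built from a transposed tensor, so it is generally not contiguous.
        correct = pred.eq(target.view(1, -1).expand_as(pred))
        res = []
        for k in topk:
            # .reshape() instead of .view(): reshape copies when the memory
            # layout does not allow a view, which avoids the RuntimeError.
            correct_k = correct[:k].reshape(-1).float().sum(0, keepdim=True)
            res.append(correct_k.mul_(100.0 / batch_size))
        return res


outputs = torch.tensor([[0.1, 0.7, 0.2],
                        [0.6, 0.3, 0.1],
                        [0.2, 0.3, 0.5],
                        [0.5, 0.1, 0.4]])
targets = torch.tensor([1, 1, 0, 0])
top1, top2 = accuracy(outputs, targets, topk=(1, 2))
print(top1.item(), top2.item())  # 50.0 75.0
```

In process.py itself, the corresponding one-word change would be on line 27: `correct_k = correct[:k].reshape(-1).float().sum(0, keepdim=True)`.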