Closed qiulesun closed 6 years ago
0.3.1 is way too old. Please install PyTorch master branch > 0.5.0
The versions of Python and torch are updated to 3.6 and 0.4.0 respectively. Following the link you provided, https://www.claudiokuenzler.com/blog/756/install-newer-ninja-build-tools-ubuntu-14.04-trusty#.WxYrvFMvzJw, I installed ninja 1.8.2. However, when I run the quick demo http://hangzh.com/PyTorch-Encoding/experiments/segmentation.html#install-package again, I get another error. How can I solve it? I believe your papers and code can keep me interested in semantic segmentation tasks.
```
root@hh-Z97X-UD3H:/media/hh/0bfd0eaf-cf46-48b3-915a-aa317b67d9ec/PyTorch-Encoding/PyTorch-Encoding-master# python quick_demo.py
Traceback (most recent call last):
  File "/usr/anaconda3/lib/python3.6/site-packages/torch/utils/cpp_extension.py", line 576, in _build_extension_module
    ['ninja', '-v'], stderr=subprocess.STDOUT, cwd=build_directory)
  File "/usr/anaconda3/lib/python3.6/subprocess.py", line 336, in check_output
    **kwargs).stdout
  File "/usr/anaconda3/lib/python3.6/subprocess.py", line 418, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "demo.py", line 2, in
```
This package depends on a slightly higher version than PyTorch 0.4.0. Please follow the instructions to install PyTorch from source: https://github.com/pytorch/pytorch#from-source
In your paper, the sentence "The ground truth labels for SE-loss are generated by "unique" operation finding the categories presented in the given ground-truth segmentation mask." means that every input image has multiple labels. As far as I know, the binary cross entropy loss can handle binary or multi-class tasks rather than multi-label ones.
I didn't get the difference between multi-class and multi-label. Could you please explain in detail? By the way, the NN already has a sigmoid activation.
Multiclass classification means a classification task with more than two classes; e.g., classify a set of images of fruits which may be oranges, apples, or pears. Multiclass classification makes the assumption that each sample is assigned to one and only one label: a fruit can be either an apple or a pear but not both at the same time.
Multilabel classification assigns to each sample a set of target labels. This can be thought as predicting properties of a data-point that are not mutually exclusive, such as topics that are relevant for a document. A text might be about any of religion, politics, finance or education at the same time or none of these.
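The distinction can be made concrete with a small PyTorch sketch (the tensors below are illustrative, not taken from the repo): a multiclass target is a single class index per sample, while a multilabel target is a multi-hot vector handled by a binary cross entropy loss.

```python
import torch
import torch.nn as nn

logits = torch.randn(4, 3)  # batch of 4 samples, 3 classes

# Multiclass: each sample has exactly one class index as its target.
mc_target = torch.tensor([0, 2, 1, 2])
mc_loss = nn.CrossEntropyLoss()(logits, mc_target)

# Multilabel: each sample has a multi-hot vector; classes are not
# mutually exclusive, so each one is an independent binary decision.
ml_target = torch.tensor([[1., 0., 1.],
                          [0., 0., 0.],
                          [1., 1., 1.],
                          [0., 1., 0.]])
ml_loss = nn.BCEWithLogitsLoss()(logits, ml_target)
```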
I note that the NN has a sigmoid activation. My question is whether, in your case, an input image has multiple labels or just one.
The presence of the object categories is indeed a multi-label task. Each category is predicted independently using a binary prediction. I hope it can address your concern.
Please refer to the docs for binary cross entropy loss https://pytorch.org/docs/stable/nn.html?highlight=bceloss#torch.nn.BCELoss
In binary classification, the number of classes equals 2. The object categories in an input image are more than 2 (Figure 2 in the paper). So I don't understand why the binary cross entropy loss is employed and why "Each category is predicted independently using a binary prediction."
Each category is a binary classification problem. For 150 categories, there are 150 individual binary classification problems. I hope this explanation is clear enough. If you still have difficulties, feel free to ask questions in Chinese.
Thank you for your patience. Your explanation is clear. The binary cross entropy loss can handle the multi-label classification task; its target is something like [1,0,0,1,0,...]. Sigmoid, unlike softmax, doesn't give a probability distribution over the NCLASS outputs, but independent probabilities.
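That last point can be checked numerically (the logit values here are just illustrative):

```python
import torch

logits = torch.tensor([2.0, -1.0, 0.5])

# Softmax: a probability distribution over the classes (sums to 1).
softmax_probs = torch.softmax(logits, dim=0)

# Sigmoid: an independent probability per class (need not sum to 1).
sigmoid_probs = torch.sigmoid(logits)
```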
You’re welcome. That is correct.
I am really sorry for disturbing you again. I shouldn't ask a question about installing PyTorch from source, but I have no idea how to solve it. Can you help me figure it out?
System Info:
- How you installed PyTorch (conda, pip, source): source
- Build command you used (if compiling from source): python setup.py install
- OS: Ubuntu 14.04
- PyTorch version: master
- Python version: 3.6
- CUDA/cuDNN version: CUDA 8.0 + cuDNN 5.0
- GPU models and configuration: GTX 1080 Ti
- GCC version (if compiling from source): 4.9.4
- CMake version: 3.7.2

Issue description:

```
3 errors detected in the compilation of "/tmp/tmpxft_00002a14_00000000-7_THCTensorMath.cpp1.ii".
CMake Error at caffe2_gpu_generated_THCTensorMath.cu.o.Release.cmake:279 (message):
  Error generating file
  /media/hh/pytorch_dir/pytorch/build/caffe2/CMakeFiles/caffe2_gpu.dir/__/aten/src/THC/./caffe2_gpu_generated_THCTensorMath.cu.o

make[2]: *** [caffe2/CMakeFiles/caffe2_gpu.dir/__/aten/src/THC/caffe2_gpu_generated_THCTensorMath.cu.o] Error 1
make[1]: *** [caffe2/CMakeFiles/caffe2_gpu.dir/all] Error 2
make: *** [all] Error 2
Failed to run 'bash tools/build_pytorch_libs.sh --use-cuda --use-nnpack --use-mkldnn nccl caffe2 nanopb libshm gloo THD c10d'
```
Try installing the dependencies first:

```
export CMAKE_PREFIX_PATH="$(dirname $(which conda))/../" # [anaconda root directory]

# Install basic dependencies
conda install numpy pyyaml mkl mkl-include setuptools cmake cffi typing
conda install -c mingfeima mkldnn

# Add LAPACK support for the GPU
conda install -c pytorch magma-cuda80 # or magma-cuda90 if CUDA 9
```
You may want to ask on PyTorch repo for further help
Are the models you released (model_zoo.py) all trained with two Context Encoding Modules? Can you detail the MS evaluation in Table 1?
```python
models = {
    'encnet_resnet50_pcontext': get_encnet_resnet50_pcontext,
    'encnet_resnet101_pcontext': get_encnet_resnet101_pcontext,
    'encnet_resnet50_ade': get_encnet_resnet50_ade,
}
```
We only use one Context Encoding Module now, which is more efficient and makes the model compatible with EncNetV2.
Can the released code run on Ubuntu, Mac, and Windows?
It mainly depends on PyTorch. If PyTorch compiles successfully on your system, there won't be a problem. I am using both Mac and Ubuntu. Note that the PyTorch master branch is required.
Does the command for training the model (e.g., CUDA_VISIBLE_DEVICES=0,1,2,3 python train.py --dataset PContext --model EncNet --aux --se-loss --backbone resnet101) train resnet101 from scratch or finetune a pretrained resnet101?
resnet101 is pretrained on ImageNet.
I used the command (CUDA_VISIBLE_DEVICES=0,1,2,3 python train.py --dataset PContext --model EncNet --aux --se-loss) to train the resnet50 model. However, when it reached epoch 12, I stopped it. When I restarted it, I unfortunately found it ran from epoch 0 rather than epoch 12. What should I do to run it from epoch 12?
Please resume by adding the command-line option --resume path/to/checkpoint.pth.tar
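For reference, a minimal sketch of what resuming typically looks like under the hood; the key names ('epoch', 'state_dict', 'optimizer') are my assumption about the checkpoint layout, not taken from the repo:

```python
import torch

def resume(model, optimizer, path):
    # Load the checkpoint and restore both model and optimizer state,
    # then hand back the epoch to continue training from.
    # NOTE: the key names below are assumed, not verified against train.py.
    checkpoint = torch.load(path, map_location="cpu")
    model.load_state_dict(checkpoint["state_dict"])
    optimizer.load_state_dict(checkpoint["optimizer"])
    return checkpoint["epoch"]
```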
Thank you. I have another question: when will a stable PyTorch release such as 0.4.0 meet the requirements for running the released code?
This package won't be compatible with PyTorch 0.4.0, but it will be compatible with the next stable release.
A question about the selayer: why does the selayer have no sigmoid activation function?
```
(encmodule): EncModule(
  (encoding): Sequential(
    (0): Conv2d(512, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
    (1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): ReLU(inplace)
    (3): Encoding(N x 512=>32x512)
    (4): BatchNorm1d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (5): ReLU(inplace)
    (6): Mean()
  )
  (fc): Sequential(
    (0): Linear(in_features=512, out_features=512, bias=True)
    (1): Sigmoid()
  )
  (selayer): Linear(in_features=512, out_features=59, bias=True)
)
```
That is the prediction layer for minimizing SE-Loss.
The sigmoid function is applied during the loss calculation: https://github.com/zhanghang1989/PyTorch-Encoding/blob/master/encoding/nn/customize.py#L65
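Putting the pieces of this thread together, here is a rough sketch (function and variable names are mine, not the repo's) of how an SE-loss target could be built from a segmentation mask with a unique operation, with the sigmoid applied at loss time:

```python
import torch

def se_target(mask, nclass):
    """Multi-hot vector marking which categories appear in a mask."""
    target = torch.zeros(nclass)
    present = torch.unique(mask)
    target[present[present >= 0]] = 1.0  # skip ignore labels (< 0)
    return target

# The selayer emits raw logits; sigmoid is applied inside the loss.
mask = torch.tensor([[0, 0, 3], [3, 5, 5]])  # toy 2x3 ground-truth mask
target = se_target(mask, nclass=8)
logits = torch.randn(8)
loss = torch.nn.functional.binary_cross_entropy(torch.sigmoid(logits), target)
```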
Sorry for bothering you again. I have no idea how to deal with the following errors when I run CUDA_VISIBLE_DEVICES=0,1 python train.py --dataset pcontext --model encnet --aux --se-loss. import encoding raises similar errors.
- OS: Ubuntu 14.04
- PyTorch version: 0.5.0 (from source)
- Python version: 3.6
- CUDA: 8.0
- cuDNN: 6.0.21
- GPU: 2x GTX 1080
```
/usr/local/anaconda3/bin/python3.6 /media/cv-pc-00/QL_480G/sql/pytorch_dir/PyTorch-Encoding/experiments/segmentation/train.py --dataset PContext --model EncNet --se-loss

Traceback (most recent call last):
  File "/usr/local/anaconda3/lib/python3.6/site-packages/torch/utils/cpp_extension.py", line 742, in _build_extension_module
    ['ninja', '-v'], stderr=subprocess.STDOUT, cwd=build_directory)
  File "/usr/local/anaconda3/lib/python3.6/subprocess.py", line 336, in check_output
    **kwargs).stdout
  File "/usr/local/anaconda3/lib/python3.6/subprocess.py", line 418, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/media/cv-pc-00/QL_480G/sql/pytorch_dir/PyTorch-Encoding/experiments/segmentation/train.py", line 17, in

/usr/local/anaconda3/lib/python3.6/site-packages/encoding/lib/gpu/roi_align_kernel.cu(373): error: class "at::Context" has no member "getCurrentCUDAStream"
/usr/local/anaconda3/lib/python3.6/site-packages/encoding/lib/gpu/roi_align_kernel.cu(420): error: class "at::Context" has no member "getCurrentCUDAStream"
/usr/local/anaconda3/lib/python3.6/site-packages/encoding/lib/gpu/roi_align_kernel.cu(420): error: class "at::Context" has no member "getCurrentCUDAStream"
4 errors detected in the compilation of "/tmp/tmpxft_0000662c_00000000-7_roi_align_kernel.cpp1.ii".

[2/4] /usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=enclib_gpu -I/usr/local/anaconda3/lib/python3.6/site-packages/torch/lib/include -I/usr/local/anaconda3/lib/python3.6/site-packages/torch/lib/include/TH -I/usr/local/anaconda3/lib/python3.6/site-packages/torch/lib/include/THC -I/usr/local/cuda/include -I/usr/local/anaconda3/include/python3.6m --compiler-options '-fPIC' -std=c++11 -c /usr/local/anaconda3/lib/python3.6/site-packages/encoding/lib/gpu/encoding_kernel.cu -o encoding_kernel.cuda.o
FAILED: encoding_kernel.cuda.o
nvcc warning : The 'compute_20', 'sm_20', and 'sm_21' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
/usr/local/anaconda3/lib/python3.6/site-packages/encoding/lib/gpu/encoding_kernel.cu(315): error: class "at::Context" has no member "getCurrentCUDAStream"
/usr/local/anaconda3/lib/python3.6/site-packages/encoding/lib/gpu/encoding_kernel.cu(341): error: class "at::Context" has no member "getCurrentCUDAStream"
/usr/local/anaconda3/lib/python3.6/site-packages/encoding/lib/gpu/encoding_kernel.cu(364): error: class "at::Context" has no member "getCurrentCUDAStream"
/usr/local/anaconda3/lib/python3.6/site-packages/encoding/lib/gpu/encoding_kernel.cu(391): error: class "at::Context" has no member "getCurrentCUDAStream"
4 errors detected in the compilation of "/tmp/tmpxft_00006623_00000000-7_encoding_kernel.cpp1.ii".

[3/4] /usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=enclib_gpu -I/usr/local/anaconda3/lib/python3.6/site-packages/torch/lib/include -I/usr/local/anaconda3/lib/python3.6/site-packages/torch/lib/include/TH -I/usr/local/anaconda3/lib/python3.6/site-packages/torch/lib/include/THC -I/usr/local/cuda/include -I/usr/local/anaconda3/include/python3.6m --compiler-options '-fPIC' -std=c++11 -c /usr/local/anaconda3/lib/python3.6/site-packages/encoding/lib/gpu/syncbn_kernel.cu -o syncbn_kernel.cuda.o
FAILED: syncbn_kernel.cuda.o
nvcc warning : The 'compute_20', 'sm_20', and 'sm_21' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
/usr/local/anaconda3/lib/python3.6/site-packages/encoding/lib/gpu/syncbn_kernel.cu(183): error: class "at::Context" has no member "getCurrentCUDAStream"
/usr/local/anaconda3/lib/python3.6/site-packages/encoding/lib/gpu/syncbn_kernel.cu(217): error: class "at::Context" has no member "getCurrentCUDAStream"
/usr/local/anaconda3/lib/python3.6/site-packages/encoding/lib/gpu/syncbn_kernel.cu(249): error: class "at::Context" has no member "getCurrentCUDAStream"
/usr/local/anaconda3/lib/python3.6/site-packages/encoding/lib/gpu/syncbn_kernel.cu(272): error: class "at::Context" has no member "getCurrentCUDAStream"
4 errors detected in the compilation of "/tmp/tmpxft_00006627_00000000-7_syncbn_kernel.cpp1.ii".
ninja: build stopped: subcommand failed.

Process finished with exit code 1
```
Hi, that is because PyTorch updated its backend API. Change

```
at::Context::getCurrentCUDAStream
```

to

```
cudaStream_t stream = at::cuda::getCurrentCUDAStream();
```

together with

```
#include <ATen/cuda/CUDAContext.h>
```

This will be fixed in the next version.
Thanks for your attention. It works! However, three warnings occur; does that matter?
```
/usr/local/anaconda3/lib/python3.6/site-packages/torch/nn/functional.py:1940: UserWarning: nn.functional.upsample is deprecated. Use nn.functional.interpolate instead.
/usr/local/anaconda3/lib/python3.6/site-packages/torch/nn/functional.py:1025: UserWarning: nn.functional.sigmoid is deprecated. Use torch.sigmoid instead.
/usr/local/anaconda3/lib/python3.6/site-packages/torch/nn/functional.py:52: UserWarning: size_average and reduce args will be deprecated, please use reduction='elementwise_mean' instead.
```
The deprecation warnings are okay for now.
Problem with debugging the backward method of a Function class
Hi, aggregate(A, X, C) and scaledL2(X, C, S) in encoding.functions.encoding.py implement the forward and backward of your custom functions, and I want to debug both passes. PyCharm Community 2018.1.4 on Ubuntu 16.04 LTS (with 2 GTX 1080 GPUs) lets me debug the forward step by step, but I cannot debug the backward function the same way. Could you tell me whether this is possible and how to do it? (P.S.: I face the same problem with my own custom functions based on your code.)
You can directly call the backend function for debugging https://github.com/zhanghang1989/PyTorch-Encoding/blob/master/encoding/functions/encoding.py#L77
For my special case, I want to run the code on one GPU even though my machine has 2 GPUs, for example when debugging. Does the code support single-GPU operation on a machine equipped with 2 GPUs? Is multi-GPU the default when the machine has multiple GPUs?
```
CUDA_VISIBLE_DEVICES=0 python train.py ...
```
Question 1: I use PyCharm Community 2018.1.4 to make debugging easier, and CUDA_VISIBLE_DEVICES=0 --dataset PContext --model EncNet --se-loss is given in the debug configuration. However, I get the error train.py: error: unrecognized arguments: CUDA_VISIBLE_DEVICES=0. What should I do next to debug the code with a single GPU from PyCharm?
```
Connected to pydev debugger (build 181.5087.37)
usage: train.py [-h] [--model MODEL] [--backbone BACKBONE] [--dataset DATASET]
                [--data-folder DATA_FOLDER] [--workers N] [--aux] [--se-loss]
                [--epochs N] [--start_epoch N] [--batch-size N]
                [--test-batch-size N] [--lr LR] [--lr-scheduler LR_SCHEDULER]
                [--momentum M] [--weight-decay M] [--no-cuda] [--seed S]
                [--resume RESUME] [--checkname CHECKNAME] [--model-zoo MODEL_ZOO]
                [--ft] [--pre-class PRE_CLASS] [--ema] [--eval] [--no-val]
                [--test-folder TEST_FOLDER]
train.py: error: unrecognized arguments: CUDA_VISIBLE_DEVICES=0
```
Question 2: args.lr = lrs[args.dataset.lower()] / 16 * args.batch_size in option.py means the LR depends on the batch size you give. So the LR is not fixed but scales with the batch size (GPU memory)? In my experiments I set args.lr = lrs[args.dataset.lower()]; is that reasonable and feasible, and does it respect your paper's intentions?
Question 3: For multi-size evaluation, line 27 of encoding/models/base.py sets base_size=576, crop_size=608 (base_size less than crop_size); should it be base_size=608, crop_size=576? Previously you set base_size=520, crop_size=480, and now you changed them to base_size=576, crop_size=608. A crop_size smaller than base_size seems more reasonable to me. Which settings should I follow to reproduce your results?
I am looking forward to your reply.
Q1: Please use the terminal to launch the program. Q2: That is a standard setting for the LR; when increasing the batch size, people typically increase the LR accordingly. Q3: That is a bug; it will be fixed in the next release.
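If you do want to stay inside an IDE, note that CUDA_VISIBLE_DEVICES is an environment variable, not a program argument. One workaround (my suggestion, not part of the repo) is to set it in the run configuration's environment variables, or in code before CUDA is first touched:

```python
import os

# Must be set before CUDA is initialized by the process.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch  # noqa: E402 -- imported after setting the variable on purpose

# With only device 0 visible, at most one GPU is reported.
n_gpus = torch.cuda.device_count()
```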
For Q2 above, due to limited GPU memory, my batch size unfortunately has to be small (typically less than 16). Does that mean I have to use a smaller LR according to the standard setting, i.e., args.lr = lrs[args.dataset.lower()] / 16 * args.batch_size?
Yes. If the batch size is too small, the model will get worse results, because the working batch size for batch normalization is small.
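As a concrete instance of that linear scaling rule (base_lr here is a made-up placeholder for the dataset-specific value in lrs):

```python
# Linear LR scaling, sketching option.py's rule: the per-dataset base
# LRs are tuned for batch size 16, so smaller batches scale the LR down.
base_lr = 0.001   # hypothetical base LR; the real values live in lrs[...]
batch_size = 8
lr = base_lr / 16 * batch_size  # half the base LR for half the batch
```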
I only have 2 GTX 1080 GPUs with 16 GB of memory in total, so the batch size is less than 16 in my experiments. Can I alleviate this side effect (the worse results you mentioned) by using a larger LR, i.e., setting args.lr = lrs[args.dataset.lower()] independent of the batch size?
The batch size matters for the segmentation task, due to the working batch size of the Synchronized Batch Normalization. A batch size of 16 yields the best performance.
What is the main difference between encoding.nn.BatchNorm1d and encoding.nn.BatchNorm2d?
They are the same as torch.nn.BatchNorm1d and torch.nn.BatchNorm2d respectively.
I have two questions. (1) For the cosine and poly LR schedules, every batch (iteration) has a different LR, rather than all iterations in one epoch sharing the same LR. Is that right? (2) For CIFAR-10 recognition, the scaling factor s_k is not learned but randomly sampled from a uniform distribution between 0 and 1, which differs from the segmentation tasks. Is that right?
I'm sorry for disturbing you again; your work is very encouraging to me. I notice that the scaled_l2 and aggregate operators of the proposed encoding layer are implemented in C++. Since I am not good at it, could you share a corresponding Python implementation, if possible?
We change the LR every iteration. The CIFAR experiment uses a shake-out-like regularization. scaled_l2 and aggregate are easy to implement in Python, but that would be memory consuming.
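For illustration only, here is a memory-hungry but readable pure-PyTorch sketch consistent with the operators' definitions in the paper (shapes and names are my assumptions); the CUDA version avoids materializing the full (B, N, K, D) residual tensor:

```python
import torch

def scaled_l2(X, C, S):
    """Scaled squared L2 distance between features and codewords.
    X: (B, N, D) features, C: (K, D) codewords, S: (K,) scaling factors.
    Returns (B, N, K)."""
    R = X.unsqueeze(2) - C.view(1, 1, *C.shape)    # (B, N, K, D) residuals
    return S.view(1, 1, -1) * R.pow(2).sum(dim=3)

def aggregate(A, X, C):
    """Aggregate residuals with assignment weights A: (B, N, K).
    Returns (B, K, D)."""
    R = X.unsqueeze(2) - C.view(1, 1, *C.shape)    # (B, N, K, D)
    return (A.unsqueeze(3) * R).sum(dim=1)

# Toy shapes: batch 2, 5 descriptors, 4 codewords, 3 feature dims.
B, N, K, D = 2, 5, 4, 3
X, C, S = torch.randn(B, N, D), torch.randn(K, D), torch.rand(K)
E = aggregate(torch.softmax(-scaled_l2(X, C, S), dim=2), X, C)
```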
Question 1: Sorry to ask a stupid question. The augmented PASCAL VOC 2012 has 11533 images in trainval.txt rather than the 10582 used in the paper, which confuses me. I also don't see how the 1464 training images of PASCAL VOC 2012 are augmented to 10582; in other words, I don't understand the relationship between PASCAL VOC 2012 and its augmented version. Could I ask your opinion? If you think this question is not worth answering, I completely understand.
Question 2: As far as I know, Group Norm (https://arxiv.org/pdf/1803.08494.pdf) is independent of batch size, making it well suited to semantic segmentation, which requires small batches due to memory constraints. Could you consider employing it in an updated version?
Q1: For the VOC experiments, we first pretrain on COCO, then finetune on "pascal_aug", and finally on "pascal_voc". I am releasing the training details for reproducing the VOC experiments this weekend. Q2: Group Norm still has inferior performance compared to BN. You can easily use it by changing the code a little.
Question 1: I see base_size=608 and crop_size=576 in the training log of EncNet_ResNet50_ADE (https://raw.githubusercontent.com/zhanghang1989/image-data/master/encoding/segmentation/logs/encnet_resnet50_ade.log); however, base_size and crop_size are set to 520 and 480 respectively in https://github.com/zhanghang1989/PyTorch-Encoding/blob/master/encoding/datasets/base.py#L17. This confuses me. Is base_size=608 and crop_size=576 a special case for ADE20K, with base_size=520 and crop_size=480 used for PASCAL Context and PASCAL VOC 2012? Question 2: Also, is base_size=576 and crop_size=608 in https://github.com/zhanghang1989/PyTorch-Encoding/blob/master/encoding/models/base.py#L27 only for multiscale testing?
There are some bugs in existing code. I am updating them soon.
Question 1: As mentioned above, there are some bugs in the existing code, but I still have a question. EncNet_ResNet50_ADE achieves 79.9 pixAcc and 41.2 mIoU in the last row of the table (https://hangzhang.org/PyTorch-Encoding/experiments/segmentation.html); however, the training log (https://raw.githubusercontent.com/zhanghang1989/image-data/master/encoding/segmentation/logs/encnet_resnet50_ade.log) shows 78.0 pixAcc and 40.2 mIoU, lower than the reported results. Is this because you use a multi-scale testing strategy on the ADE20K val set, or something else?
Hi, I get the error No module named cpp_extension (from torch.utils.cpp_extension import load) when I run the quick demo http://hangzh.com/PyTorch-Encoding/experiments/segmentation.html#install-package. My Python and torch versions are 2.7 and 0.3.1 respectively. How can I handle it?