watertianyi commented 4 years ago

kins_snake
Traceback (most recent call last):
  File "train_net.py", line 54, in <module>
    main()
  File "train_net.py", line 46, in main
    network = make_network(cfg)
  File "/data/hongjq/object_detect/snake/lib/networks/make_network.py", line 23, in make_network
    return imp.load_source(module, path).get_network(cfg)
  File "lib/networks/snake/__init__.py", line 17, in get_network
    network = get_model(num_layers, heads, head_conv, snake_config.down_ratio, cfg.det_dir)
  File "lib/networks/snake/ct_snake.py", line 64, in get_network
    network = Network(num_layers, heads, head_conv, down_ratio, det_dir)
  File "lib/networks/snake/ct_snake.py", line 19, in __init__
    head_conv=head_conv)
  File "lib/networks/snake/dla.py", line 432, in __init__
    self.base = globals()base_name
  File "lib/networks/snake/dla.py", line 315, in dla34
    model.load_pretrained_model(data='imagenet', name='dla34', hash='ba72cf86')
  File "lib/networks/snake/dla.py", line 301, in load_pretrained_model
    model_weights = model_zoo.load_url(model_url)
  File "/data/tools/anaconda3/envs/snake/lib/python3.7/site-packages/torch/hub.py", line 506, in load_state_dict_from_url
    return torch.load(cached_file, map_location=map_location)
  File "/data/tools/anaconda3/envs/snake/lib/python3.7/site-packages/torch/serialization.py", line 529, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/data/tools/anaconda3/envs/snake/lib/python3.7/site-packages/torch/serialization.py", line 709, in _legacy_load
    deserialized_objects[key]._set_from_file(f, offset, f_should_read_directly)
RuntimeError: unexpected EOF, expected 885842 more bytes. The file might be corrupted.
terminate called after throwing an instance of 'c10::Error'
  what(): owning_ptr == NullType::singleton() || owning_ptr->refcount_.load() > 0 INTERNAL ASSERT FAILED at /pytorch/c10/util/intrusive_ptr.h:348, please report a bug to PyTorch. intrusive_ptr: Can only intrusive_ptr::reclaim() owning pointers that were created using intrusive_ptr::release(). (reclaim at /pytorch/c10/util/intrusive_ptr.h:348)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x33 (0x7fe0a0fcc193 in /data/tools/anaconda3/envs/snake/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x18cd59f (0x7fdfd591059f in /data/tools/anaconda3/envs/snake/lib/python3.7/site-packages/torch/lib/libtorch.so)
frame #2: THStorage_free + 0x17 (0x7fdfd60d8ba7 in /data/tools/anaconda3/envs/snake/lib/python3.7/site-packages/torch/lib/libtorch.so)
frame #3: <unknown function> + 0x55d4dd (0x7fe0a82c44dd in /data/tools/anaconda3/envs/snake/lib/python3.7/site-packages/torch/lib/libtorch_python.so)

frame #26: __libc_start_main + 0xe7 (0x7fe0ac795b97 in /lib/x86_64-linux-gnu/libc.so.6)
Aborted (core dumped)

I have already downloaded dla34-ba72cf86.pth and placed it at snake/lib/networks/snake/dla34-ba72cf86.pth. What is this model for?

pengsida commented 4 years ago

The model file was not downloaded completely. dla34-ba72cf86.pth is the pretrained DLA model.
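One way to confirm a truncated download is to hash the file and compare it with the suffix in its name. The sketch below is only illustrative: it assumes the `ba72cf86` suffix is a prefix of the file's SHA-256 digest (the naming convention torch's URL loader can check), which this thread does not confirm for the DLA weights. The helper names are made up for the example.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_name_hash(path):
    """Check a file named '<name>-<hash>.pth' against its own contents.

    Assumption: <hash> is a prefix of the file's SHA-256 digest.
    """
    stem = Path(path).stem                      # e.g. 'dla34-ba72cf86'
    expected = stem.rsplit("-", 1)[-1].lower()  # e.g. 'ba72cf86'
    return sha256_of(path).startswith(expected)
```

If `matches_name_hash("dla34-ba72cf86.pth")` returns False and the naming assumption holds, deleting the cached file and downloading it again is the likely fix.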

watertianyi commented 4 years ago

loading annotations into memory...
Done (t=0.04s)
creating index...
index created!
WARNING: NO MODEL LOADED !!!
loading annotations into memory...
Done (t=0.39s)
creating index...
index created!
loading annotations into memory...
Done (t=0.03s)
creating index...
index created!
Traceback (most recent call last):
  File "train_net.py", line 54, in <module>
    main()
  File "train_net.py", line 50, in main
    train(cfg, network)
  File "train_net.py", line 25, in train
    trainer.train(epoch, train_loader, optimizer, recorder)
  File "/data/hongjq/object_detect/snake/lib/train/trainers/trainer.py", line 32, in train
    for iteration, batch in enumerate(data_loader):
  File "/data/tools/anaconda3/envs/snake/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 345, in __next__
    data = self._next_data()
  File "/data/tools/anaconda3/envs/snake/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 856, in _next_data
    return self._process_data(data)
  File "/data/tools/anaconda3/envs/snake/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 881, in _process_data
    data.reraise()
  File "/data/tools/anaconda3/envs/snake/lib/python3.7/site-packages/torch/_utils.py", line 394, in reraise
    raise self.exc_type(msg)
IndexError: Caught IndexError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/data/tools/anaconda3/envs/snake/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 178, in _worker_loop
    data = fetcher.fetch(index)
  File "/data/tools/anaconda3/envs/snake/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/data/tools/anaconda3/envs/snake/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "lib/datasets/kins/snake.py", line 183, in __getitem__
    decode_box = self.prepare_detection(bbox, poly, ct_hm, cls_id, wh, ct_cls, ct_ind)
  File "lib/datasets/kins/snake.py", line 73, in prepare_detection
    ct_hm = ct_hm[cls_id]
IndexError: index 1 is out of bounds for axis 0 with size 1

I re-downloaded the model separately and modified the code as follows; it now fails with the error above:

def load_pretrained_model(self, data='imagenet', name='dla34', hash='ba72cf86'):
    fc = self.fc
    # Original download logic, commented out:
    # if name.endswith('.pth'):
    #     model_weights = torch.load(data + name)
    # else:
    #     model_url = get_model_url(data, name, hash)
    #     model_weights = model_zoo.load_url(model_url)
    # Load the manually downloaded weights from a hard-coded path instead.
    model_weights = torch.load("/snake/lib/networks/snake/dla34-ba72cf86.pth")
    num_classes = len(model_weights[list(model_weights.keys())[-1]])
    self.fc = nn.Conv2d(
        self.channels[-1], num_classes,
        kernel_size=1, stride=1, padding=0, bias=True)
    self.load_state_dict(model_weights)
    # self.fc = fc

pengsida commented 4 years ago

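I don't recommend modifying the code inside dla. Just put the model file in the directory where PyTorch stores its official models. `IndexError: index 1 is out of bounds for axis 0 with size 1` is probably caused by the number of categories in the annotations not matching the number of channels of ct_hm.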
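Both points can be sketched in a few lines. The first helper lists where torch's URL loader is generally understood to cache downloads: `$TORCH_HOME`, defaulting to `~/.cache/torch`, with a `checkpoints` folder directly under it in older releases and under `hub/` in newer ones. Treat those paths as an assumption and check them against your installed version. The second helper is a hypothetical pre-flight check, not part of the snake code base, that reports class indices which cannot index a heat map with `num_channels` channels.

```python
import os


def torch_checkpoint_dirs():
    """Candidate cache directories for weights downloaded by torch.

    Assumption: layout is $TORCH_HOME/checkpoints (older releases) or
    $TORCH_HOME/hub/checkpoints (newer releases).
    """
    cache = os.getenv("XDG_CACHE_HOME",
                      os.path.join(os.path.expanduser("~"), ".cache"))
    home = os.path.expanduser(os.getenv("TORCH_HOME",
                                        os.path.join(cache, "torch")))
    return [os.path.join(home, "checkpoints"),
            os.path.join(home, "hub", "checkpoints")]


def check_class_ids(class_ids, num_channels):
    """Raise a readable error if a class index cannot index the heat map."""
    bad = sorted({c for c in class_ids if not 0 <= c < num_channels})
    if bad:
        raise ValueError(
            "class ids %s are out of range for a heat map with %d channel(s)"
            % (bad, num_channels))
```

With one heat-map channel, `check_class_ids([0, 1], 1)` raises for class id 1, which is the same mismatch the IndexError reports.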

watertianyi commented 4 years ago

How well does this work? What do ct_loss, wh_loss, ex_loss and py_loss each stand for? Only 500 images are tested each time; can I set the test data to any number of images?

pengsida commented 4 years ago

You can visualize the results to see how well it works.
ct_loss and wh_loss are the detection losses, ex_loss is the extreme point loss, and py_loss is the polygon loss.
Yes, change test.dataset to KinsVal.
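To see how the four terms relate to the reported total, here is a small sketch that parses one training log line from later in this thread. The 0.1 weight on wh_loss is inferred from those logged numbers. It is an assumption, not something confirmed from the snake source.

```python
import re


def parse_losses(line):
    """Extract '<name>loss: <value>' pairs from a training log line."""
    return {k: float(v) for k, v in re.findall(r"(\w*loss): ([0-9.]+)", line)}


def weighted_total(losses, wh_weight=0.1):
    """Combine the terms; wh_weight=0.1 is inferred from the logs."""
    return (losses["ct_loss"] + wh_weight * losses["wh_loss"]
            + losses["ex_loss"] + losses["py_loss"])


line = ("ct_loss: 0.8314 wh_loss: 3.3797 ex_loss: 2.3505 "
        "py_loss: 4.2017 loss: 7.7216")
losses = parse_losses(line)
# The recomputed total should agree with the logged 'loss' up to rounding.
print(round(weighted_total(losses), 4), losses["loss"])
```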

watertianyi commented 4 years ago

@pengsida
loading annotations into memory...
Done (t=0.06s)
creating index...
index created!
WARNING: NO MODEL LOADED !!!
loading annotations into memory...
Done (t=0.31s)
creating index...
index created!
loading annotations into memory...
Done (t=0.02s)
creating index...
index created!
error in modulated_deformable_im2col_cuda: invalid device function
Traceback (most recent call last):
  File "train_net.py", line 56, in <module>
    main()
  File "train_net.py", line 52, in main
    train(cfg, network)
  File "train_net.py", line 27, in train
    trainer.train(epoch, train_loader, optimizer, recorder)
  File "/data/internet/image_segement/snake/lib/train/trainers/trainer.py", line 38, in train
    output, loss, loss_stats, image_stats = self.network(batch)
  File "/home/hjq/anaconda3/envs/snake/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/hjq/anaconda3/envs/snake/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 150, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/home/hjq/anaconda3/envs/snake/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "lib/train/trainers/snake.py", line 19, in forward
    output = self.net(batch['inp'], batch)
  File "/home/hjq/anaconda3/envs/snake/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "lib/networks/snake/ct_snake.py", line 54, in forward
    output, cnn_feature = self.dla(x)
  File "/home/hjq/anaconda3/envs/snake/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "lib/networks/snake/dla.py", line 470, in forward
    x = self.dla_up(x)
  File "/home/hjq/anaconda3/envs/snake/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "lib/networks/snake/dla.py", line 409, in forward
    ida(layers, len(layers) - i - 2, len(layers))
  File "/home/hjq/anaconda3/envs/snake/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "lib/networks/snake/dla.py", line 385, in forward
    layers[i] = node(layers[i] + layers[i - 1])
  File "/home/hjq/anaconda3/envs/snake/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "lib/networks/snake/dla.py", line 356, in forward
    x = self.conv(x)
  File "/home/hjq/anaconda3/envs/snake/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/data/internet/image_segement/snake/lib/networks/dcn_v2.py", line 128, in forward
    self.deformable_groups)
  File "/data/internet/image_segement/snake/lib/networks/dcn_v2.py", line 31, in forward
    ctx.deformable_groups)
RuntimeError: cublas runtime error : the GPU program failed to execute at /pytorch/aten/src/THC/THCBlas.cu:343
Segmentation fault (core dumped)

pengsida commented 4 years ago

Your code used to run. Did you change anything afterwards?

watertianyi commented 4 years ago

It originally ran fine on a machine with two 1080 GPUs. Now, running on a single 2080Ti, it reports this error.

pengsida commented 4 years ago

Has the system environment changed? It looks like the system CUDA version and the CUDA version PyTorch was built with don't match.
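A quick way to compare the two versions is to parse the `release X.Y` field printed by `nvcc --version` and compare it with the string in `torch.version.cuda`. The helper names below are made up for illustration, and the comparison only looks at major and minor numbers.

```python
import re


def parse_nvcc_release(nvcc_output):
    """Return (major, minor) from `nvcc --version` output, or None."""
    m = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    return (int(m.group(1)), int(m.group(2))) if m else None


def parse_major_minor(version):
    """Return (major, minor) from a version string such as '10.1.243'."""
    m = re.match(r"(\d+)\.(\d+)", version or "")
    return (int(m.group(1)), int(m.group(2))) if m else None


def cuda_versions_match(nvcc_output, torch_cuda):
    """True when the system toolkit and the PyTorch build agree."""
    system = parse_nvcc_release(nvcc_output)
    return system is not None and system == parse_major_minor(torch_cuda)
```

In practice `nvcc_output` would come from `subprocess.run(["nvcc", "--version"], capture_output=True, text=True).stdout` and `torch_cuda` from `torch.version.cuda`. A match here does not rule out extensions that were compiled earlier against a different setup.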

watertianyi commented 4 years ago

Both are CUDA 10.1.

watertianyi commented 4 years ago

The system is Ubuntu.

watertianyi commented 4 years ago

@pengsida

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1310      G   /usr/lib/xorg/Xorg                            26MiB |
|    0      1468      G   /usr/bin/gnome-shell                          57MiB |
|    0      2417      G   /usr/lib/xorg/Xorg                           226MiB |
|    0      2531      G   /usr/bin/gnome-shell                         210MiB |
|    0      2964      G   ...pychram/pycharm-2018.3.2/jre64/bin/java    19MiB |
|    0      5219      G   /usr/lib/firefox/firefox                       6MiB |
|    0     18161      G   ...equest-channel-token=537185062394684867    54MiB |
+-----------------------------------------------------------------------------+

The cuDNN version is:

    #define CUDNN_MAJOR 7
    #define CUDNN_MINOR 6
    #define CUDNN_PATCHLEVEL 3
    --
    #define CUDNN_VERSION (CUDNN_MAJOR * 1000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL)

    #include "driver_types.h"

conda environment:

_libgcc_mutex 0.1 main defaults
apex 0.1
blas 1.0 mkl https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free
ca-certificates 2020.7.22 0 defaults
certifi 2020.6.20 py37_0 defaults
cudatoolkit 10.1.243 h6bb024c_0 defaults
cycler 0.10.0
Cython 0.28.2
decorator 4.4.2
freetype 2.10.2 h5ab3b9f_0 defaults
imageio 2.9.0
imgaug 0.2.9
intel-openmp 2020.2 254 defaults
jpeg 9b 0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free
kiwisolver 1.2.0
lcms2 2.11 h396b838_0 defaults
ld_impl_linux-64 2.33.1 h53a641e_7 defaults
libedit 3.1.20191231 h14c3975_1 defaults
libffi 3.3 he6710b0_2 defaults
libgcc-ng 9.1.0 hdf63c60_0 defaults
libpng 1.6.37 hbc83047_0 defaults
libstdcxx-ng 9.1.0 hdf63c60_0 defaults
libtiff 4.1.0 h2733197_1 defaults
lz4-c 1.9.2 he6710b0_1 defaults
matplotlib 3.3.1
mkl 2020.2 256 defaults
mkl-service 2.3.0 py37he904b0f_0 defaults
mkl_fft 1.1.0 py37h23d657b_0 defaults
mkl_random 1.1.1 py37h0573a6f_0 defaults
ncurses 6.2 he6710b0_1 defaults
networkx 2.5
ninja 1.7.2 0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free
numpy 1.19.1 py37hbc911f0_0 defaults
numpy 1.16.4
numpy-base 1.19.1 py37hfa32c7d_0 defaults
olefile 0.46 py_0 defaults
opencv-contrib-python 3.4.2.17
opencv-python 3.4.2.17
openssl 1.1.1g h7b6447c_0 defaults
pillow 7.2.0 py37hb39fc2d_0 defaults
pip 20.2.2 py37_0 defaults
protobuf 3.13.0
pycocotools 2.0.0
pyparsing 2.4.7
python 3.7.7 hcff3b4d_5 defaults
python-dateutil 2.8.1
pytorch 1.4.0 py3.7_cuda10.1.243_cudnn7.6.3_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/pytorch
PyWavelets 1.1.1
PyYAML 5.3.1
readline 8.0 h7b6447c_0 defaults
scikit-image 0.17.2
scipy 1.5.2
setuptools 49.6.0 py37_0 defaults
Shapely 1.7.1
six 1.15.0 py_0 defaults
sqlite 3.33.0 h62c20be_0 defaults
tensorboardX 1.2
termcolor 1.1.0
tifffile 2020.8.25
tk 8.6.10 hbc83047_0 defaults
torch 1.4.0
torchvision 0.5.0 py37_cu101 https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/pytorch
torchvision 0.5.0
tqdm 4.28.1
wheel 0.35.1 py_0 defaults
xz 5.2.5 h7b6447c_0 defaults
yacs 0.1.4
zlib 1.2.11 0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free
zstd 1.4.5 h9ceee32_0 defaults

pengsida commented 4 years ago

I can't tell what the cause is. It may be that the dcn v2 code is incompatible with CUDA 10.1, or something may have gone wrong when you compiled it.

watertianyi commented 4 years ago

@pengsida Solved. The main problem was not the environment. I cloned the code again and re-ran the following steps, after which it runs:

Compile cuda extensions under `lib/csrc`

ROOT=/path/to/snake
cd $ROOT/lib/csrc
export CUDA_HOME="/usr/local/cuda-9.0"
cd dcn_v2
python setup.py build_ext --inplace
cd ../extreme_utils
python setup.py build_ext --inplace
cd ../roi_align_layer
python setup.py build_ext --inplace

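The main cause was that the extensions had earlier been compiled with a different PyTorch version installed, which led to the error, and I had never cleaned up the old build output.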
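If re-cloning is inconvenient, removing the stale build output before recompiling should have a similar effect. The sketch below assumes the usual setuptools behaviour, where `build_ext --inplace` leaves a `build/` directory and `*.so` files next to each `setup.py`. The function names are hypothetical, and `dry_run=True` only lists what would be deleted.

```python
import shutil
from pathlib import Path


def find_stale_artifacts(csrc_root):
    """List build/ directories and compiled *.so files under csrc_root."""
    root = Path(csrc_root)
    build_dirs = [p for p in root.rglob("build") if p.is_dir()]
    # Skip .so files inside build/ directories; they go with the directory.
    shared_objects = [
        p for p in root.rglob("*.so")
        if p.is_file() and not any(b in p.parents for b in build_dirs)
    ]
    return sorted(build_dirs + shared_objects)


def clean(csrc_root, dry_run=True):
    """Delete stale artifacts; with dry_run=True only report them."""
    targets = find_stale_artifacts(csrc_root)
    if not dry_run:
        for path in targets:
            if path.is_dir():
                shutil.rmtree(path)
            elif path.exists():
                path.unlink()
    return targets
```

Running `clean("lib/csrc")` first, then `clean("lib/csrc", dry_run=False)`, and finally the compile steps above gives a rebuild against the currently installed PyTorch.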

watertianyi commented 4 years ago

I get `tensorboard: command not found`. Does this mean TensorFlow needs to be installed? Did you install it?

pengsida commented 4 years ago

I installed tensorboardX==1.2.
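For context, tensorboardX only writes the event files. The `tensorboard` command itself is shipped by the separate `tensorboard` package, so "command not found" usually means that package is missing from the active environment. A minimal check, with a made-up helper name:

```python
import shutil


def tensorboard_cli_path():
    """Return the path of the `tensorboard` executable, or None if absent."""
    return shutil.which("tensorboard")


path = tensorboard_cli_path()
if path is None:
    print("tensorboard CLI not found on PATH in this environment")
else:
    print("tensorboard CLI found at", path)
```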

watertianyi commented 4 years ago

@pengsida
Mine is the same, and I still get the `tensorboard: command not found` message mentioned above. After training for a while, I get the following error:

eta: 0:00:12 epoch: 4 step: 3364 ct_loss: 0.8314 wh_loss: 3.3797 ex_loss: 2.3505 py_loss: 4.2017 loss: 7.7216 data: 0.0043 batch: 0.7364 lr: 0.000100 max_mem: 5861
eta: 0:00:00 epoch: 4 step: 3379 ct_loss: 0.7762 wh_loss: 3.7410 ex_loss: 2.1634 py_loss: 3.5473 loss: 6.8610 data: 0.0044 batch: 0.7224 lr: 0.000100 max_mem: 5861
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 500/500 [00:25<00:00, 19.60it/s]
['ct_loss: 1.0798', 'wh_loss: 5.0305', 'ex_loss: 3.6084', 'py_loss: 5.5685', 'loss: 10.7597']
Loading and preparing results...
DONE (t=0.15s)
creating index...
index created!
Traceback (most recent call last):
  File "train_net.py", line 55, in <module>
    main()
  File "train_net.py", line 51, in main
    train(cfg, network)
  File "train_net.py", line 33, in train
    trainer.val(epoch, val_loader, evaluator, recorder)
  File "/data/internet/image_segement/snake/lib/train/trainers/trainer.py", line 98, in val
    result = evaluator.summarize()
  File "/data/internet/image_segement/snake/lib/evaluators/coco/snake.py", line 72, in summarize
    coco_eval = COCOeval(self.coco, coco_dets, 'segm')
  File "/home/hjq/anaconda3/envs/snake1.4/lib/python3.7/site-packages/pycocotools/cocoeval.py", line 76, in __init__
    self.params = Params(iouType=iouType) # parameters
  File "/home/hjq/anaconda3/envs/snake1.4/lib/python3.7/site-packages/pycocotools/cocoeval.py", line 527, in __init__
    self.setDetParams()
  File "/home/hjq/anaconda3/envs/snake1.4/lib/python3.7/site-packages/pycocotools/cocoeval.py", line 507, in setDetParams
    self.iouThrs = np.linspace(.5, 0.95, np.round((0.95 - .5) / .05) + 1, endpoint=True)
  File "<__array_function__ internals>", line 6, in linspace
  File "/home/hjq/anaconda3/envs/snake1.4/lib/python3.7/site-packages/numpy/core/function_base.py", line 113, in linspace
    num = operator.index(num)
TypeError: 'numpy.float64' object cannot be interpreted as an integer

watertianyi commented 4 years ago

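@pengsida I have solved the TypeError: 'numpy.float64' object cannot be interpreted as an integer.

Solution: install the COCO API from source.

Run the following command:

pip install git+https://github.com/philferriere/cocoapi.git#subdirectory=PythonAPI

If the command above does not work, run the following commands instead:

git clone https://github.com/cocodataset/cocoapi.git
cd cocoapi/PythonAPI/
python setup.py build_ext --inplace
python setup.py build_ext install

Verify that cocoapi was installed successfully

Open a Python prompt and import the package directly. If `import pycocotools` runs without an error, the installation succeeded.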
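The traceback shows why the older pycocotools fails: the sample count passed to `np.linspace` is a float, and `linspace` calls `operator.index(num)`, which rejects floats. The standard-library sketch below reproduces that check and shows that casting the count to `int` first avoids the error. The `linspace` here is a simplified stand-in written for this example, not NumPy's implementation.

```python
import operator


def linspace(start, stop, num):
    """Simplified stand-in for np.linspace (endpoint included)."""
    num = operator.index(num)  # raises TypeError for any float, even 10.0
    if num == 1:
        return [float(start)]
    step = (stop - start) / (num - 1)
    return [start + i * step for i in range(num)]


def iou_threshold_count(lo=0.5, hi=0.95, step=0.05):
    """Number of IoU thresholds, cast to int before use as a count."""
    return int(round((hi - lo) / step)) + 1


thresholds = linspace(0.5, 0.95, iou_threshold_count())

try:
    linspace(0.5, 0.95, round((0.95 - 0.5) / 0.05) + 1.0)  # float count
except TypeError as exc:
    print("float count rejected:", exc)
```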

zju3dv / snake

Error when downloading the pretrained model via model.load_pretrained_model(data='imagenet', name='dla34', hash='ba72cf86')? #87
