Closed watertianyi closed 4 years ago
The model file was not fully downloaded. dla34-ba72cf86.pth is the pretrained DLA model.
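One way to confirm whether a downloaded checkpoint is complete is to compare its SHA-256 digest with the hash embedded in the file name. This is a minimal sketch, assuming the file follows the torch model zoo naming convention `<name>-<sha256 prefix>.pth`; the helper name `checkpoint_is_complete` is hypothetical and not part of the repository.

```python
import hashlib
import re
from pathlib import Path


def checkpoint_is_complete(path):
    """Return True if the file's SHA-256 digest starts with the hash in its name.

    Assumes the naming convention '<name>-<hash prefix>.<ext>', for example
    'dla34-ba72cf86.pth'. Raises ValueError if the name has no hash part.
    """
    path = Path(path)
    match = re.search(r"-([0-9a-f]{8,})\.[A-Za-z0-9]+$", path.name)
    if match is None:
        raise ValueError(f"no hash prefix found in file name: {path.name}")
    expected_prefix = match.group(1)

    # Hash the file in chunks so large checkpoints do not need to fit in memory.
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest().startswith(expected_prefix)
```

If this returns False for dla34-ba72cf86.pth, deleting the file and downloading it again is the likely fix.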
loading annotations into memory...
Done (t=0.04s)
creating index...
index created!
WARNING: NO MODEL LOADED !!!
loading annotations into memory...
Done (t=0.39s)
creating index...
index created!
loading annotations into memory...
Done (t=0.03s)
creating index...
index created!
Traceback (most recent call last):
File "train_net.py", line 54, in <module>
I downloaded the model again separately and modified the code as follows; the error is the one shown above:

    def load_pretrained_model(self, data='imagenet', name='dla34', hash='ba72cf86'):
        # if name.endswith('.pth'):
        #     model_weights = torch.load(data + name)
        # else:
        #     model_url = get_model_url(data, name, hash)
        #     model_weights = model_zoo.load_url(model_url)
        model_weights = torch.load("/snake/lib/networks/snake/dla34-ba72cf86.pth")
        num_classes = len(model_weights[list(model_weights.keys())[-1]])
        self.fc = nn.Conv2d(
            self.channels[-1], num_classes,
            kernel_size=1, stride=1, padding=0, bias=True)
        self.load_state_dict(model_weights)
        # self.fc = fc
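I don't recommend modifying the code inside dla. Just put the model file in the directory where PyTorch stores its officially downloaded models.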
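To find that directory, the candidate cache locations can be computed from the environment. This is a sketch under assumptions: it follows the default lookup order as I understand it (`TORCH_HOME`, then `XDG_CACHE_HOME/torch`, then `~/.cache/torch`), and the subdirectory is `checkpoints` in older PyTorch releases and `hub/checkpoints` in newer ones, so both are listed. The function name `candidate_checkpoint_dirs` is hypothetical.

```python
import os


def candidate_checkpoint_dirs(env=None):
    """List directories where torch's model zoo may cache downloaded weights."""
    env = os.environ if env is None else env
    cache_root = env.get("XDG_CACHE_HOME", "~/.cache")
    torch_home = os.path.expanduser(
        env.get("TORCH_HOME", os.path.join(cache_root, "torch"))
    )
    return [
        os.path.join(torch_home, "checkpoints"),         # older releases (assumed)
        os.path.join(torch_home, "hub", "checkpoints"),  # newer releases (assumed)
    ]


# Report whether the DLA checkpoint is present in either candidate location.
for directory in candidate_checkpoint_dirs():
    target = os.path.join(directory, "dla34-ba72cf86.pth")
    print(target, "exists" if os.path.isfile(target) else "missing")
```

Copying the fully downloaded dla34-ba72cf86.pth into the directory your PyTorch version uses should let `model_zoo.load_url` pick it up without any change to dla.py.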
IndexError: index 1 is out of bounds for axis 0 with size 1
This is probably because the number of categories in the annotations differs from the number of channels in ct_hm.
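A quick way to test this is to count the categories in the annotation file and compare that with the number of ct_hm channels configured for the network. This is a minimal sketch, assuming a COCO-style JSON annotation file with a top-level `categories` list; `count_categories`, `check_heads` and the `ct_hm_channels` argument are placeholder names, not identifiers from the repository.

```python
import json


def count_categories(annotation_path):
    """Return the number of entries in the 'categories' list of a COCO-style file."""
    with open(annotation_path, "r", encoding="utf-8") as f:
        annotations = json.load(f)
    return len(annotations["categories"])


def check_heads(annotation_path, ct_hm_channels):
    """Raise ValueError if the category count differs from the ct_hm channel count."""
    num_categories = count_categories(annotation_path)
    if num_categories != ct_hm_channels:
        raise ValueError(
            f"annotations define {num_categories} categories, "
            f"but ct_hm has {ct_hm_channels} channels"
        )
    return num_categories
```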
How good are these results? What do ct_loss, wh_loss, ex_loss and py_loss each stand for? Only 500 images are tested each time; can I set the test data to any number of images?
test.dataset KinsVal
@pengsida
loading annotations into memory...
Done (t=0.06s)
creating index...
index created!
WARNING: NO MODEL LOADED !!!
loading annotations into memory...
Done (t=0.31s)
creating index...
index created!
loading annotations into memory...
Done (t=0.02s)
creating index...
index created!
error in modulated_deformable_im2col_cuda: invalid device function
Traceback (most recent call last):
File "train_net.py", line 56, in <module>
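Your code used to run. Did you change anything afterwards?
It ran fine before on a machine with two 1080 GPUs; now, running on a single 2080 Ti, it reports this error.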
Has the system environment changed? It looks like the system CUDA version differs from the CUDA version PyTorch was built with.
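The two versions can be compared directly. This is a minimal sketch, assuming `nvcc` is on `PATH` and PyTorch is installed; `parse_nvcc_release` and `report_cuda_versions` are hypothetical helper names. `torch.version.cuda` reports the CUDA version PyTorch was built with.

```python
import re
import subprocess


def parse_nvcc_release(nvcc_output):
    """Extract the 'release X.Y' version string from `nvcc --version` output."""
    match = re.search(r"release\s+(\d+\.\d+)", nvcc_output)
    return match.group(1) if match else None


def report_cuda_versions():
    """Print the toolkit CUDA version next to the one PyTorch was built with."""
    import torch  # imported lazily so the parser above works without PyTorch

    nvcc_output = subprocess.run(
        ["nvcc", "--version"], capture_output=True, text=True, check=True
    ).stdout
    system_cuda = parse_nvcc_release(nvcc_output)
    print("nvcc CUDA:", system_cuda)
    print("PyTorch CUDA:", torch.version.cuda)
    if system_cuda != torch.version.cuda:
        print("Mismatch: extensions built with this nvcc may not load correctly.")
```

Calling `report_cuda_versions()` inside the conda environment used for training shows both values side by side.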
Both are CUDA 10.1.
The system is Ubuntu.
@pengsida
GPU:
Tue Sep 1 09:03:35 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 435.21 Driver Version: 435.21 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce RTX 208... Off | 00000000:01:00.0 On | N/A |
| 35% 44C P8 30W / 250W | 603MiB / 11016MiB | 8% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 1310 G /usr/lib/xorg/Xorg 26MiB |
| 0 1468 G /usr/bin/gnome-shell 57MiB |
| 0 2417 G /usr/lib/xorg/Xorg 226MiB |
| 0 2531 G /usr/bin/gnome-shell 210MiB |
| 0 2964 G ...pychram/pycharm-2018.3.2/jre64/bin/java 19MiB |
| 0 5219 G /usr/lib/firefox/firefox 6MiB |
| 0 18161 G ...equest-channel-token=537185062394684867 54MiB |
+-----------------------------------------------------------------------------+
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Fri_Feb__8_19:08:17_PST_2019
Cuda compilation tools, release 10.1, V10.1.105
The cuDNN version is:
--
conda environment:
_libgcc_mutex 0.1 main defaults
apex 0.1
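I can't tell what the cause is. It may be that the DCN v2 code is incompatible with CUDA 10.1, or something may have gone wrong when you compiled it.
@pengsida Solved. The main problem was not the environment. I cloned the code again and re-ran the following steps, after which it ran: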
lib/csrc
ROOT=/path/to/snake
cd $ROOT/lib/csrc
export CUDA_HOME="/usr/local/cuda-9.0"
cd dcn_v2
python setup.py build_ext --inplace
cd ../extreme_utils
python setup.py build_ext --inplace
cd ../roi_align_layer
python setup.py build_ext --inplace
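The main cause was that the extensions had previously been compiled against a different PyTorch version, which produced the error, and the old build was never cleaned up.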
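Stale build outputs can also be removed without re-cloning. This is a sketch, assuming each extension directory under lib/csrc holds a `build/` folder and compiled `*.so` files left behind by `python setup.py build_ext --inplace`; `clean_extension_dir` and `clean_all` are hypothetical helpers, and the extension names are the ones from the commands above.

```python
import shutil
from pathlib import Path

EXTENSIONS = ["dcn_v2", "extreme_utils", "roi_align_layer"]


def clean_extension_dir(ext_dir):
    """Delete the build/ folder and compiled .so files in one extension directory.

    Returns the list of removed paths so the caller can log them.
    """
    ext_dir = Path(ext_dir)
    removed = []
    build_dir = ext_dir / "build"
    if build_dir.is_dir():
        shutil.rmtree(build_dir)
        removed.append(build_dir)
    for shared_object in sorted(ext_dir.glob("*.so")):
        shared_object.unlink()
        removed.append(shared_object)
    return removed


def clean_all(csrc_root):
    """Clean every known extension under csrc_root, skipping missing directories."""
    removed = []
    for name in EXTENSIONS:
        ext_dir = Path(csrc_root) / name
        if ext_dir.is_dir():
            removed.extend(clean_extension_dir(ext_dir))
    return removed
```

After cleaning, re-run `python setup.py build_ext --inplace` in each directory, inside the environment that has the PyTorch version you intend to train with.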
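I get "tensorboard: command not found". Does this mean tensorflow needs to be installed? Did you install it?
I installed tensorboardX==1.2.
@pengsida
Same here, I get the error mentioned above: tensorboard: command not found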
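For context, tensorboardX only writes event files; the `tensorboard` command itself comes from the separate `tensorboard` package, which is also pulled in when tensorflow is installed. A small sketch for checking whether the command is on PATH; `find_command` is a hypothetical helper name.

```python
import shutil


def find_command(name):
    """Return the full path of an executable on PATH, or None if it is missing."""
    return shutil.which(name)


path = find_command("tensorboard")
if path is None:
    print("tensorboard not found on PATH; try: pip install tensorboard")
else:
    print("tensorboard found at", path)
```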
After training for a while, the following error was reported:
eta: 0:00:12 epoch: 4 step: 3364 ct_loss: 0.8314 wh_loss: 3.3797 ex_loss: 2.3505 py_loss: 4.2017 loss: 7.7216 data: 0.0043 batch: 0.7364 lr: 0.000100 max_mem: 5861
eta: 0:00:00 epoch: 4 step: 3379 ct_loss: 0.7762 wh_loss: 3.7410 ex_loss: 2.1634 py_loss: 3.5473 loss: 6.8610 data: 0.0044 batch: 0.7224 lr: 0.000100 max_mem: 5861
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 500/500 [00:25<00:00, 19.60it/s]
['ct_loss: 1.0798', 'wh_loss: 5.0305', 'ex_loss: 3.6084', 'py_loss: 5.5685', 'loss: 10.7597']
Loading and preparing results...
DONE (t=0.15s)
creating index...
index created!
Traceback (most recent call last):
File "train_net.py", line 55, in <module>
@pengsida Solved: TypeError: 'numpy.float64' object cannot be interpreted as an integer
pip install git+https://github.com/philferriere/cocoapi.git#subdirectory=PythonAPI
git clone https://github.com/cocodataset/cocoapi.git
cd cocoapi/PythonAPI/
python setup.py build_ext --inplace
python setup.py build_ext install
kins_snake
Traceback (most recent call last):
File "train_net.py", line 54, in <module>
main()
File "train_net.py", line 46, in main
network = make_network(cfg)
File "/data/hongjq/object_detect/snake/lib/networks/make_network.py", line 23, in make_network
return imp.load_source(module, path).get_network(cfg)
File "lib/networks/snake/__init__.py", line 17, in get_network
network = get_model(num_layers, heads, head_conv, snake_config.down_ratio, cfg.det_dir)
File "lib/networks/snake/ct_snake.py", line 64, in get_network
network = Network(num_layers, heads, head_conv, down_ratio, det_dir)
File "lib/networks/snake/ct_snake.py", line 19, in __init__
head_conv=head_conv)
File "lib/networks/snake/dla.py", line 432, in __init__
self.base = globals()[base_name](pretrained=pretrained)
File "lib/networks/snake/dla.py", line 315, in dla34
model.load_pretrained_model(data='imagenet', name='dla34', hash='ba72cf86')
File "lib/networks/snake/dla.py", line 301, in load_pretrained_model
model_weights = model_zoo.load_url(model_url)
File "/data/tools/anaconda3/envs/snake/lib/python3.7/site-packages/torch/hub.py", line 506, in load_state_dict_from_url
return torch.load(cached_file, map_location=map_location)
File "/data/tools/anaconda3/envs/snake/lib/python3.7/site-packages/torch/serialization.py", line 529, in load
return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
File "/data/tools/anaconda3/envs/snake/lib/python3.7/site-packages/torch/serialization.py", line 709, in _legacy_load
deserialized_objects[key]._set_from_file(f, offset, f_should_read_directly)
RuntimeError: unexpected EOF, expected 885842 more bytes. The file might be corrupted.
terminate called after throwing an instance of 'c10::Error'
what(): owning_ptr == NullType::singleton() || owning_ptr->refcount_.load() > 0 INTERNAL ASSERT FAILED at /pytorch/c10/util/intrusive_ptr.h:348, please report a bug to PyTorch. intrusive_ptr: Can only intrusive_ptr::reclaim() owning pointers that were created using intrusive_ptr::release(). (reclaim at /pytorch/c10/util/intrusive_ptr.h:348)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x33 (0x7fe0a0fcc193 in /data/tools/anaconda3/envs/snake/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x18cd59f (0x7fdfd591059f in /data/tools/anaconda3/envs/snake/lib/python3.7/site-packages/torch/lib/libtorch.so)
frame #2: THStorage_free + 0x17 (0x7fdfd60d8ba7 in /data/tools/anaconda3/envs/snake/lib/python3.7/site-packages/torch/lib/libtorch.so)
frame #3: <unknown function> + 0x55d4dd (0x7fe0a82c44dd in /data/tools/anaconda3/envs/snake/lib/python3.7/site-packages/torch/lib/libtorch_python.so)