zhengye1995 / Tianchi-2019-Guangdong-Intelligent-identification-of-cloth-defects-rank5

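Tianchi 2019 Guangdong Industrial Intelligent Manufacturing Innovation Competition: fabric defect detection ("the Tianchi waters run too deep"), third-place solution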
399 stars 142 forks

Training error, help! #27

Closed: zhxngjxnhxx closed this issue 2 years ago

zhxngjxnhxx commented 3 years ago

When I run train.sh, the following error occurs: [image]

In particular, this line: RuntimeError: While copying the parameter named bbox_head.0.fc_cls.weight, whose dimensions in the model are torch.Size([16, 1024]) and whose dimensions in the checkpoint are torch.Size([81, 1024]). The error says the dimensions do not match, but I downloaded the weights from the URL given in train.sh. How can I solve this error?

zhengye1995 commented 3 years ago

The fabric dataset has 15+1 categories (15 fabric defects plus 1 background), so the final FC layer of each cascade head has a weight whose first dimension is 16.

The MS-COCO dataset has 80+1 categories, so the corresponding weight dimension is 81.

During training, my code automatically ignores parameters whose dimensions do not match when it loads the concatenated pre-trained weights into the model. So if the versions of your Python packages (pytorch, mmcv, etc.) are the same as mine, this problem will not occur. The checkpoint '.pth' file is a standard PyTorch weight file. If you want to keep your existing packages, you can change the dimensions of the affected parameters (bbox_head.0.fc_cls.weight, bbox_head.1.fc_cls.weight, bbox_head.2.fc_cls.weight, and any others mentioned in the error log) to the target dimensions and save a new weight file to load.
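One way to produce such a weight file is sketched below. It is not the repository's own loading code: instead of resizing the mismatched parameters, it simply drops them so that those layers keep their fresh 16-class initialization. The function names drop_mismatched and convert_checkpoint are hypothetical, and the sketch assumes the weights may be stored under a "state_dict" key, falling back to the top-level dict otherwise.

```python
from collections import OrderedDict


def drop_mismatched(pretrained, model_shapes):
    """Split pretrained entries into those that fit the model and those that don't.

    pretrained   -- mapping of parameter name -> tensor-like object with .shape
    model_shapes -- mapping of parameter name -> shape expected by the model
    Returns (kept, dropped_names).
    """
    kept, dropped = OrderedDict(), []
    for name, tensor in pretrained.items():
        expected = model_shapes.get(name)
        if expected is not None and tuple(tensor.shape) != tuple(expected):
            # e.g. bbox_head.0.fc_cls.weight: [81, 1024] in COCO vs [16, 1024] here
            dropped.append(name)
        else:
            kept[name] = tensor
    return kept, dropped


def convert_checkpoint(src_path, dst_path, model):
    """Save a copy of the checkpoint without the mismatched parameters."""
    import torch  # imported here so drop_mismatched works without PyTorch

    checkpoint = torch.load(src_path, map_location="cpu")
    # Assumption: weights may sit under "state_dict"; otherwise the loaded
    # object itself is treated as the state dict.
    pretrained = checkpoint.get("state_dict", checkpoint)
    model_shapes = {k: v.shape for k, v in model.state_dict().items()}
    kept, dropped = drop_mismatched(pretrained, model_shapes)
    torch.save({"state_dict": kept}, dst_path)
    return dropped
```

The returned list of dropped names should match the parameters named in the error log; if it is longer than expected, the wrong config or checkpoint is probably being used.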

zhxngjxnhxx commented 3 years ago

@zhengye1995 Thanks. Here is my situation: at first I followed the steps in your README exactly and did not modify any files.

But sometimes even setup.py fails to run. I noticed that your command conda install pytorch=1.1.0 torchvision=0.3.0 cudatoolkit=10.0 -c pytorch conflicts with this line: pip install cython && pip --no-cache-dir install -r requirements.txt [image]

In the requirements file, the version numbers are preceded by >=, so pip sometimes installs the latest version of a package. When the latest pytorch or mmcv gets installed, the versions no longer match, which eventually leads to runtime errors. Have you ever run a version compatibility test? In my previous attempts, I pinned only the mmcv version in the requirements and, to avoid installing them twice, deleted torch>=1.1 and torchvision, like this: [image] These are my dist_train.sh settings: [image] The commented-out part is done locally, to reduce repeated work when uploading to the server. So if I don't pin the package versions, I get a mismatch error; if I do pin them, I get the error above.
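To see which entries pip is free to upgrade, a small scan of the requirements text can help. This is a minimal sketch: the helper name unpinned_requirements is hypothetical, the parsing is a simplified regular expression and not pip's full requirement grammar, and the sample text in the usage lines is illustrative, not the repository's actual requirements.txt.

```python
import re

# Matches "name<op>version"; any line without "==" is treated as unpinned.
_REQ = re.compile(r"^\s*([A-Za-z0-9_.\-]+)\s*(==|>=|<=|~=|!=|>|<)?")


def unpinned_requirements(text):
    """Return the names of requirements that are not pinned with '=='."""
    loose = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        match = _REQ.match(line)
        if match and match.group(2) != "==":
            loose.append(match.group(1))
    return loose


# Illustrative input only.
sample = "mmcv>=0.2.10\ntorch>=1.1\ntorchvision\nnumpy==1.16.0\n"
print(unpinned_requirements(sample))
```

Every name it reports is a package whose version can drift between installs, which is the behavior described above.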

zhxngjxnhxx commented 3 years ago

I only have 1 GPU. Is this setting correct? [image]

zhengye1995 commented 3 years ago

> I only have 1 GPU. Is this setting correct? [image]

Yes, this setting is correct.

zhengye1995 commented 3 years ago

> @zhengye1995 Thanks. Here is my situation: at first I followed the steps in your README exactly and did not modify any files.
>
> But sometimes even setup.py fails to run. I noticed that your command conda install pytorch=1.1.0 torchvision=0.3.0 cudatoolkit=10.0 -c pytorch conflicts with this line: pip install cython && pip --no-cache-dir install -r requirements.txt [image]
>
> In the requirements file, the version numbers are preceded by >=, so pip sometimes installs the latest version of a package. When the latest pytorch or mmcv gets installed, the versions no longer match, which eventually leads to runtime errors. Have you ever run a version compatibility test? In my previous attempts, I pinned only the mmcv version in the requirements and, to avoid installing them twice, deleted torch>=1.1 and torchvision, like this: [image] These are my dist_train.sh settings: [image] The commented-out part is done locally, to reduce repeated work when uploading to the server. So if I don't pin the package versions, I get a mismatch error; if I do pin them, I get the error above.

These are the versions of my Python packages: pytorch==1.1.0, mmcv==0.2.14, mmdet==1.0rc0 (after build) [image]
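A quick way to compare a local environment against these versions is sketched below. The helper names are hypothetical, and the sketch assumes the pip distribution names are "torch", "mmcv", and "mmdet". It uses a prefix match because a locally built mmdet may report a version with a build suffix after "1.0rc0".

```python
# Versions reported in this thread. Assumption: pip distribution names.
EXPECTED = {"torch": "1.1.0", "mmcv": "0.2.14", "mmdet": "1.0rc0"}


def installed_version(name):
    """Return the installed version string of a distribution, or None."""
    try:
        from importlib.metadata import PackageNotFoundError, version  # Python 3.8+
    except ImportError:
        import pkg_resources  # fallback for older interpreters

        try:
            return pkg_resources.get_distribution(name).version
        except pkg_resources.DistributionNotFound:
            return None
    try:
        return version(name)
    except PackageNotFoundError:
        return None


def find_mismatches(expected, lookup=installed_version):
    """Return {name: (wanted, found)} for packages that differ from expected."""
    mismatches = {}
    for name, wanted in expected.items():
        found = lookup(name)
        # Prefix match tolerates local build suffixes appended to the version.
        if found is None or not found.startswith(wanted):
            mismatches[name] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    for name, (wanted, found) in find_mismatches(EXPECTED).items():
        print(f"{name}: expected {wanted}, found {found}")
```

Running it before train.sh shows at a glance whether pip has replaced any of the three packages with a newer release.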

ldm0 commented 2 years ago

I solved this problem by specifying mmcv==0.2.14 in requirements.txt.

zhengye1995 commented 2 years ago

> I solved this problem by specifying mmcv==0.2.14 in requirements.txt.

Congratulations!