yanxp / MetaR-CNN

Meta R-CNN : Towards General Solver for Instance-level Low-shot Learning
https://yanxp.github.io/metarcnn.html

Some problem about the phase2 training #5

Closed lzhnb closed 4 years ago

lzhnb commented 4 years ago

Phase 1 training has finished. It hit some dataset-loading errors along the way; I was running it inside screen and forgot to record them, but fortunately I got the weights for epoch 20. My first question is about your example command for phase 2 training. I put my modification into a pull request, but I still have some problems.

$>CUDA_VISIBLE_DEVICES=0 python test_metarcnn.py --dataset pascal_voc_0712 --net metarcnn --load_dir models/meta/fisrt  --checksession 10 --checkepoch 30 --checkpoint 111 --shots 10  --meta_type 1 --meta_test True --meta_loss True

I looked at your saving code:

        if args.meta_train:
            save_name = os.path.join(output_dir,
                                     '{}_{}_{}_{}_{}.pth'.format(str(args.dataset), str(args.net), shots, epoch,
                                                                 step))
        else:
            save_name = os.path.join(output_dir, '{}_{}_{}_{}.pth'.format(str(args.dataset), str(args.net),
                                                                          epoch, step))

My weight file is named pascal_voc_0712_metarcnn_200_20_1540.pth, which depends on the dataset, network, shots, epoch and step.

But in your loading code:

    if args.resume:
        load_name = os.path.join(output_dir,
                                 '{}_metarcnn_{}_{}_{}.pth'.format(args.dataset, args.checksession,
                                                                   args.checkepoch, args.checkpoint))
        print("loading checkpoint %s" % (load_name))
        checkpoint = torch.load(load_name)
        args.session = checkpoint['session']
        args.start_epoch = checkpoint['epoch']
        # the number of classes in second phase is different from first phase
        if args.phase == 2:
            new_state_dict = OrderedDict()
            # initilize params of RCNN_cls_score and RCNN_bbox_pred for second phase
            RCNN_cls_score = nn.Linear(2048, imdb.num_classes)
            RCNN_bbox_pred = nn.Linear(2048, 4 * imdb.num_classes)
            for k, v in checkpoint['model'].items():
                name = k
                new_state_dict[name] = v
                if 'RCNN_cls_score.weight' in k:
                    new_state_dict[name] = RCNN_cls_score.weight
                if 'RCNN_cls_score.bias' in k:
                    new_state_dict[name] = RCNN_cls_score.bias
                if 'RCNN_bbox_pred.weight' in k:
                    new_state_dict[name] = RCNN_bbox_pred.weight
                if 'RCNN_bbox_pred.bias' in k:
                    new_state_dict[name] = RCNN_bbox_pred.bias
            fasterRCNN.load_state_dict(new_state_dict)

The name here depends on the dataset, the network (constant), checksession, checkepoch and checkpoint. checksession and checkpoint differ from the training parameters used in the saved weight's name (shots and step), right?
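One way to avoid this kind of mismatch is to build the file name for saving and for loading from a single helper. The following is only a minimal sketch, not the repository's code: `checkpoint_name` and its parameters are hypothetical, and the field order simply mirrors the two format strings quoted above.

```python
import os


def checkpoint_name(output_dir, dataset, net, epoch, step, shots=None):
    """Build a checkpoint path (hypothetical helper, not part of MetaR-CNN).

    With shots given, the name follows {dataset}_{net}_{shots}_{epoch}_{step}.pth;
    without it, {dataset}_{net}_{epoch}_{step}.pth.
    """
    fields = [str(dataset), str(net)]
    if shots is not None:
        fields.append(str(shots))
    fields += [str(epoch), str(step)]
    return os.path.join(output_dir, "_".join(fields) + ".pth")


# Values taken from the weight file name mentioned above.
saved = checkpoint_name("models/meta/first", "pascal_voc_0712", "metarcnn",
                        epoch=20, step=1540, shots=200)
print(saved)
```

If the training and loading code both called a helper like this with the same arguments, the two paths could not drift apart.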

I also changed it myself, but loading the weights failed:

Loading pretrained weights from data/pretrained_model/resnet101_caffe.pth
loading checkpoint models/meta/first/pascal_voc_0712_metarcnn_200_20_1540.pth
Traceback (most recent call last):
  File "train_metarcnn.py", line 300, in <module>
    fasterRCNN.load_state_dict(new_state_dict)
  File "/home/lzhnb/anaconda3/envs/pytorch/lib/python3.5/site-packages/torch/nn/modules/module.py", line 526, in load_state_dict
    raise KeyError('missing keys in state_dict: "{}"'.format(missing))
KeyError: 'missing keys in state_dict: "{\'Meta_cls_score.weight\', \'Meta_cls_score.bias\'}"'

Thanks

lzhnb commented 4 years ago

Oh, I found that I had copied only part of the training command, so meta_loss=True was missing in phase 1 and the checkpoint lacks the meta loss branch.

But I still think the parameter should be changed.
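For a missing-keys error like the one above, it can help to compare the key names before calling load_state_dict. This is a minimal sketch under assumptions: `diff_state_dict_keys` is a hypothetical helper, and the toy key lists stand in for `fasterRCNN.state_dict().keys()` and `checkpoint['model'].keys()`.

```python
def diff_state_dict_keys(model_keys, checkpoint_keys):
    """Return (missing, unexpected) parameter names as sorted lists.

    missing:    names the model expects but the checkpoint lacks
    unexpected: names in the checkpoint that the model does not have
    """
    model_keys = set(model_keys)
    checkpoint_keys = set(checkpoint_keys)
    missing = sorted(model_keys - checkpoint_keys)
    unexpected = sorted(checkpoint_keys - model_keys)
    return missing, unexpected


# Toy key lists (assumed for illustration): a model built with the meta loss
# branch, and a checkpoint trained without it.
model_keys = ["RCNN_cls_score.weight", "RCNN_cls_score.bias",
              "Meta_cls_score.weight", "Meta_cls_score.bias"]
checkpoint_keys = ["RCNN_cls_score.weight", "RCNN_cls_score.bias"]

missing, unexpected = diff_state_dict_keys(model_keys, checkpoint_keys)
print("missing from checkpoint:", missing)
print("unexpected in checkpoint:", unexpected)
```

Here the two Meta_cls_score entries show up as missing, which matches the KeyError in the traceback.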

yanxp commented 4 years ago

Yes, the saved weight's name differs depending on the training parameters. Thanks for the reminder; the parameter has been modified.

lzhnb commented 4 years ago

I've finished the first phase of training and also read your testing code. I have a few questions about the whole process:

  1. The first phase chooses 15 voc_classes and takes 200 instances of each class as the support_set (meta_dataset), so shots=200 is the number of instances in each class set, right? And during training, the metadataset is shuffled each time, 1 shot is picked for each of the 15 classes to generate the attentions, and batch_size images are selected as the query, right?
  2. I'm about to begin the second phase of training. It first selects 10 shots for each class and then builds the roidb for the query, which contains all base instances and 10 instances for each novel class, right?
  3. According to your testing code, the attentions are generated outside the model, and this function runs after training:
    • I got the mean attentions after the first phase of training, in a file named 1_shots_1_mean_class_attentions.pkl. It should be the mean over 200 (the real number of shots), and I think it is useless, right? The valid attentions should be generated after the second phase of training and named 2_shots_10_mean_class_attentions.pkl for all 20 classes in your training example, right?
    • I think the mean_class_attentions can be generated independently, so can I use the weights trained with 10 shots to generate the 30-shot mean attentions? Would that not work? Could I even use the first-phase weights to generate the mean_class_attentions without the 'finetuning'? I want to analyze the generalization, and for few-shot learning the novel classes should not affect the model. Are the baselines listed in your paper trained independently?

XiongweiWu commented 4 years ago

@lzhnb What hardware did you use to run the first stage; in other words, how much memory is required? It seems 10 GB GPU cards cannot reproduce the baseline because of memory limits.

lzhnb commented 4 years ago

My GPU is a Titan Xp with 12 GB of memory.

XiongweiWu commented 4 years ago

@yanxp Hi, can you provide the pre-trained ResNet-50 and ResNet-34 models converted from Caffe? I found that the PyTorch pretrained models perform worse than the Caffe version.

yanxp commented 4 years ago

@lzhnb

  1. shots=200 is the number of shots for the base classes in the first phase. Yes, during training it picks 1 shot from each of the 15 base classes as the input to the PRN network.
  2. Yes, it contains the base classes (3*shots) and the novel classes (shots).
  3. (1) After the first phase, it saves 1_shots_200_mean_class_attentions.pkl, which is only used for evaluating the base classes in the first phase and is otherwise useless. You are right about that. (2) Yes, the mean_class_attentions can be generated independently. In our paper, we use the k-shot weights to generate the k-shot mean-class-attentions. But you can try it; I think it will not change the results much.
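To make the averaging step concrete, here is a minimal sketch of how per-shot attention vectors could be reduced to one mean vector per class. It is an illustration only, not the repository's implementation: `mean_class_attentions` as a function, the dict layout and the toy vectors are all assumptions, and in the real code the vectors would come from the network rather than from hand-written lists.

```python
def mean_class_attentions(shot_attentions):
    """Average the per-shot attention vectors of each class.

    shot_attentions maps a class name to a list of equal-length vectors,
    one vector per support shot (layout assumed for illustration).
    """
    means = {}
    for cls, vectors in shot_attentions.items():
        if not vectors:
            raise ValueError("no shots for class %r" % cls)
        dim = len(vectors[0])
        # Element-wise mean over all shots of this class.
        means[cls] = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    return means


# Toy data: two shots for one class, one shot for another.
toy = {"bird": [[1.0, 0.0], [3.0, 2.0]], "bus": [[0.5, 0.5]]}
means = mean_class_attentions(toy)
print(means)
```

Because the mean depends only on which shots are fed in, the same trained weights could in principle be used with a different number of support shots, which is the experiment discussed above.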

yanxp commented 4 years ago

@XiongweiWu We just used the PyTorch pretrained models of ResNet-50 and ResNet-34.