yxgeee / FD-GAN

[NeurIPS-2018] FD-GAN: Pose-guided Feature Distilling GAN for Robust Person Re-identification.
https://yxgeee.github.io/projects/fdgan.html

Results mismatch #6

Closed FightForCS closed 5 years ago

FightForCS commented 5 years ago

Hi, I have used your code and started Stage I training. All settings are the same except that I use a batch size of 160 (due to GPU memory limits). The test results are very poor:

First stage evaluation:
Mean AP: 23.9%
CMC Scores  market1501
  top-1          47.7%
  top-5          67.9%
  top-10         76.5%

Second stage evaluation:
Mean AP: 32.0%
CMC Scores  market1501
  top-1          62.7%
  top-5          81.7%
  top-10         86.9%

When I use your released baseline model, I get quite good results:

First stage evaluation:
Mean AP: 69.7%
CMC Scores  market1501
  top-1          86.8%
  top-5          95.4%
  top-10         97.4%

Second stage evaluation:
Mean AP: 72.7%
CMC Scores  market1501
  top-1          88.2%
  top-5          95.9%
  top-10         97.6%

Using a smaller batch size can lead to worse results, but it should not be this much worse. I have also noticed that the validation results during training are quite reasonable; for example, the validation result after 80 epochs is:

First stage evaluation:
Mean AP: 62.0%
Extract Embedding: [500/1427]   Time 0.000 (0.000)  Data 0.000 (0.000)  
Extract Embedding: [1000/1427]  Time 0.000 (0.000)  Data 0.000 (0.000)  
Second stage evaluation:
Mean AP: 82.4%

Is there something I am missing? Thanks. I am using Anaconda Python 3.6, PyTorch 0.3.1, and torchvision 0.2.0.

yxgeee commented 5 years ago

How many GPUs do you use during training?

FightForCS commented 5 years ago

How many GPUs do you use during training?

@yxgeee 2 GPU 1080Ti

yxgeee commented 5 years ago

I think the validation results during training are also not reasonable, since I got nearly 99% mAP on the validation set. Could you give me more details about your settings? You could slightly adjust the learning rates or the batch size per GPU (BatchNorm is sensitive to it). If you still cannot solve this problem, I will try running the baseline code with a smaller mini-batch in a few days.
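To see why the batch size per GPU matters for BatchNorm, a small illustration (the numbers assume the 2-GPU setup discussed in this thread; this is not from the FD-GAN code):

# Under nn.DataParallel, BatchNorm statistics are computed per GPU,
# over batch_size / n_gpus samples.
total_batch, n_gpus = 160, 2
bn_batch_per_gpu = total_batch // n_gpus  # 80 here, vs. 128 with -b 256 on 2 GPUs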

FightForCS commented 5 years ago

I think the validation results during training are also not reasonable, since I got nearly 99% mAP on the validation set. Could you give me more details about your settings? You could slightly adjust the learning rates or the batch size per GPU (BatchNorm is sensitive to it). If you still cannot solve this problem, I will try running the baseline code with a smaller mini-batch in a few days.

@yxgeee Actually, I only changed -b 256 to -b 160 in the following command:

python baseline.py -b 256 -j 4 -d market1501 -a resnet50 --combine-trainval \
                    --lr 0.01 --epochs 100 --step-size 40 --eval-step 5 \
                    --logs-dir /path/to/save/checkpoints/
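For reference, a hypothetical variant of that command with the batch size reduced and the learning rate scaled down in proportion (0.01 × 160/256 = 0.00625; linear scaling is a common heuristic, not something this thread or the released code prescribes):

python baseline.py -b 160 -j 4 -d market1501 -a resnet50 --combine-trainval \
                    --lr 0.00625 --epochs 100 --step-size 40 --eval-step 5 \
                    --logs-dir /path/to/save/checkpoints/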
yuedong0607 commented 5 years ago

Hi, I am having exactly the same issue. The only modification I made was changing -b 256 to -b 64.

First stage evaluation:
Mean AP: 27.2%
CMC Scores  market1501
  top-1          55.5%
  top-5          76.0%
  top-10         82.7%

Second stage evaluation:
Mean AP: 28.5%
CMC Scores  market1501
  top-1          66.4%
  top-5          78.1%
  top-10         82.7%

Any suggestion? Thanks in advance!

yxgeee commented 5 years ago

Thank you for your issue!

I found that it may be caused by the optimizer settings (a mismatch with my original code, introduced when I reconstructed it), and I have made a new commit to fix this problem: https://github.com/yxgeee/FD-GAN/commit/4133f0ea9f4e4eb1d914689f43c5d1169c216481

yxgeee commented 5 years ago

In our experiments, we found that performance may be slightly higher when the embed model's learning rate is kept at 10 times the base model's for the first 40 epochs, with the two then kept equal for the remaining epochs (41-100). (100 epochs of training, step size 40.)

An easy way to achieve the training scheme above (args.lr=0.01, args.step_size=40, args.epochs=100):

# Give the embed_model parameter group a 10x learning rate at construction.
optimizer = torch.optim.SGD([
    {'params': model.module.base_model.parameters()},
    {'params': model.module.embed_model.parameters(), 'lr': args.lr * 10},
], lr=args.lr, momentum=args.momentum, weight_decay=args.weight_decay)

# Decay the learning rate by 10x every step_size epochs; note that this
# assigns the same value to every parameter group.
def adjust_lr(epoch):
    lr = args.lr * (0.1 ** (epoch // args.step_size))
    for g in optimizer.param_groups:
        g['lr'] = lr

yxgeee commented 5 years ago

Hi, I am having exactly the same issue. The only modification I made was changing -b 256 to -b 64. [...]

I have tried training with a batch size of 64 (on the latest code):

Second stage evaluation:
Mean AP: 61.6%
CMC Scores  market1501
  top-1          81.0%
  top-5          92.3%
  top-10         95.4%
yuedong0607 commented 5 years ago

In our experiments, we found that performance may be slightly higher when the embed model's learning rate is kept at 10 times the base model's for the first 40 epochs [...]

Thanks a lot for your quick reply!

I have run your updated code and the previous issue is gone!

But I'm a little confused about the code above: doesn't the adjust_lr function set the learning rate to the same value for both the base and embed models from the very first adjustment?

yxgeee commented 5 years ago

Doesn't the adjust_lr function set the learning rate to the same value for both the base and embed models from the very first adjustment?

You are right; that was just a toy example. You would need to modify other code to fully achieve the scheme mentioned above. It is only a small trick (from experiments) and not so important, so you can simply follow the released code.
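For completeness, one way the scheme described above could actually be realized, as a sketch under the same setup as the toy example (it assumes parameter group 0 is base_model and group 1 is embed_model, matching the optimizer construction above; this is not code from the repository):

def adjust_lr(epoch):
    # Standard 10x decay every step_size epochs.
    lr = args.lr * (0.1 ** (epoch // args.step_size))
    for i, g in enumerate(optimizer.param_groups):
        if epoch < args.step_size and i == 1:
            # Keep the embed_model group at 10x the base rate,
            # but only during the first step_size epochs.
            g['lr'] = lr * 10
        else:
            g['lr'] = lr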

IvyYZ commented 5 years ago

How many GPUs do you use during training?

@yxgeee 2 GPU 1080Ti

I also use 2 GPU 1080Ti, but I have some problems. How can I solve them? I set batch size 32, and the dataset is Market1501 from [Baidu Pan].

First stage evaluation:
Mean AP: 69.7%
CMC Scores  market1501
  top-1          86.8%
  top-5          95.4%
  top-10         97.4%
/home/ubuntu/zy/FD-GAN/reid/evaluators.py:33: UserWarning: volatile was removed and now has no effect. Use `with torch.no_grad():` instead.
  pairwise_score[i, :, :] = model(Variable(probe_feature[i].view(1, -1).cuda(), volatile=True),
/home/ubuntu/zy/FD-GAN/reid/evaluators.py:34: UserWarning: volatile was removed and now has no effect. Use `with torch.no_grad():` instead.
  Variable(gallery_feature.cuda(), volatile=True))
Extract Embedding: [500/16483]  Time 0.002 (0.001)  Data 0.000 (0.000)
Extract Embedding: [1000/16483] Time 0.001 (0.001)  Data 0.000 (0.000)
Extract Embedding: [1500/16483] Time 0.001 (0.001)  Data 0.000 (0.000)
Extract Embedding: [2000/16483] Time 0.001 (0.001)  Data 0.000 (0.000)
Extract Embedding: [2500/16483] Time 0.002 (0.001)  Data 0.000 (0.000)
Extract Embedding: [3000/16483] Time 0.001 (0.001)  Data 0.000 (0.000)
Extract Embedding: [3500/16483] Time 0.001 (0.001)  Data 0.000 (0.000)
Extract Embedding: [4000/16483] Time 0.001 (0.001)  Data 0.000 (0.000)
Extract Embedding: [4500/16483] Time 0.006 (0.001)  Data 0.000 (0.000)
Extract Embedding: [5000/16483] Time 0.001 (0.001)  Data 0.000 (0.000)
Traceback (most recent call last):
  File "baseline.py", line 204, in <module>
    main(parser.parse_args())
  File "baseline.py", line 121, in main
    top1, mAP = evaluator.evaluate(test_loader, dataset.query, dataset.gallery, rerank_topk=100, dataset=args.dataset)
  File "/home/ubuntu/zy/FD-GAN/reid/evaluators.py", line 217, in evaluate
    query=query, topk_gallery=topk_gallery, rerank_topk=rerank_topk)
  File "/home/ubuntu/zy/FD-GAN/reid/evaluators.py", line 34, in extract_embeddings
    Variable(gallery_feature.cuda(), volatile=True))
  File "/home/ubuntu/.local/lib/python3.5/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/zy/FD-GAN/reid/models/embedding.py", line 27, in forward
    x = x1 - x2
RuntimeError: CUDA error: out of memory

yxgeee commented 5 years ago

I also use 2 GPU 1080Ti, but I have some problems. [...] RuntimeError: CUDA error: out of memory

Try a smaller batch size, e.g. 16.

IvyYZ commented 5 years ago

Try a smaller batch size, e.g. 16.

I get other runtime errors when I use a batch size smaller than 32.

IvyYZ commented 5 years ago

I have another question: does the gallery set contain the query set? I get a list like this:

Market1501 dataset loaded
  subset   | # ids | # images
  train    |   651 |    11509
  val      |   100 |     1427
  trainval |   751 |    12936
  query    |   750 |    16483
  gallery  |   751 |    19281

I modified the query set because of the out-of-memory error: I changed the query set to a quarter of the gallery set, and the gallery set does not include the query set. But I get a very bad result:

Second stage evaluation:
Mean AP: 0.3%
CMC Scores  market1501
  top-1           0.1%
  top-5           0.2%
  top-10          0.2%

yxgeee commented 5 years ago
Does the gallery set contain the query set? [...] I modified the query set because of the out-of-memory error. [...]

Please do not modify the gallery and query sets in the evaluation code. If you run out of memory, the solution is to use a smaller batch size.
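If the out-of-memory error comes from scoring one probe against the entire gallery at once (the x = x1 - x2 broadcast in reid/models/embedding.py), another workaround is to score the gallery in chunks. A minimal sketch, assuming the Siamese embed model takes a single probe feature plus a batch of gallery features and returns one score row per gallery image; this is not the repository's evaluator:

import torch

def pairwise_scores_chunked(model, probe_feature, gallery_feature, chunk_size=1000):
    # Score one probe against the gallery chunk by chunk, so the
    # broadcasted difference tensor never spans the full gallery.
    scores = []
    probe = probe_feature.view(1, -1).cuda()
    with torch.no_grad():  # replaces the removed volatile=True flag
        for start in range(0, gallery_feature.size(0), chunk_size):
            g = gallery_feature[start:start + chunk_size].cuda()
            scores.append(model(probe, g).cpu())
    return torch.cat(scores, dim=0)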