zhen-he / tracking-by-animation

A PyTorch implementation of the "Tracking-by-Animation" algorithm published at CVPR 2019.

Can the project run on only camera1 of DukeMTMC? #7

Closed · 756537479 closed this 5 years ago

756537479 commented 5 years ago

I used the camera1 data of DukeMTMC to train and test, but the resulting training loss is 105 and the test loss is 4500. Is there an error in my procedure, or do I need to run all cameras to get the correct result?

zhen-he commented 5 years ago

Hi,

The default setting uses data from all cameras for training (`--subtask ''`), so you don't need to specify the subtask option. You only need to specify it for each camera during testing.

You can also use only the camera1 data for training and testing. But please first check the validation loss (it should be consistent with the training loss). The images for training and validation are masked by the ROI region. Testing uses the original images (without ROI masking), so the test loss is expected to be high. The reason for not using ROI-masked images at test time is that we do not care about the reconstruction loss during testing, only about bounding box accuracy (and the original images give better visualizations, as shown in the paper). However, you can also create an ROI-masked test set, in which case the test loss will be consistent with the training loss.
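
To illustrate the ROI masking described above, here is a minimal sketch under my own assumptions about tensor shapes and names; it is not the repo's actual preprocessing code:

```python
import torch

# Hedged sketch of ROI masking (shapes and names are assumptions): zero out
# pixels outside the region of interest so the reconstruction loss only
# measures the masked area.
def apply_roi_mask(images: torch.Tensor, roi_mask: torch.Tensor) -> torch.Tensor:
    """images: (T, C, H, W) float tensor; roi_mask: (H, W) binary tensor."""
    return images * roi_mask.view(1, 1, *roi_mask.shape)
```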

756537479 commented 5 years ago

thanks

zhen-he commented 5 years ago

For camera1, the ROI-masked test set is under data/duke/pt/camera1/metric/input, and the test set with the original images is under data/duke/pt/camera1/metric/org. You can switch by changing line 159 of run.py to `X_org_seq = torch.load(path.join(data_dir, 'input', filename))`.
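
In context, the edit looks roughly like this. Only the replacement line is quoted from the comment above; the original right-hand side and the import situation are my assumptions:

```python
# run.py, around line 159 (assuming run.py already imports torch and
# path from os.path)
# Original line (assumed): loads the un-masked 'org' test images
# X_org_seq = torch.load(path.join(data_dir, 'org', filename))
# Replacement: load the ROI-masked test set instead
X_org_seq = torch.load(path.join(data_dir, 'input', filename))
```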

756537479 commented 5 years ago

I ran the camera1 test and got:

```
recon: 1.000, tight: 0.000, entr: 0.000
Validation 776 / 776, loss = 4607.255
Final validation loss: 4558.872
```

The camera1 training loss is about 100, and nobody is recognized in the results.

zhen-he commented 5 years ago

If you only train on camera1, I suggest checking the training/validation curves with scripts/show_curve before testing.

To see the validation loss, you also need to change the training ratio in line 33 of scripts/gen_duke.py, i.e., set `train_ratio = 0 if arg.metric == 1 else 0.96` (the original training ratio is 1, which means no validation set is created).
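
In context, the change looks roughly like this; the surrounding code is an assumption, and only the new line is quoted from the comment above:

```python
# scripts/gen_duke.py, line 33
# train_ratio = 1  # original: all sequences go to training, no validation set
train_ratio = 0 if arg.metric == 1 else 0.96  # hold out 4% for validation
```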

If you find that both the training and validation curves are OK (e.g., under 60), then you can try testing.

In my experiments, I used all the data, and the final training loss is about 40.

[Figure: training loss curve]

Moreover, I'm not sure what the result would be if you only use the camera1 data, since the model might overfit with less data.

756537479 commented 5 years ago

OK, thanks, I got it. My training loss is fluctuating around 100.

756537479 commented 5 years ago

I used data from all Duke cameras to train, with the default config. The loss varies between roughly 40 and 60. Is this normal? I plotted the validation loss per epoch and found the loss oscillating between 40 and 70; every epoch's curve fluctuates in the same way, and the loss shows no downward trend.

zhen-he commented 5 years ago

It might be caused by vanishing gradients during backpropagation. We have encountered this problem before, but not very frequently. In the future we'll release a more stable version for training on Duke, but as I currently do not have enough GPUs to run the code, it might take some time :(

I have updated the loss module, and you might give it a try. The new loss uses an additional reconstruction term to make training easier.
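
As a rough sketch of what such an extra term could look like (my assumption, not the released code; the recon/tight/entr names follow the values printed during training, and `lam`, `x_hat`, `x` are illustrative):

```python
import torch
import torch.nn.functional as F

# Hedged sketch of an auxiliary reconstruction term: penalize the pixel-wise
# error between the rendered reconstruction and the input frame, giving the
# model a dense gradient signal early in training.
def aux_recon_loss(recon_frames: torch.Tensor, input_frames: torch.Tensor) -> torch.Tensor:
    return F.mse_loss(recon_frames, input_frames)

# total_loss = recon + tight + entr + lam * aux_recon_loss(x_hat, x)  # lam assumed
```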

756537479 commented 5 years ago

I tried the new code; the initial loss is higher, but the training loss still does not drop. Is there any other way? Thank you.

zhen-he commented 5 years ago

How many iterations have you trained the model for? Typically the loss starts to drop after 10,000 to 15,000 iterations.

756537479 commented 5 years ago

OK. It has already run 3000 iterations.

zhen-he commented 5 years ago

You can watch the gradient of C_o_seq (printed as grad_C_o_seq in the command window) during training. If it's zero, the gradient has vanished. In the normal case, it should be around 10^-6 to 10^-4.
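
If you want to log such a gradient yourself, here is a minimal sketch; C_o_seq is the repo's tensor name, but the helper below is illustrative:

```python
import torch

# Hedged sketch of gradient monitoring in PyTorch. For a non-leaf tensor
# like an intermediate sequence, call .retain_grad() before backward() so
# its .grad field is populated.
def grad_magnitude(t: torch.Tensor) -> float:
    return 0.0 if t.grad is None else t.grad.abs().max().item()

# C_o_seq.retain_grad()  # before loss.backward()
# loss.backward()
# print('grad_C_o_seq:', grad_magnitude(C_o_seq))  # healthy: ~1e-6 to 1e-4
```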

756537479 commented 5 years ago

```
1.00000e-02 *
  4.4214  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000
[torch.cuda.FloatTensor of size 1x10 (GPU 0)]

recon: 0.998, tight: 0.001, entr: 0.001  grad_C_o_seq: 2.7886064568605207e-09
Epoch: 7.61/8000, iter: 3757/3952000, batch: 299/494, loss: 109.606
```

It seems to be below the normal range.