Hi, thanks! Yes, you are right. Evaluating every 5 epochs can be the reason: at a lower evaluation frequency we may miss out on a better epoch. Another reason can simply be the variance between different training runs. Evaluating at every epoch probably helps to get better numbers.
We used the weights from the best epoch on the val set of each dataset for evaluation. Strictly speaking, this is not entirely fair when the test set is also the val set, but is fair otherwise. This situation is also not ideal, as it means that training does not always converge neatly by the last epoch, although later epochs do produce better weights in general. The difference between the last-epoch weights and the best-epoch weights should be small.
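As a rough sketch, the best-epoch selection amounts to something like this (placeholder names only; `train_one_epoch` and `evaluate` are stand-ins, not the actual functions in this repo):

```python
import torch

def train_with_best_checkpoint(model, train_one_epoch, evaluate, num_epochs):
    """Keep the weights of the epoch with the highest val oIoU.

    Sketch only: train_one_epoch and evaluate are placeholders for the real
    training and validation routines; they are not this repo's functions.
    """
    best_oiou = -1.0
    for epoch in range(num_epochs):
        train_one_epoch(model, epoch)                              # one training epoch
        oiou = evaluate(model)                                     # val-set oIoU after this epoch
        torch.save(model.state_dict(), "last_checkpoint.pth")      # always keep the latest epoch
        if oiou > best_oiou:                                       # new best epoch on the val set
            best_oiou = oiou
            torch.save(model.state_dict(), "best_checkpoint.pth")
    return best_oiou
```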
That being said, even the best-epoch weights are just an estimate of the best weights produced during training, because evaluation during training is done on one randomly drawn expression per object (while there are about 3 expressions per object). This is just to save some time during training. Full evaluation on all expressions is done when running test.py.
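Schematically, the difference between the two evaluation modes is something like the following (the `"sentences"` field is an assumed annotation layout, not the repo's actual data structure):

```python
import random

def expressions_for_eval(refs, full_eval=False):
    """Yield the expressions to score for each object.

    Schematic only; the "sentences" field name is an assumption about the
    annotation format, not the repo's actual data loader.
    """
    for ref in refs:
        sentences = ref["sentences"]          # roughly 3 expressions per object
        if full_eval:
            yield from sentences              # full evaluation, as in test.py
        else:
            yield random.choice(sentences)    # one random expression during training
```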
Got it. Could you please report the average oIoU on the refcoco+ and refcocog datasets, and the variance between different runs? That would be a great help.
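I just mean simple statistics over the runs, e.g. (placeholder numbers only, not real results):

```python
import statistics

# Placeholder numbers only, to show the statistics being asked for.
oious = [61.2, 61.8, 62.1]                                  # hypothetical oIoU per run
print("mean oIoU:", round(statistics.mean(oious), 2))
print("variance:", round(statistics.variance(oious), 3))    # sample variance
print("std dev:", round(statistics.stdev(oious), 3))
```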
Ok. I plan to run a few tests with the new, more efficient implementation and can report the numbers in the README then.
Hi, unfortunately, because of a change in circumstances I no longer have the compute to run multiple trainings on RefCOCO+ and RefCOCOg. But please take a look at the new lavt_one implementation, its log, and the released weights. If you find that the old implementation has a variance issue on those two datasets, give the new lavt_one implementation a try; it likely solves the problem.
Hi, thanks for the excellent work! I ran the training command in README.md to train and evaluate on the refcoco+ dataset, but got only about 61 oIoU, which is slightly lower than the reported number. I guess that is because I changed the evaluation frequency to every 5 epochs, while the released code evaluates every epoch, so I'm wondering if I should switch back to the original code to get the reported number (about 62). There is another question as well: should I take the result from the best-epoch model, or just use the final-epoch model (epoch 40) for evaluation?
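For context, the change I made is essentially a guard like this around the evaluation step (placeholder names, not the actual training script):

```python
def train(model, train_one_epoch, evaluate, num_epochs=40, eval_every=5):
    """Sketch of the modification described above; names are placeholders."""
    for epoch in range(num_epochs):
        train_one_epoch(model, epoch)
        # Evaluate only every `eval_every` epochs, plus the final epoch.
        if (epoch + 1) % eval_every == 0 or epoch == num_epochs - 1:
            oiou = evaluate(model)
            print(f"epoch {epoch}: oIoU = {oiou:.2f}")
```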