Closed: cmhungsteve closed this issue 5 years ago
Hi,
Kindly note that the F1 score is slightly better with IDT features, and the features in this repository are the I3D features. Here are the results for each split on Breakfast using the I3D features.
split | F1@{10} | F1@{25} | F1@{50} | Edit | Acc |
---|---|---|---|---|---|
1 | 47.8051 | 43.4331 | 34.5099 | 59.2707 | 64.1876 |
2 | 48.0872 | 44.3350 | 35.8243 | 62.1063 | 66.3950 |
3 | 55.1467 | 50.9861 | 40.2597 | 64.2048 | 68.2139 |
4 | 59.2826 | 53.4884 | 41.1707 | 61.3738 | 66.2575 |
avg | 52.5804 | 48.0606 | 37.9411 | 61.7389 | 66.2635 |
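For reference, the F1@{k} columns in tables like this are the segmental F1 score at IoU overlap thresholds of 10/25/50%, as introduced for action segmentation by Lea et al. A minimal sketch of that metric (function and variable names here are illustrative, not taken from this repo):

```python
def get_segments(frames):
    """Collapse a frame-wise label sequence into (label, start, end) segments."""
    segments, start = [], 0
    for i in range(1, len(frames) + 1):
        if i == len(frames) or frames[i] != frames[start]:
            segments.append((frames[start], start, i))  # end index is exclusive
            start = i
    return segments

def f1_at_k(pred_frames, gt_frames, overlap=0.1):
    """Segmental F1 at a given IoU threshold (0.1 / 0.25 / 0.5)."""
    pred, gt = get_segments(pred_frames), get_segments(gt_frames)
    matched = [False] * len(gt)
    tp = 0
    for p_label, p_s, p_e in pred:
        best_iou, best_j = 0.0, -1
        for j, (g_label, g_s, g_e) in enumerate(gt):
            if g_label != p_label:
                continue
            inter = max(0, min(p_e, g_e) - max(p_s, g_s))
            union = max(p_e, g_e) - min(p_s, g_s)
            if inter / union > best_iou:
                best_iou, best_j = inter / union, j
        # a predicted segment is a true positive if it overlaps an
        # unmatched ground-truth segment of the same label above the threshold
        if best_iou >= overlap and best_j >= 0 and not matched[best_j]:
            tp += 1
            matched[best_j] = True
    fp, fn = len(pred) - tp, len(gt) - tp
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 100 * 2 * precision * recall / (precision + recall)
```

Because it operates on segments rather than frames, this metric can move independently of the frame-wise Acc column, which is why the tables below show large F1 swings at nearly constant accuracy.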
Thank you so much. Just want to double-check. You use batch size=1 and learning rate=0.0005, right?
You are welcome :) Yes, batch size=1 and learning rate=0.0005
Got it. Thank you!
I am also curious about the choice of batch size. Is there any reason you chose 1? What happens if you train with a larger batch size?
Thank you.
I enlarged the batch size to 2 and got the following results (I also ran batch size=1 for comparison):
batch size | F1@{10} | F1@{25} | F1@{50} | Edit | Acc |
---|---|---|---|---|---|
2 | 62.01 | 56.39 | 43.97 | 64.05 | 63.64 |
1 | 48.21 | 43.85 | 34.49 | 57.84 | 63.78 |
yours | 52.58 | 48.06 | 37.94 | 61.74 | 66.26 |
They look very different from what you shared. I am not sure where I made a mistake...
ps. I used the features you provided.
I think bz=1 was converging faster and works pretty well, but it might be worth trying different batch sizes. I don't know why your results are different. Is that the case for all splits? Please let me know if you figure it out.
I downloaded your code again and tried different batch sizes (I didn't change the learning rate). Here are the Breakfast results I got (split 1):
experiments | F1@{10} | F1@{25} | F1@{50} | Edit | Acc | training loss after 50 epochs |
---|---|---|---|---|---|---|
mine (bz=8) | 68.7934 | 63.3908 | 51.9681 | 68.5808 | 67.4688 | 0.397244 |
mine (bz=1) | 41.0637 | 37.6623 | 30.5813 | 58.1833 | 68.8587 | 2.066658 |
yours (bz=1) | 47.8051 | 43.4331 | 34.5099 | 59.2707 | 64.1876 | — |
For bz=1, I think the results are OK; the difference may come from different devices, the PyTorch version (I use 1.1.0), or some other randomness. However, it's weird that the results are so different when I train with bz=8. Actually, I found the loss is much smaller with bz=8... I am curious what happens if you try bz=8 on your machine.
Thank you.
The results look interesting! It seems that the accuracy is stable, though. However, regarding the loss: its value is normalized by the number of examples rather than the number of iterations. This means that you need to scale it by the bz; if you do that, it's not really smaller. One last remark: I wouldn't rely on the results of split 1 to analyze the effect of the batch size, since this split has the fewest test videos of all the splits. Regarding the variations in the results, the PyTorch/Python version might be the reason. I'll try to run the code with bz=8 and see how my results compare to yours ASAP.
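The normalization point can be illustrated with a toy example: when each iteration reports a batch-mean loss, summing those values over an epoch gives a smaller total at a larger batch size simply because there are fewer iterations, even if nothing else changes. A sketch with made-up numbers (the function name is illustrative, not from the repo):

```python
def epoch_loss(per_example_losses, batch_size):
    """Sum the per-iteration (batch-mean) losses over one epoch."""
    total = 0.0
    for i in range(0, len(per_example_losses), batch_size):
        batch = per_example_losses[i:i + batch_size]
        total += sum(batch) / len(batch)  # "mean" reduction within each batch
    return total

losses = [1.0] * 8                       # identical per-example losses

print(epoch_loss(losses, batch_size=1))  # 8 iterations, total 8.0
print(epoch_loss(losses, batch_size=8))  # 1 iteration, total 1.0
print(epoch_loss(losses, batch_size=8) * 8)  # rescaled by bz: comparable, 8.0
```

So a training loss of 0.397 at bz=8 corresponds to roughly 3.2 on the bz=1 scale, which is in the same ballpark as the 2.07 reported above rather than dramatically smaller.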
Thank you. I tried to run the same experiments again using pytorch 0.4.1, and I got this:
experiments | F1@{10} | F1@{25} | F1@{50} | Edit | Acc |
---|---|---|---|---|---|
mine (bz=8) | 73.4040 | 68.5140 | 56.6694 | 72.4442 | 70.5741 |
mine (bz=1) | 50.3233 | 45.9871 | 36.7060 | 62.6980 | 66.8303 |
yours (bz=1) | 47.8051 | 43.4331 | 34.5099 | 59.2707 | 64.1876 |
For bz=1, the numbers are quite similar to your paper's, but I am still not sure why the numbers increase so much after enlarging the batch size...
Is there any update on your experiments on the Breakfast dataset?
I got similar results to yours with bz=8 on split 1. I'm not sure if you would get similar improvements on the other splits or datasets. Have you tested that?
I got similar improvements on the other splits of Breakfast, but for the other two datasets, the batch size doesn't make such a big difference.
Yeah, I remember that I tested different values for the batch size and it was not affecting the results. But I never tried that on breakfast as the experiments take much longer time compared to the other datasets.
I guess it's not easy to catch this issue since you are the first one to use the F1 and Edit scores to evaluate on Breakfast.
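For context, the Edit score in these tables is a normalized Levenshtein (edit) distance between the predicted and ground-truth label sequences after collapsing repeated frame labels into segments. A minimal sketch (names are illustrative, not from this repo):

```python
def collapse(frames):
    """Keep one label per contiguous run of identical frame labels."""
    return [l for i, l in enumerate(frames) if i == 0 or frames[i - 1] != l]

def edit_score(pred_frames, gt_frames):
    """100 * (1 - normalized Levenshtein distance) between segment sequences."""
    p, g = collapse(pred_frames), collapse(gt_frames)
    m, n = len(p), len(g)
    # standard dynamic-programming edit distance
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if p[i - 1] == g[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return 100 * (1 - d[m][n] / max(m, n))
```

Like F1@k, this ignores segment durations entirely, so it penalizes over-segmentation in a way frame-wise accuracy cannot.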
@cmhungsteve Hi, did you solve the problem of the Breakfast results being much worse than the paper's? I also tried to replicate the experimental results on the Breakfast dataset, but I got 53.74 (Acc) on split 3.
Actually, my results are much better than the paper in terms of F1 scores and Edit scores. Acc is pretty similar, though.
@yabufarha Hello, it seems that the F1, Edit, and Acc results in your paper are averaged over the 4 splits of the Breakfast dataset (i.e., you train 4 individual models rather than ONE model on all the data in the 4 train splits, tested on the merged test splits)? Why do you do so? Is it an official benchmark rule?
Dear @jszgz , Yes, this is the standard protocol for the benchmark. Kindly note that these splits are different partitioning of the same data. So if you merge all the splits, you would basically train on the whole data and no testing data would be left.
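In other words, the protocol is 4-fold cross-validation: train and evaluate once per split, then report the mean. Using the per-split Acc numbers from the first table in this thread:

```python
# Per-split Acc values from the I3D table above (splits 1-4)
split_acc = [64.1876, 66.3950, 68.2139, 66.2575]

# The reported benchmark number is the mean over the four splits
avg_acc = sum(split_acc) / len(split_acc)
print(round(avg_acc, 4))  # 66.2635, matching the "avg" row
```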
Thank you so much @yabufarha, that was a careless question on my part...
Hello, I tried to replicate the experimental results shown in Table 10 of the paper. The results on GTEA and 50Salads look OK, but the results on Breakfast are much worse than those in the paper. I also found that for Breakfast, the performance varies a lot between splits. Did you encounter that as well? It would be great if you could share the performance of each split for the Breakfast dataset.
Thank you.