yabufarha / ms-tcn


Performance on Breakfast #7

Closed cmhungsteve closed 5 years ago

cmhungsteve commented 5 years ago

Hello, I tried to replicate the experimental results shown in Table 10 of the paper. The results on GTEA and 50Salads look OK, but the results on Breakfast are much worse than those reported in the paper. I also found that, for Breakfast, the performance differs a lot from split to split. Did you encounter this as well? It would be great if you could share the per-split performance for the Breakfast dataset.

Thank you.

yabufarha commented 5 years ago

Hi,

Kindly note that the F1 score is slightly better with the IDT features, while the features in this repository are the I3D features. Here are the results for each split of Breakfast using the I3D features:

| split | F1@{10} | F1@{25} | F1@{50} | Edit | Acc |
|-------|---------|---------|---------|------|-----|
| 1 | 47.8051 | 43.4331 | 34.5099 | 59.2707 | 64.1876 |
| 2 | 48.0872 | 44.3350 | 35.8243 | 62.1063 | 66.3950 |
| 3 | 55.1467 | 50.9861 | 40.2597 | 64.2048 | 68.2139 |
| 4 | 59.2826 | 53.4884 | 41.1707 | 61.3738 | 66.2575 |
| avg | 52.5804 | 48.0606 | 37.9411 | 61.7389 | 66.2635 |

cmhungsteve commented 5 years ago

Thank you so much. Just want to double-check. You use batch size=1 and learning rate=0.0005, right?

yabufarha commented 5 years ago

You are welcome :) Yes, batch size=1 and learning rate=0.0005
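For anyone else reproducing this, here is a minimal sketch of those two values in PyTorch, assuming the Adam optimizer as in the paper; the tiny stand-in model is only there to make the snippet runnable and is not the MS-TCN architecture.

```python
# Hyperparameters confirmed above: batch size 1, learning rate 0.0005.
# The Conv1d below is a stand-in model, not MS-TCN.
import torch

batch_size = 1
learning_rate = 0.0005

model = torch.nn.Conv1d(2048, 48, kernel_size=1)  # placeholder for illustration
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
print(optimizer)
```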

cmhungsteve commented 5 years ago

Got it. Thank you!

cmhungsteve commented 5 years ago

I am also curious about the choice of batch size. Is there any reason you chose 1? What happens if you train with a larger batch size?

Thank you.

cmhungsteve commented 5 years ago

I just enlarged the batch size to 2 and got the following results (I also reran with batch size=1):

| batch size | F1@{10} | F1@{25} | F1@{50} | Edit | Acc |
|------------|---------|---------|---------|------|-----|
| 2 | 62.01 | 56.39 | 43.97 | 64.05 | 63.64 |
| 1 | 48.21 | 43.85 | 34.49 | 57.84 | 63.78 |
| yours | 52.58 | 48.06 | 37.94 | 61.74 | 66.26 |

They look very different from what you shared. I am not sure at which step I made a mistake...

P.S. I used the features you provided.
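Since the videos have different lengths, training with a batch size larger than 1 presumably relies on padding the shorter sequences and masking the padded frames when computing the loss. The sketch below illustrates that idea; the helper name `pad_collate` and the exact tensor layout are my own assumptions, not taken from the repository.

```python
# Minimal sketch of padding variable-length videos into one batch and masking
# the padded frames. Names and shapes here are illustrative assumptions.
import torch

def pad_collate(features, labels, num_classes):
    """features: list of (D, T_i) tensors; labels: list of (T_i,) label tensors."""
    max_len = max(f.shape[1] for f in features)
    dim = features[0].shape[0]
    batch_input = torch.zeros(len(features), dim, max_len)
    batch_target = torch.full((len(features), max_len), -100, dtype=torch.long)  # ignore index
    mask = torch.zeros(len(features), num_classes, max_len)
    for i, (f, y) in enumerate(zip(features, labels)):
        t = f.shape[1]
        batch_input[i, :, :t] = f
        batch_target[i, :t] = y
        mask[i, :, :t] = 1.0  # valid frames only
    return batch_input, batch_target, mask

# usage with two dummy videos of different lengths
feats = [torch.randn(2048, 300), torch.randn(2048, 450)]
labs = [torch.zeros(300, dtype=torch.long), torch.ones(450, dtype=torch.long)]
x, y, m = pad_collate(feats, labs, num_classes=48)
print(x.shape, y.shape, m.shape)  # (2, 2048, 450) (2, 450) (2, 48, 450)
```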

yabufarha commented 5 years ago

I think bz=1 converged faster and works pretty well, but it may be worth trying different batch sizes. I don't know why your results are different. Is that the case for all splits? Please let me know if you figure it out.

cmhungsteve commented 5 years ago

I downloaded your code again and tried different batch sizes (I didn't change the learning rate). Here are the Breakfast results I got (split 1):

| experiment | F1@{10} | F1@{25} | F1@{50} | Edit | Acc | training loss after 50 epochs |
|------------|---------|---------|---------|------|-----|-------------------------------|
| mine (bz=8) | 68.7934 | 63.3908 | 51.9681 | 68.5808 | 67.4688 | 0.397244 |
| mine (bz=1) | 41.0637 | 37.6623 | 30.5813 | 58.1833 | 68.8587 | 2.066658 |
| yours (bz=1) | 47.8051 | 43.4331 | 34.5099 | 59.2707 | 64.1876 | – |

For bz=1, I think the results are OK; the difference may come from different devices, the PyTorch version (I use 1.1.0), or other randomness. However, it's weird that the results are so different when I train with bz=8. Actually, I found the loss is much smaller with bz=8... I am curious what happens if you try bz=8 on your machine.

Thank you.

yabufarha commented 5 years ago

The results look interesting! It seems that the accuracy is stable, though. Regarding the loss, the printed value is normalized by the number of examples rather than the number of iterations, which means you need to scale it by the bz; if you do that, it's not really smaller. One last remark: I wouldn't rely on the results of split 1 to analyze the effect of the batch size, since this split has the fewest test videos compared to the other splits. As for the variations in the results, the PyTorch/Python version might be the reason. I'll try to run the code with bz=8 and see how my results compare to yours ASAP.
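To make the normalization remark concrete, here is a quick arithmetic sketch, assuming the printed loss is the sum of per-iteration losses divided by the number of training videos, while each per-iteration loss is already an average over the frames in that batch (all numbers hypothetical):

```python
# With a fixed per-iteration loss, the printed (per-example) value shrinks
# roughly by a factor of bz, so multiply by bz before comparing runs.
num_videos = 1600      # hypothetical number of training videos
per_iter_loss = 2.0    # hypothetical average loss per iteration

for bz in (1, 8):
    iterations = num_videos / bz
    printed = per_iter_loss * iterations / num_videos   # = per_iter_loss / bz
    print(f"bz={bz}: printed loss = {printed:.3f}, rescaled by bz = {printed * bz:.3f}")
# bz=1: printed loss = 2.000, rescaled by bz = 2.000
# bz=8: printed loss = 0.250, rescaled by bz = 2.000  -> comparable after scaling
```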

cmhungsteve commented 5 years ago

Thank you. I tried to run the same experiments again using pytorch 0.4.1, and I got this:

| experiment | F1@{10} | F1@{25} | F1@{50} | Edit | Acc |
|------------|---------|---------|---------|------|-----|
| mine (bz=8) | 73.4040 | 68.5140 | 56.6694 | 72.4442 | 70.5741 |
| mine (bz=1) | 50.3233 | 45.9871 | 36.7060 | 62.6980 | 66.8303 |
| yours (bz=1) | 47.8051 | 43.4331 | 34.5099 | 59.2707 | 64.1876 |

For bz=1, the numbers are quite similar to your paper's, but I am still not sure why they increase so much after enlarging the batch size...

cmhungsteve commented 5 years ago

Is there any update for your experiments on the Breakfast dataset?

yabufarha commented 5 years ago

I got similar results to yours with bz=8 on split 1. I'm not sure if you would get similar improvements on the other splits or datasets. Have you tested that?

cmhungsteve commented 5 years ago

I got similar improvements on the other splits of Breakfast, but for the other two datasets, the batch size doesn't make such a big difference.

yabufarha commented 5 years ago

Yeah, I remember that I tested different values for the batch size and it did not affect the results. But I never tried that on Breakfast, as the experiments take much longer compared to the other datasets.

cmhungsteve commented 5 years ago

I guess it's not easy to catch this issue, since you are the first to use the F1 and Edit scores to evaluate on Breakfast.
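For reference, the segmental Edit score compares the order of predicted segments with the ground-truth order via a normalized Levenshtein distance over segment labels. A minimal sketch of that idea (an illustration, not the repository's eval script):

```python
# Collapse frame-wise labels into segments, then compute a normalized
# Levenshtein distance over the segment label sequences (higher is better).
def segments(frame_labels):
    segs = []
    for lab in frame_labels:
        if not segs or segs[-1] != lab:
            segs.append(lab)
    return segs

def edit_score(pred_frames, gt_frames):
    p, g = segments(pred_frames), segments(gt_frames)
    m, n = len(p), len(g)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if p[i - 1] == g[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return (1 - d[m][n] / max(m, n)) * 100

print(edit_score(list("AAABBBCC"), list("AABBBBCD")))  # 75.0
```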

GuoxingY commented 4 years ago

@cmhungsteve Hi, did you solve the problem of the Breakfast results being much worse than the paper? I also tried to replicate the experimental results on the Breakfast dataset, but I only got 53.74 (Acc) on split 3.

cmhungsteve commented 4 years ago

Actually, my results are much better than the paper in terms of F1 scores and Edit scores. Acc is pretty similar, though.

jszgz commented 4 years ago

@yabufarha Hello, it seems that the F1, Edit, and Acc results in your paper are averaged over the 4 splits of the Breakfast dataset (i.e., you train 4 individual models rather than ONE model on all the data of the 4 training splits tested on the merged test splits)? Why do you choose to do so? Is it an official benchmark rule?

yabufarha commented 4 years ago

Dear @jszgz, yes, this is the standard protocol for the benchmark. Kindly note that these splits are different partitionings of the same data, so if you merged all the splits, you would basically train on the whole dataset and no test data would be left.
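Concretely, one model is trained per split, evaluated on that split's held-out test videos, and the reported numbers are the averages over the four splits. A tiny sketch using the per-split F1@{10} values quoted earlier in this thread:

```python
# Average the per-split metrics; values are the F1@{10} numbers shared above.
per_split_f1_10 = {1: 47.8051, 2: 48.0872, 3: 55.1467, 4: 59.2826}
avg = sum(per_split_f1_10.values()) / len(per_split_f1_10)
print(f"average F1@10 over 4 splits: {avg:.4f}")  # 52.5804, matching the table above
```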

jszgz commented 4 years ago

Thank you so much @yabufarha, my oversight for asking such a silly question...