Closed: cmhungsteve closed this issue 5 years ago
Hi,
Kindly note that the F1 score is slightly better with IDT features, and the features in this repository are the I3D features. Here are the results for each split on Breakfast using the I3D features.
split | F1@{10} | F1@{25} | F1@{50} | Edit | Acc |
---|---|---|---|---|---|
1 | 47.8051 | 43.4331 | 34.5099 | 59.2707 | 64.1876 |
2 | 48.0872 | 44.3350 | 35.8243 | 62.1063 | 66.3950 |
3 | 55.1467 | 50.9861 | 40.2597 | 64.2048 | 68.2139 |
4 | 59.2826 | 53.4884 | 41.1707 | 61.3738 | 66.2575 |
avg | 52.5804 | 48.0606 | 37.9411 | 61.7389 | 66.2635 |
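For reference, the F1@{k} columns in tables like this are the segmental F1 score at IoU overlap thresholds of 10/25/50%, as introduced for action segmentation by Lea et al. A minimal sketch of that metric (function and variable names here are illustrative, not taken from this repo):

```python
def get_segments(frames):
    """Collapse a frame-wise label sequence into (label, start, end) segments."""
    segments, start = [], 0
    for i in range(1, len(frames) + 1):
        if i == len(frames) or frames[i] != frames[start]:
            segments.append((frames[start], start, i))  # end index is exclusive
            start = i
    return segments

def f1_at_k(pred_frames, gt_frames, overlap=0.1):
    """Segmental F1 at a given IoU threshold (0.1 / 0.25 / 0.5)."""
    pred, gt = get_segments(pred_frames), get_segments(gt_frames)
    matched = [False] * len(gt)
    tp = 0
    for p_label, p_s, p_e in pred:
        best_iou, best_j = 0.0, -1
        for j, (g_label, g_s, g_e) in enumerate(gt):
            if g_label != p_label:
                continue
            inter = max(0, min(p_e, g_e) - max(p_s, g_s))
            union = max(p_e, g_e) - min(p_s, g_s)
            if inter / union > best_iou:
                best_iou, best_j = inter / union, j
        # a predicted segment is a true positive if it overlaps an
        # unmatched ground-truth segment of the same label above the threshold
        if best_iou >= overlap and best_j >= 0 and not matched[best_j]:
            tp += 1
            matched[best_j] = True
    fp, fn = len(pred) - tp, len(gt) - tp
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 100 * 2 * precision * recall / (precision + recall)
```

Because it operates on segments rather than frames, this metric can move independently of the frame-wise Acc column, which is why the tables below show large F1 swings at nearly constant accuracy.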
Thank you so much. Just want to double-check. You use batch size=1 and learning rate=0.0005, right?
You are welcome :) Yes, batch size=1 and learning rate=0.0005
Got it. Thank you!
I am also curious about the choice of batch size. Is there any reason you chose 1? What happens if you train with a larger batch size?
Thank you.
I enlarged the batch size to 2 and got the following results (I also ran batch size=1 for comparison):
batch size | F1@{10} | F1@{25} | F1@{50} | Edit | Acc |
---|---|---|---|---|---|
2 | 62.01 | 56.39 | 43.97 | 64.05 | 63.64 |
1 | 48.21 | 43.85 | 34.49 | 57.84 | 63.78 |
yours | 52.58 | 48.06 | 37.94 | 61.74 | 66.26 |
They look very different from what you shared. I am not sure where I made a mistake...
ps. I used the features you provided.
I think bz=1 was converging faster and works pretty well, but it might be worth trying different batch sizes. I don't know why your results are different. Is that the case for all splits? Please let me know if you figure it out.
I downloaded your code again and tried different batch sizes (I didn't change the learning rate). Here are the Breakfast results I got (split 1):
experiments | F1@{10} | F1@{25} | F1@{50} | Edit | Acc | training loss after 50 epochs |
---|---|---|---|---|---|---|
mine (bz=8) | 68.7934 | 63.3908 | 51.9681 | 68.5808 | 67.4688 | 0.397244 |
mine (bz=1) | 41.0637 | 37.6623 | 30.5813 | 58.1833 | 68.8587 | 2.066658 |
yours (bz=1) | 47.8051 | 43.4331 | 34.5099 | 59.2707 | 64.1876 | — |
For bz=1, I think the results are OK; the difference may come from different devices, the PyTorch version (I use 1.1.0), or some other randomness. However, it's weird that the results are so different when I train with bz=8. Actually, I found the loss is much smaller with bz=8... I am curious what happens if you try bz=8 on your machine.
Thank you.
The results look interesting! It seems that the accuracy is stable, though. However, regarding the loss: its value is normalized by the number of examples rather than the number of iterations. This means that you need to scale it by the bz; if you do that, it's not really smaller. One last remark: I wouldn't rely on the results of split 1 to analyze the effect of the batch size, since this split has the fewest test videos of all the splits. Regarding the variations in the results, the PyTorch/Python version might be the reason. I'll try to run the code with bz=8 and see how my results compare to yours ASAP.
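The normalization point can be illustrated with a toy example: when each iteration reports a batch-mean loss, summing those values over an epoch gives a smaller total at a larger batch size simply because there are fewer iterations, even if nothing else changes. A sketch with made-up numbers (the function name is illustrative, not from the repo):

```python
def epoch_loss(per_example_losses, batch_size):
    """Sum the per-iteration (batch-mean) losses over one epoch."""
    total = 0.0
    for i in range(0, len(per_example_losses), batch_size):
        batch = per_example_losses[i:i + batch_size]
        total += sum(batch) / len(batch)  # "mean" reduction within each batch
    return total

losses = [1.0] * 8                       # identical per-example losses

print(epoch_loss(losses, batch_size=1))  # 8 iterations, total 8.0
print(epoch_loss(losses, batch_size=8))  # 1 iteration, total 1.0
print(epoch_loss(losses, batch_size=8) * 8)  # rescaled by bz: comparable, 8.0
```

So a training loss of 0.397 at bz=8 corresponds to roughly 3.2 on the bz=1 scale, which is in the same ballpark as the 2.07 reported above rather than dramatically smaller.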
Thank you. I tried to run the same experiments again using pytorch 0.4.1, and I got this:
experiments | F1@{10} | F1@{25} | F1@{50} | Edit | Acc |
---|---|---|---|---|---|
mine (bz=8) | 73.4040 | 68.5140 | 56.6694 | 72.4442 | 70.5741 |
mine (bz=1) | 50.3233 | 45.9871 | 36.7060 | 62.6980 | 66.8303 |
yours (bz=1) | 47.8051 | 43.4331 | 34.5099 | 59.2707 | 64.1876 |
For bz=1, the numbers are quite similar to your paper's, but I am still not sure why the numbers increase so much after enlarging the batch size...
Is there any update on your experiments on the Breakfast dataset?
I got similar results to yours with bz=8 on split 1. I'm not sure if you would get similar improvements on the other splits or datasets. Have you tested that?
I got similar improvements on the other splits of Breakfast, but for the other two datasets, the batch size doesn't make such a big difference.
Yeah, I remember that I tested different values for the batch size and it was not affecting the results. But I never tried that on breakfast as the experiments take much longer time compared to the other datasets.
I guess it's not easy to catch this issue since you are the first one to use the F1 and Edit scores to evaluate on Breakfast.
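For context, the Edit score in these tables is a normalized Levenshtein (edit) distance between the predicted and ground-truth label sequences after collapsing repeated frame labels into segments. A minimal sketch (names are illustrative, not from this repo):

```python
def collapse(frames):
    """Keep one label per contiguous run of identical frame labels."""
    return [l for i, l in enumerate(frames) if i == 0 or frames[i - 1] != l]

def edit_score(pred_frames, gt_frames):
    """100 * (1 - normalized Levenshtein distance) between segment sequences."""
    p, g = collapse(pred_frames), collapse(gt_frames)
    m, n = len(p), len(g)
    # standard dynamic-programming edit distance
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if p[i - 1] == g[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return 100 * (1 - d[m][n] / max(m, n))
```

Like F1@k, this ignores segment durations entirely, so it penalizes over-segmentation in a way frame-wise accuracy cannot.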
@cmhungsteve Hi, did you solve the problem of the Breakfast results being much worse than the paper's? I also tried to replicate the experimental results on the Breakfast dataset, but I got 53.74 (Acc) on split 3.
Actually, my results are much better than the paper in terms of F1 scores and Edit scores. Acc is pretty similar, though.
@yabufarha Hello, it seems that the F1, Edit, and Acc results in your paper are averaged over the 4 splits of the Breakfast dataset (i.e., you train 4 individual models rather than ONE model on all the data in the 4 train splits, tested on the merged test splits)? Why do you do so? Is it an official benchmark rule?
Dear @jszgz , Yes, this is the standard protocol for the benchmark. Kindly note that these splits are different partitioning of the same data. So if you merge all the splits, you would basically train on the whole data and no testing data would be left.
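In other words, the protocol is 4-fold cross-validation: train and evaluate once per split, then report the mean. Using the per-split Acc numbers from the first table in this thread:

```python
# Per-split Acc values from the I3D table above (splits 1-4)
split_acc = [64.1876, 66.3950, 68.2139, 66.2575]

# The reported benchmark number is the mean over the four splits
avg_acc = sum(split_acc) / len(split_acc)
print(round(avg_acc, 4))  # 66.2635, matching the "avg" row
```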
Thank you so much @yabufarha, that was a careless question on my part...
Hello, I tried to replicate the experimental results shown in Table 10 of the paper. The results on GTEA and 50Salads look OK, but the results on Breakfast are much worse than those in the paper. I also found that for Breakfast, the performance varies a lot between splits. Did you encounter that as well? It would be great if you could share the performance of each split for the Breakfast dataset.
Thank you.