yuqinie98 / PatchTST

An official implementation of PatchTST: "A Time Series is Worth 64 Words: Long-term Forecasting with Transformers." (ICLR 2023) https://arxiv.org/abs/2211.14730
Apache License 2.0

Why use drop_last=True in test (and val) dataloader? #7

Closed · oguiza closed this issue 1 year ago

oguiza commented 1 year ago

Hi! First, thanks for the excellent paper and for sharing this repo. Great work!

I'd like to ask why you set drop_last=True in the test dataloader. By doing that, performance is not reported for all samples in the test dataset (some samples are dropped, which is not what we want). In addition, changing the batch size changes the number of samples performance is reported on. I've tested the difference by setting drop_last=False on the ILI dataset, and the result is worse than the published one, although it is still the best published result I've seen so far.
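
To make the effect concrete, here is a minimal, self-contained sketch (the dataset size and batch sizes are hypothetical, chosen only for illustration, and are not taken from this repo):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical test set of 170 samples, for illustration only.
dataset = TensorDataset(torch.arange(170).float())

for batch_size in (16, 32, 128):
    loader = DataLoader(dataset, batch_size=batch_size, drop_last=True)
    n_evaluated = sum(batch[0].shape[0] for batch in loader)
    print(f"batch_size={batch_size}: metrics computed on "
          f"{n_evaluated}/{len(dataset)} test samples")

# batch_size=16: metrics computed on 160/170 test samples
# batch_size=32: metrics computed on 160/170 test samples
# batch_size=128: metrics computed on 128/170 test samples
```

So not only are some samples never evaluated, but the set of evaluated samples silently changes with the batch size.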

I saw the same issue in the Autoformer repo and logged an issue there (https://github.com/thuml/Autoformer/issues/104); as a result, they've now updated the code.

BTW, this likely also affects other papers that build on a similar codebase to Autoformer's.

namctin commented 1 year ago

Hi @oguiza, that is a great catch, thank you very much! Our thinking was to use the same dataloader settings as prior work for a fair comparison. But you are correct that drop_last=True makes the number of test samples inconsistent across different choices of batch size, although this difference may not cause a significant change in performance. We will change that setting in the code.

For self-supervised learning, we accidentally used the default drop_last=False, which turns out to be the right choice, and the self-supervised results reported in the paper are for this setting.
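
For reference, the change amounts to flipping one flag where the test loader is built. A minimal sketch of an Autoformer-style loader factory (the function and argument names here are illustrative placeholders, not the exact code in this repo):

```python
from torch.utils.data import DataLoader

def make_loader(dataset, batch_size, flag):
    # Illustrative factory in the spirit of the Autoformer-style
    # data_provider; names are placeholders, not the repo's API.
    if flag == 'test':
        # drop_last=False: evaluate on every test sample,
        # independent of the batch size.
        return DataLoader(dataset, batch_size=batch_size,
                          shuffle=False, drop_last=False)
    # For training, drop_last=True is harmless and keeps
    # batch shapes uniform.
    return DataLoader(dataset, batch_size=batch_size,
                      shuffle=True, drop_last=True)
```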

oguiza commented 1 year ago

Thanks for your quick response @namctin. I just ran the script on the ILI dataset, and the difference was significant in my opinion. If the self-supervised performance is measured on all test samples, as you say, this might indicate that the gap between supervised and self-supervised is bigger than what's reflected in your paper. I'm sharing this just in case you want to investigate further.

namctin commented 1 year ago

Thank you for confirming the result @oguiza! Since ILI is a fairly small dataset, the setting of drop_last can noticeably affect the result; I would expect the effect to diminish for larger datasets. But in any case we should set drop_last=False, and we will fix that in the code.
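
A quick back-of-the-envelope check shows why small datasets are hit hardest: at most batch_size - 1 samples can be dropped, so the lost fraction shrinks as the test set grows (the sizes below are hypothetical, for illustration only):

```python
batch_size = 32
for n_test in (170, 2_800, 10_000):
    dropped = n_test % batch_size  # samples in the final, discarded batch
    print(f"n_test={n_test}: {dropped} samples dropped "
          f"({100 * dropped / n_test:.1f}% of the test set)")

# n_test=170: 10 samples dropped (5.9% of the test set)
# n_test=2800: 16 samples dropped (0.6% of the test set)
# n_test=10000: 16 samples dropped (0.2% of the test set)
```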

ts-kim commented 1 year ago

I discovered that part of the performance gain of supervised PatchTST comes from using drop_last=True for the test dataset.

With drop_last=False, there was a drop in performance on the ETTh1, Traffic, and Illness datasets, while there was no drop on the Weather dataset. I have not yet tested the other datasets.

I think this could be a significant problem, as it would require most of the tables in the paper to be revised, and it could be considered cheating.