yuqinie98 / PatchTST

An official implementation of PatchTST: "A Time Series is Worth 64 Words: Long-term Forecasting with Transformers" (ICLR 2023). https://arxiv.org/abs/2211.14730
Apache License 2.0

Long run time? #30

Closed Eliav2479 closed 1 year ago

Eliav2479 commented 1 year ago

I want to congratulate you on the great patch transformer paper.

I want to ask a question: I have a dataset which I hold as a pandas dataframe.

Given some window size, I want to predict the next time step. That is, I want to predict only a single step into the future:

Given X1, ..., Xt, predict Xt+1.

As I understand it, if I want to use your model for this task I will need as many forward passes as the dataset size, since you are not using a causal mask in the transformer.
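For concreteness, here is a minimal sketch of the inference loop I mean (the names `model`, `series`, and `w` are mine, not from this repo):

```python
import torch
import torch.nn as nn

w, n_vars = 336, 7
series = torch.randn(10_000, n_vars)       # [T, n_vars] toy data

# Stand-in for a trained forecaster: maps a length-w window to one step.
model = nn.Sequential(nn.Flatten(), nn.Linear(w * n_vars, n_vars))

# Naive one-step-ahead evaluation: one forward pass per predicted step,
# so the cost grows linearly with the length of the dataset.
preds = []
with torch.no_grad():
    for t in range(w, len(series)):
        window = series[t - w:t].unsqueeze(0)   # [1, w, n_vars]
        preds.append(model(window))             # predict the value at time t
```

With a causal mask, all of these next-step predictions could in principle come from a single forward pass over the sequence.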

How can this be resolved?

Thanks

ikvision commented 1 year ago

For a single step into the future, would this help: `parser.add_argument('--target_points', type=int, default=1, help='forecast horizon')`?
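If it helps, here is how I read that flag (a minimal hypothetical sketch, not the repo's full parser):

```python
import argparse
import torch.nn as nn

parser = argparse.ArgumentParser()
parser.add_argument('--target_points', type=int, default=1, help='forecast horizon')
args = parser.parse_args(['--target_points', '1'])

# Hypothetical: the forecast head's output width equals the horizon,
# so target_points=1 makes the model emit exactly one future step per pass.
head = nn.Linear(128, args.target_points)
```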

The current masking in the code is random: https://github.com/yuqinie98/PatchTST/blob/e66adfdd4cc5ed9760bbfbfc6bf68d5afc82cbc6/PatchTST_self_supervised/src/callback/patch_mask.py#L113. Do you have a causal mask PyTorch implementation you are considering?
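For reference, this is roughly the MAE-style scheme I read there, paraphrased rather than copied from `patch_mask.py`:

```python
import torch

def random_patch_mask(x, mask_ratio=0.4):
    """x: [batch, num_patches, dim] -> (kept patches, boolean mask)."""
    bs, n, d = x.shape
    n_keep = int(n * (1 - mask_ratio))
    ids_shuffle = torch.rand(bs, n, device=x.device).argsort(dim=1)
    ids_keep = ids_shuffle[:, :n_keep]                    # random subset to keep
    x_kept = torch.gather(x, 1, ids_keep.unsqueeze(-1).expand(-1, -1, d))
    mask = torch.ones(bs, n, dtype=torch.bool, device=x.device)
    mask.scatter_(1, ids_keep, False)                     # True = masked out
    return x_kept, mask
```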

Eliav2479 commented 1 year ago

This does not address my question. I was talking about run-time issues.

ikvision commented 1 year ago

Training time can be reduced in many different ways: multi-GPU, a larger batch size, a faster data loader... Why do you think the causal mask is your main bottleneck?

Eliav2479 commented 1 year ago

Please read the question

ikvision commented 1 year ago

To make it clear, I didn't write this code/paper; like you, I am just using it. In the open-source community it is not always easy to understand each other. I would suggest being kinder in order to get assistance.

Eliav2479 commented 1 year ago

When you have a window of size H and a causal mask, you can predict H tokens in a single pass, as in the sketch below.
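A minimal PyTorch illustration of what I mean (my own sketch, not code from this repo):

```python
import torch
import torch.nn as nn

H, d_model = 96, 128
layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=3)
head = nn.Linear(d_model, 1)                     # next-step output per position

# Upper-triangular -inf mask: position i attends only to positions <= i.
causal = torch.triu(torch.full((H, H), float('-inf')), diagonal=1)

x = torch.randn(4, H, d_model)                   # [batch, H, d_model] window
z = encoder(x, mask=causal)
preds = head(z)                                  # [batch, H, 1]: H next-step
                                                 # predictions in a single pass
```

So one window yields H next-step predictions per forward pass instead of one.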

ikvision commented 1 year ago

Indeed, the method is patch-based, so it might not be the best fit for predicting a single data point. You might want to use only the pre-training stage (with patches) to create embeddings. For the second stage (fine-tuning), you can then fit a very simple regression from the embedding to a single time step (a 1-layer NN, without patches).
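A minimal sketch of what I mean, assuming a hypothetical frozen pre-trained `backbone` that maps a window to a pooled embedding:

```python
import torch
import torch.nn as nn

class OneStepHead(nn.Module):
    """1-layer regression from a frozen pre-trained embedding to one step."""
    def __init__(self, backbone, emb_dim, n_vars):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():
            p.requires_grad = False              # train only the head
        self.head = nn.Linear(emb_dim, n_vars)

    def forward(self, x):                        # x: [batch, window, n_vars]
        emb = self.backbone(x)                   # [batch, emb_dim]
        return self.head(emb)                    # [batch, n_vars] = next step

# Toy stand-in for the pre-trained patch encoder, pooled to one vector.
backbone = nn.Sequential(nn.Flatten(), nn.Linear(336 * 7, 128))
model = OneStepHead(backbone, emb_dim=128, n_vars=7)
out = model(torch.randn(4, 336, 7))              # [4, 7]
```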

Eliav2479 commented 1 year ago

I would suggest waiting for a response from the authors. Thank you for replying.

yuqinie98 commented 1 year ago

Thanks for asking @Eliav2479, and sorry for the late reply. Unfortunately we do not understand your question very well, so we would appreciate it if you could explain your concern in more detail. We basically agree with the solution that @ikvision proposed if you want to apply it to multi-step prediction. Alternatively, you could directly do multi-step forecasting (DMS rather than IMS, in the terminology of this paper: https://arxiv.org/pdf/2205.13504.pdf): the input is X1, ..., Xt and the output is Xt+1, ..., Xt+T, produced in one pass, as in the sketch below.
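For illustration, a minimal DMS head could look like this (a toy sketch, not the exact head in this repo):

```python
import torch
import torch.nn as nn

t_in, T = 336, 96                         # look-back window and forecast horizon

# Direct multi-step (DMS): one mapping from the whole input window to all
# T future steps, instead of iterating one-step (IMS) predictions T times.
dms_head = nn.Linear(t_in, T)

x = torch.randn(32, 7, t_in)              # [batch, n_vars, t_in], per-channel
forecast = dms_head(x)                    # [batch, 7, T] = Xt+1, ..., Xt+T
                                          # in a single forward pass
```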

DIKSHAAGARWAL2015 commented 1 year ago

Any estimate on how long it will take to run the supervised and self-supervised training with the default model and params?

yuqinie98 commented 1 year ago

It varies with the dataset, number of epochs, GPU, etc., so it is hard to answer. The fastest run may take half an hour, while the largest model takes a day. @DIKSHAAGARWAL2015