yuqinie98 / PatchTST

An official implementation of PatchTST: "A Time Series is Worth 64 Words: Long-term Forecasting with Transformers." (ICLR 2023) https://arxiv.org/abs/2211.14730
Apache License 2.0

multivariate? #29

Closed kashif closed 1 year ago

kashif commented 1 year ago

Hello! In the paper you state that you have a multivariate method; however, as far as I understand, each variate (or channel) is processed independently, and the emission is a point forecast that is also produced independently per channel.

Can you kindly clarify which part is multivariate? As far as I understand, the only multivariate aspect is that the input data is a vector of size M at each time point. I actually see this as a negative: after patching you end up with M * (number of patches) vectors, so wouldn't the compute and memory of a vanilla Transformer encoder be quadratic in M? With univariate inputs you would at least avoid the O(M^2) issue.
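To make the concern concrete, a rough count of what I had in mind (illustrative numbers only, not taken from the repo):

```python
# If all M * N patch vectors attended to each other in one vanilla Transformer
# encoder, the attention cost per sample would be (M * N)^2, i.e. quadratic in M.
M, N = 7, 42                    # e.g. M variates, N patches per variate (illustrative)
tokens = M * N                  # 294 tokens in a single joint sequence
attention_scores = tokens ** 2  # ~86k attention scores per head
print(tokens, attention_scores)
```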

Thank you for any insight!

namctin commented 1 year ago

As mentioned in the paper, we provide another framework for multivariate time series forecasting/regression/classification in which channels go into the Transformer in parallel instead of being mixed. We use the vanilla Transformer to demonstrate the idea, but that does not mean we have to stick with it; you can use any other architecture that exploits the correlation structure in your data. We will release additional code for that soon. The computation and memory are not quadratic in M but linear.
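A minimal sketch of the channel-independent flow (illustrative shapes and module names, not the repository's exact code): channels are folded into the batch dimension, so attention is only over the patches of a single channel and the cost grows linearly with the number of channels.

```python
import torch
import torch.nn as nn

# Illustrative sketch of channel-independence (shapes and names are not the repo's exact code).
bs, nvars, patch_num, d_model = 32, 7, 42, 128

layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
shared_encoder = nn.TransformerEncoder(layer, num_layers=3)   # one set of weights for all channels

x = torch.randn(bs, nvars, patch_num, d_model)      # patched and embedded input
x = x.reshape(bs * nvars, patch_num, d_model)       # fold channels into the batch dimension
z = shared_encoder(x)                                # attention is over patch_num tokens per channel
z = z.reshape(bs, nvars, patch_num, d_model)         # unfold back to per-channel representations
# cost per sample ~ nvars * patch_num^2, i.e. linear in the number of channels
```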

Thank you for the question!

kashif commented 1 year ago

Thank you for your quick response. I believe the issue is exactly that the channels (variates) go through in parallel, as you say, in the batch dimension, and are afterward reshaped back to the multivariate dimension. That is what makes PatchTST a univariate model, and it was the source of my confusion when reading your paper.

https://github.com/yuqinie98/PatchTST/blob/main/PatchTST_supervised/layers/PatchTST_backbone.py#L164

Just because you give the model batches containing all the variates during training and predict all the variates in parallel during inference does not make it a multivariate model. In fact, at inference time all deep-learning-based univariate models predict the variates in appropriately sized batches.

The recent TSMixer paper from Google, which cites PatchTST, also places this model in the univariate taxonomy.

Also note that the metrics you report in Table 3 are not quite correct: MAE, for example, is in the units of the data, and of the datasets you consider, Traffic is the only one whose values lie in the range [0, 1]. I believe you are actually reporting NMAE and NMSE.
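For illustration (synthetic numbers, assuming the usual standardization in these pipelines), MAE computed on standardized data is just the MAE in original units divided by the standard deviation, which is why the reported values look so small:

```python
import numpy as np

# Synthetic illustration: MAE is scale-dependent, so MAE on standardized data
# (effectively NMAE) is not the same as MAE in the original units of the series.
rng = np.random.default_rng(0)
y_true = rng.normal(loc=500.0, scale=50.0, size=1000)   # raw series in arbitrary units
y_pred = y_true + rng.normal(scale=10.0, size=1000)     # an imperfect forecast

mae_raw = np.mean(np.abs(y_pred - y_true))              # in the units of the data

mu, sigma = y_true.mean(), y_true.std()                 # in practice: train-split statistics
mae_std = np.mean(np.abs((y_pred - mu) / sigma - (y_true - mu) / sigma))

print(mae_raw, mae_std)   # mae_std == mae_raw / sigma
```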

yuqinie98 commented 1 year ago

Hello,

First, as we mention in the paper, we are dealing with the "multivariate time series forecasting task", and the way we choose to solve it is with "channel-independence". You can also refer to this paper https://arxiv.org/pdf/2205.13504.pdf for the same usage of "multivariate", as they also use a channel-independent structure. The TSMixer paper you mentioned categorizes PatchTST as "multivariate input" and "multivariate output", which is exactly what we claim in the paper, so the term is not misused.

Second, for the metrics, we report results on normalized data to stay consistent with the result tables in the previous baseline papers, such as https://github.com/cure-lab/LTSF-Linear, https://github.com/zhouhaoyi/Informer2020, https://github.com/thuml/Autoformer, and https://github.com/MAZiqing/FEDformer.

Thanks again for your attention and comments on our work!

lesego94 commented 1 year ago

Hi @kashif, I am working on a master's thesis and was also concerned about channel independence. Thank you both for clarifying this issue. Kashif, could you suggest an alternative model that I could use? The TSMixer paper is great, but they haven't made their code available. Thank you!

kashif commented 1 year ago

Sorry for the late reply @yuqinie98 and @namctin. I went over your paper again and realize that you are referring to the dataset and task; however, reading the paper gave me the impression that the model itself was multivariate. The model is univariate, and in fact any univariate model can be used for the "multivariate forecasting" task by simply predicting each variate independently (which you call "channel independence").

The TSMixer table you refer to notes that the model can take a multivariate input and produce a multivariate output (which, as mentioned above, is a property of any univariate model, since it can take the variates in the batch dimension and predict the multivariate vector independently, as done here). I was referring more to

[screenshot: 2023-03-27 at 19:46:42] and [screenshot: 2023-03-27 at 19:47:25], which were the source of my confusion; that is cleared up now, I suppose.

Also, I apologize for baiting you with the metric remark (which is valid): I knew you would say it follows the works you mentioned, which was my point. Bad conventions and confused notation have spread to the point where anyone reading your paper, or others with similar comparisons, will not know that the data is actually standardized and that the metrics are computed over the standardized test set (i.e. NMAE and NMSE). I believe your paper does not mention this fact, and neither do a number of others. Anyway, I will close this issue.

@lesego94 go with PatchTST; I had no technical concerns, I was just confused.

kashif commented 1 year ago

Ah sorry, you do mention it in Appendix B1! My bad, I'll stop embarrassing myself for one day! Ah, my bad x2, I had the TSMixer paper open...

lesego94 commented 1 year ago

Guys, I am still trying to understand. Simple question: does the information from one time series affect the prediction of another time series when using PatchTST? Basically, I'm looking for spatio-temporal capabilities, similar to space-time-former. @kashif?

Basically, what I'm working on is improving forecasts by including other time series that can provide additional information.

kashif commented 1 year ago

The information from one time series affects the predictions only in the sense that a single model is trained over all the time series, just as an image classification model is trained over many independent images by putting the images in the batch dimension:

https://github.com/yuqinie98/PatchTST/blob/main/PatchTST_supervised/layers/PatchTST_backbone.py#L164

At inference time the predictions in PatchTST are also made independently for each time series, i.e. the emissions are independent too, but the learned weights used to make those predictions come from having seen all the time series:

https://github.com/yuqinie98/PatchTST/blob/main/PatchTST_supervised/layers/PatchTST_backbone.py#LL169

So again, the model will implicitly learn relationships between the different time series (i.e. spatio-temporal relationships) since it is a shared (or global) model, but the outputs are independent and are reshaped/concatenated as in Figure 1(a).
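A quick hypothetical check of that last point (illustrative module, not the repository's code): with a channel-independent forward pass, perturbing one channel's input leaves every other channel's output unchanged, even though the weights are shared across all channels.

```python
import torch
import torch.nn as nn

# Hypothetical check with illustrative shapes (not the repository's exact code).
nvars, patch_num, d_model = 7, 42, 128
layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
shared_encoder = nn.TransformerEncoder(layer, num_layers=2).eval()   # one set of weights for all channels

def channel_independent_forward(x):                     # x: [bs, nvars, patch_num, d_model]
    bs = x.shape[0]
    flat = x.reshape(bs * nvars, patch_num, d_model)    # fold channels into the batch dimension
    out = shared_encoder(flat)
    return out.reshape(bs, nvars, patch_num, d_model)   # unfold back, as in Figure 1(a)

x = torch.randn(1, nvars, patch_num, d_model)
x_perturbed = x.clone()
x_perturbed[:, 0] += 1.0                                # change channel 0 only

with torch.no_grad():
    y = channel_independent_forward(x)
    y_perturbed = channel_independent_forward(x_perturbed)

print(torch.allclose(y[:, 1:], y_perturbed[:, 1:]))     # True: the other channels are unaffected
```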

I have also implemented a probabilistic PatchTST here: https://github.com/awslabs/gluonts/pull/2748 with a few features not in the original model.

lesego94 commented 1 year ago

Thank you! I'll check out your modification as well.