zalandoresearch / pytorch-ts

PyTorch-based probabilistic time series forecasting framework built on the GluonTS backend
MIT License

Scalable Representation Learning for Multivariate Time Series #8

Open StatMixedML opened 4 years ago

StatMixedML commented 4 years ago

Summary

It would be great to add Scalable Representation Learning for Multivariate Time Series as an additional feature input to the models implemented in PyTorch-TS.

Potential Benefits

Both univariate and multivariate models can potentially benefit from this approach, since it brings additional information into the model for prediction tasks. The embeddings can also be used for classification tasks to better understand the data at hand.

Description

The basic idea is to learn embeddings of time series from which similarities between the time series can be derived. The objective is to ensure that similar time series obtain similar representations, which can then be used as an input for modelling. As with image embeddings, the learned representations may also be used to define a meaningful similarity measure between time series, e.g., by comparing the distance between their representations, possibly combined with dimensionality reduction and/or clustering.
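As a minimal sketch of this idea, the snippet below uses a toy convolutional encoder as a stand-in for the paper's causal dilated CNN; the architecture, embedding size, and names here are illustrative assumptions, not part of PyTorch-TS:

```python
import torch
import torch.nn.functional as F

class Encoder(torch.nn.Module):
    """Toy encoder mapping a (batch, n_channels, length) series of arbitrary
    length to a fixed-size vector via global max pooling."""
    def __init__(self, n_channels: int, emb_dim: int = 64):
        super().__init__()
        self.conv = torch.nn.Conv1d(n_channels, emb_dim, kernel_size=3, padding=1)

    def forward(self, x):
        h = torch.relu(self.conv(x))        # (batch, emb_dim, length)
        return h.max(dim=-1).values         # (batch, emb_dim)

encoder = Encoder(n_channels=1)
series_a, series_b = torch.randn(1, 1, 200), torch.randn(1, 1, 300)

# Similarity between two series = distance between their fixed-size representations.
emb_a, emb_b = encoder(series_a), encoder(series_b)
distance = 1.0 - F.cosine_similarity(emb_a, emb_b).item()
```

Because the pooling is global, series of different lengths map to vectors of the same dimension, which is what makes downstream clustering or distance comparisons straightforward.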

The criterion used to select pairs of similar time series follows word2vec's intuition. For word embeddings, the representation of the context of a word should, on the one hand, be close to the representation of that word and, on the other hand, distant from the representations of randomly chosen words, since those are probably unrelated to the original word's context. The corresponding loss then pushes pairs of (context, word) and (context, random word) to be linearly separable. This is called negative sampling and can be visualized as follows:

[Figure: triplet loss schematic showing an anchor pulled towards a positive example and pushed away from a negative example.]

The loss minimizes the distance between an anchor and a positive, both of which have the same identity, and maximizes the distance between the anchor and a negative of a different identity.

To adapt this principle to time series, one can consider a random subseries x_ref of a given time series y_i. Then, on the one hand, the representation of x_ref should be close to that of any of its own subseries x_pos (a positive example). On the other hand, if one considers another subseries x_neg (a negative example) chosen at random (in a different random time series y_j if several series are available, or in the same time series if it is long enough and not stationary), then its representation should be distant from that of x_ref. Following the analogy with word2vec, x_pos corresponds to a word, x_ref to its context, and x_neg to a random word. To improve the stability and convergence of the training procedure, as well as the quality of the learned representations, one can introduce, as in word2vec, several negative samples (x_neg_k) chosen independently at random; a possible sampling routine is sketched after the figure below.

[Figure: illustration of how x_ref, x_pos, and the x_neg_k subseries are chosen from the time series.]
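The routine below samples such a tuple from a list of (n_channels, length) arrays. The exact length distributions are an assumption for illustration (the paper, for instance, draws each negative subseries with the same length as x_pos), so treat this as a sketch rather than the reference implementation:

```python
import numpy as np

def sample_triplet(dataset, n_negatives=5, rng=np.random.default_rng()):
    """Sample (x_ref, x_pos, [x_neg_1, ..., x_neg_K]) from a list of
    (n_channels, length) arrays, following the scheme described above."""
    y_i = dataset[rng.integers(len(dataset))]
    T = y_i.shape[-1]

    # x_ref: random subseries of y_i
    len_ref = rng.integers(2, T + 1)
    start_ref = rng.integers(0, T - len_ref + 1)
    x_ref = y_i[:, start_ref:start_ref + len_ref]

    # x_pos: random subseries of x_ref (positive example)
    len_pos = rng.integers(1, len_ref + 1)
    start_pos = rng.integers(0, len_ref - len_pos + 1)
    x_pos = x_ref[:, start_pos:start_pos + len_pos]

    # x_neg_k: random subseries of randomly chosen (possibly other) series
    x_negs = []
    for _ in range(n_negatives):
        y_j = dataset[rng.integers(len(dataset))]
        len_neg = rng.integers(1, y_j.shape[-1] + 1)
        start_neg = rng.integers(0, y_j.shape[-1] - len_neg + 1)
        x_negs.append(y_j[:, start_neg:start_neg + len_neg])
    return x_ref, x_pos, x_negs
```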

The loss pushes the computed representations to distinguish between x_ref and x_neg, and to assimilate x_ref and x_pos. Overall, the training procedure consists of iterating over the training dataset for several epochs (possibly using mini-batches), picking tuples (x_ref, x_pos, (x_neg_k)) at random and performing a minimization step on the corresponding loss for each tuple, until training ends.
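A rough PyTorch sketch of this loop, reusing the toy `Encoder` and `sample_triplet` helpers from the sketches above; the loss follows the word2vec-style formulation -log σ(f(x_ref)ᵀ f(x_pos)) - Σ_k log σ(-f(x_ref)ᵀ f(x_neg_k)), and `dataset` / `n_steps` are placeholders rather than anything from PyTorch-TS:

```python
import numpy as np
import torch
import torch.nn.functional as F

def negative_sampling_loss(encoder, x_ref, x_pos, x_negs):
    """Pull f(x_ref) towards f(x_pos) and push it away from each f(x_neg_k),
    using dot products and a logistic (sigmoid) loss."""
    z_ref = encoder(x_ref)                                       # (1, emb_dim)
    loss = -F.logsigmoid((z_ref * encoder(x_pos)).sum(dim=-1))
    for x_neg in x_negs:
        loss = loss - F.logsigmoid(-(z_ref * encoder(x_neg)).sum(dim=-1))
    return loss.mean()

# Placeholder corpus and step count; `encoder` and `sample_triplet` come from above.
dataset = [np.random.randn(1, 500) for _ in range(10)]
n_steps = 100
to_tensor = lambda a: torch.as_tensor(a, dtype=torch.float32).unsqueeze(0)

optimizer = torch.optim.Adam(encoder.parameters(), lr=1e-3)
for _ in range(n_steps):
    x_ref, x_pos, x_negs = sample_triplet(dataset)
    loss = negative_sampling_loss(
        encoder, to_tensor(x_ref), to_tensor(x_pos), [to_tensor(x) for x in x_negs]
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```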

Some Initial Comments

The approach would need to be a two-step procedure:

1. Train the encoder in an unsupervised fashion on the available time series to obtain a fixed-size representation (embedding) for each series.
2. Feed the learned (frozen) representations as additional feature inputs into the forecasting models implemented in PyTorch-TS, as sketched below.
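Purely as an illustration of step 2, the frozen embedding of a series could be broadcast along the time axis and concatenated to the model inputs as extra channels; how PyTorch-TS would actually consume such features is an open design question, and nothing below reflects its existing API:

```python
import torch

with torch.no_grad():
    series = torch.randn(1, 1, 500)                 # one (n_channels, length) series
    static_embedding = encoder(series)              # (1, emb_dim), from step 1

context = series[:, :, -100:]                       # recent history fed to the forecaster
# Broadcast the embedding along the time axis and concatenate channel-wise,
# so every time step carries the series-level representation.
expanded = static_embedding.unsqueeze(-1).expand(-1, -1, context.shape[-1])
model_input = torch.cat([context, expanded], dim=1)  # (1, n_channels + emb_dim, 100)
```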

References

Unsupervised Scalable Representation Learning for Multivariate Time Series (Franceschi, Dieuleveut & Jaggi, NeurIPS 2019): https://arxiv.org/abs/1901.10738