How to use your approach for downstream forecasting tasks

StatMixedML commented 2 years ago

Summary

Thanks for making the code available. I really like the idea of first learning the embeddings in a self-supervised manner and then using a simpler model for forecasting. However, I am struggling to understand how to use the learned embeddings for the forecasting part.

Problem Description

Say you are tasked with forecasting a monthly univariate time series Y = (y1, ..., yT), which is historically available from January.2010 until December.2020. The task is to forecast 2021, with the forecasting horizon being h=12 months. Based on your framework, we are using the TCN-Encoder to learn the embeddings for January.2010 until December.2020. For training of the downstream forecasting model, say a Ridge Regression Model, we are using the final timestamp of the learned representations. So far so good.

@yuezhihan & @linytsysu My question is: given the representations and the trained Ridge model, how do we forecast 2021, since the data and hence the representations are available until the end of 2020 only? More specifically, what are the features that the Ridge model uses for forecasting 2021?

In your Paper, Section C.2 you state that

For each task, we only use the training set to train the representation model, and apply the model to the testing set to get representations

Does this mean you show the actual test-data to the model, create the representations/embeddings based on the test-data and then use these to fit the same test-data? Isn't this a simple interpolation of the test-data, using the representations instead of the actuals, rather than forecasting?

I highly appreciate your comments on this. Many thanks.

zhihanyue commented 2 years ago

Hope this pseudo code helps to clarify the training and inference process for forecasting.

repr_model = train_ts2vec(train_x)
train_repr = repr_model.inference(train_x)
ridge_model = train_ridge(train_repr, train_target)
test_repr = repr_model.inference(test_x)
test_pred = ridge_model.inference(test_repr)
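
The pseudo code above can be made concrete with a minimal runnable sketch. Everything in it is a stand-in rather than the repository's code: `toy_causal_encoder` is a hypothetical, parameter-free replacement for the trained TS2Vec model (so it has no training step), `ridge_fit` is a closed-form ridge regression written in NumPy, and `make_samples` is an assumed helper that pairs each timestamp's representation with the next `pred_len` values.

```python
import numpy as np

def toy_causal_encoder(x, alphas=(0.5, 0.2, 0.05)):
    """Hypothetical stand-in for repr_model.inference: row t depends only on x[:t+1]."""
    reps = np.zeros((len(x), len(alphas) + 1))
    reps[:, 0] = x
    for k, a in enumerate(alphas, start=1):
        ema = x[0]
        for t, v in enumerate(x):
            ema = a * v + (1 - a) * ema  # exponential moving average at scale a
            reps[t, k] = ema
    return reps

def make_samples(reps, series, pred_len):
    """Pair the representation at t with series[t+1 : t+1+pred_len]."""
    n = len(series)
    windows = np.lib.stride_tricks.sliding_window_view(series, pred_len)
    return reps[: n - pred_len], windows[1:]

def ridge_fit(X, Y, alpha=1.0):
    """Closed-form ridge regression with an unpenalised intercept."""
    x_mean, y_mean = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - x_mean, Y - y_mean
    W = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ Yc)
    return W, y_mean - x_mean @ W

# Synthetic series, split chronologically into train and test
rng = np.random.default_rng(0)
t = np.arange(600)
series = np.sin(2 * np.pi * t / 24) + 0.1 * rng.standard_normal(600)
train_x, test_x = series[:480], series[480:]
pred_len = 24

# train_repr = repr_model.inference(train_x); ridge_model = train_ridge(...)
X_train, Y_train = make_samples(toy_causal_encoder(train_x), train_x, pred_len)
W, b = ridge_fit(X_train, Y_train)

# test_repr = repr_model.inference(test_x); test_pred = ridge_model.inference(test_repr)
X_test, Y_test = make_samples(toy_causal_encoder(test_x), test_x, pred_len)
test_pred = X_test @ W + b

print(X_test.shape, test_pred.shape)
```

In this sketch the ridge weights are fitted on training rows only; test values enter solely as encoder inputs at inference time.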

StatMixedML commented 2 years ago

@yuezhihan Thanks for the clarification. I do have some follow-up questions on this.

1. Looking at the forecasting code, do you use the entire dataset (train+valid+test) to create the representations? If so, does this mean you show the actual test data to the embedding model?
2. Even though the test data is not used to train the Ridge model, the representations of the test set are used for forecasting. Hence, isn't there some sort of information leakage for the Ridge model, since it is using the learned representations of the test data as input features to forecast / interpolate the test data?
3. I still don't get how to forecast Jan 2021 - Dec 2021 using train data from January.2010 until December.2020 from the above example. Is there any way you can create a simulated time-series dataset and show how to forecast 2021 in an example notebook? I can also provide the data if needed.

I would greatly appreciate your help on this.

zhihanyue commented 2 years ago

@StatMixedML

1. We show test data to the embedding model at inference time, not during training. All code in the tasks folder is only used to evaluate the trained TS2Vec model.
2. No, because the embedding model and the ridge model are both trained on the training set only.
3. We apply causal inference. For instance, TS2Vec and the ridge model are trained with data before 2021-01-01. If we want a forecast for 2021-02-10 ~ 2021-02-15, the input data only includes timestamps before 2021-02-10. The trained TS2Vec model runs a forward pass on this input and produces representations for the data before 2021-02-10. "We use r_t, the representation of the last timestamp, to predict future observations." Therefore, the trained ridge model takes the representation at 2021-02-09 as input and forecasts 2021-02-10 ~ 2021-02-15.
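
Applied to the monthly example from the first comment (history from January 2010 to December 2020, h=12), the last point amounts to the following sketch. It rests on assumptions: the series is synthetic, `toy_causal_encoder` is a hypothetical stand-in for a trained causal encoder (it is not TS2Vec), and the ridge regression is solved in closed form with an arbitrary `alpha`.

```python
import numpy as np

def toy_causal_encoder(x, alphas=(0.5, 0.2, 0.05)):
    """Hypothetical stand-in for a trained causal encoder (not TS2Vec)."""
    reps = [x]
    for a in alphas:
        ema, out = x[0], []
        for v in x:
            ema = a * v + (1 - a) * ema
            out.append(ema)
        reps.append(np.array(out))
    return np.stack(reps, axis=1)  # shape (len(x), len(alphas) + 1)

n, h = 132, 12  # 11 years of monthly history, 12-month horizon
months = np.arange(n)
y = 10 + 0.05 * months + 2 * np.sin(2 * np.pi * months / 12)  # synthetic series

reps = toy_causal_encoder(y)

# Training pairs come only from history: representation at month t -> months t+1 .. t+h
X = reps[: n - h]
Y = np.lib.stride_tricks.sliding_window_view(y, h)[1:]

# Closed-form ridge regression with intercept (alpha is an arbitrary choice)
alpha = 1.0
x_mean, y_mean = X.mean(axis=0), Y.mean(axis=0)
W = np.linalg.solve((X - x_mean).T @ (X - x_mean) + alpha * np.eye(X.shape[1]),
                    (X - x_mean).T @ (Y - y_mean))
b = y_mean - x_mean @ W

# Forecast 2021: the only input is r_T, the representation at December 2020
r_last = reps[-1]
forecast_2021 = r_last @ W + b
print(forecast_2021.shape)  # (12,)
```

No 2021 value is needed anywhere: the features for the 2021 forecast are the single representation vector of the last observed month.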

StatMixedML commented 2 years ago

@yuezhihan Let me try to illustrate the problem. I am using the Electricity consumption as an example, where the aim is to forecast a univariate time series 24 steps ahead. First we load data according to the dataloader

# Load Data
task_type = "forecasting"
data, train_slice, valid_slice, test_slice, scaler, pred_lens, n_covariate_cols = datautils.load_forecast_csv("electricity", univar=True)

train_data = data[:, train_slice]
validation_data = data[:, valid_slice]
test_data = data[:, test_slice]

Investigating the shapes yields

# Data Shapes: n_time_series x n_timestamps x n_features
print(f'All Data:  {data.shape}')
print(f'Validation Data:  {validation_data.shape}')
print(f'Test Data:  {test_data.shape}')

All Data:  (1, 26304, 8)
Validation Data:  (1, 5261, 8)
Test Data:  (1, 5261, 8)

We can also plot the data, where each colour indicates the different data types: train/validation/test

We now train a model according to the implementation

# Train Model
config = dict(batch_size=8,
              lr=0.001,
              output_dims=320,
             )

model = TS2Vec(input_dims=train_data.shape[-1],
               device=0,
               **config
              )

loss_log = model.fit(train_data,
                     verbose=False
                    )

Based on the script, we can now train a Ridge Model and forecast 24 steps ahead. The following shows the evaluation

# Forecast and evaluate model
pred_lens = [24]
out, eval_res = tasks.eval_forecasting(model, data, train_slice, valid_slice, test_slice, scaler, pred_lens, n_covariate_cols)
eval_res
{'ours': {24: {'norm': {'MSE': 0.2507473551759355, 'MAE': 0.2857496451016042},
   'raw': {'MSE': 147.02915179062174, 'MAE': 6.919412921689274}}},
 'ts2vec_infer_time': 11.58575439453125,
 'lr_train_time': {24: 0.5190908908843994},
 'lr_infer_time': {24: 0.0}}

So far so good. Now comes the part that I don’t understand. Using the forecast script, I go through each step individually to highlight the parts I have some difficulties with. We use the entire dataset to encode the time series information. The causal part takes care of no information leakage, as I understand it.

# Encode
import time
from tasks import _eval_protocols as eval_protocols
from tasks.forecasting import cal_metrics, generate_pred_samples
pred_len = pred_lens[0]
padding = 200

all_repr = model.encode(
    data,
    casual=True,         # casual: When this param is set to True, the future informations would not be encoded into representation of each timestamp.
    sliding_length=1,
    sliding_padding=padding,
    batch_size=256
)
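
One way to convince oneself that a causal encoder does not leak the future is to perturb values after some cut-off and confirm that the representations before the cut-off are unchanged. The sketch below does this for a hypothetical stand-in encoder (`toy_causal_encoder`, built from exponential moving averages); it does not call TS2Vec. The same kind of check could in principle be run on the output of `model.encode`, which is not done here.

```python
import numpy as np

def toy_causal_encoder(x, alphas=(0.5, 0.2, 0.05)):
    """Hypothetical causal encoder: row t is computed from x[:t+1] only."""
    reps = np.zeros((len(x), len(alphas) + 1))
    reps[:, 0] = x
    for k, a in enumerate(alphas, start=1):
        ema = x[0]
        for t, v in enumerate(x):
            ema = a * v + (1 - a) * ema
            reps[t, k] = ema
    return reps

rng = np.random.default_rng(0)
x = rng.standard_normal(200)

cut = 150
x_changed = x.copy()
x_changed[cut:] += 100.0  # alter only the "future" part of the series

r_original = toy_causal_encoder(x)
r_changed = toy_causal_encoder(x_changed)

# Representations before the cut-off must be identical, later ones must differ
past_unchanged = np.array_equal(r_original[:cut], r_changed[:cut])
future_differs = not np.allclose(r_original[cut:], r_changed[cut:])
print(past_unchanged, future_differs)  # True True
```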

Following your implementation, we now go step by step: we train a Ridge model, evaluate it on the validation data to select the optimal alpha, and then use the embeddings of the test data as model inputs to forecast 24 steps ahead

# https://github.com/yuezhihan/ts2vec/blob/main/tasks/forecasting.py#L46
train_repr = all_repr[:, train_slice]
valid_repr = all_repr[:, valid_slice]
test_repr = all_repr[:, test_slice]

train_data = data[:, train_slice, n_covariate_cols:]
valid_data = data[:, valid_slice, n_covariate_cols:]
test_data = data[:, test_slice, n_covariate_cols:]

ours_result = {}
lr_train_time = {}
lr_infer_time = {}
out_log = {}

train_features, train_labels = generate_pred_samples(train_repr, train_data, pred_len, drop=padding)
valid_features, valid_labels = generate_pred_samples(valid_repr, valid_data, pred_len)
test_features, test_labels = generate_pred_samples(test_repr, test_data, pred_len)

t = time.time()
lr = eval_protocols.fit_ridge(train_features, train_labels, valid_features, valid_labels)
lr_train_time[pred_len] = time.time() - t

t = time.time()
test_pred = lr.predict(test_features)
lr_infer_time[pred_len] = time.time() - t

ori_shape = test_data.shape[0], -1, pred_len, test_data.shape[2]
test_pred = test_pred.reshape(ori_shape)
test_labels = test_labels.reshape(ori_shape)

if test_data.shape[0] > 1:
    test_pred_inv = scaler.inverse_transform(test_pred.swapaxes(0, 3)).swapaxes(0, 3)
    test_labels_inv = scaler.inverse_transform(test_labels.swapaxes(0, 3)).swapaxes(0, 3)
else:
    test_pred_inv = scaler.inverse_transform(test_pred)
    test_labels_inv = scaler.inverse_transform(test_labels)

out_log[pred_len] = {
    'norm': test_pred,
    'raw': test_pred_inv,
    'norm_gt': test_labels,
    'raw_gt': test_labels_inv
}
ours_result[pred_len] = {
    'norm': cal_metrics(test_pred, test_labels),
    'raw': cal_metrics(test_pred_inv, test_labels_inv)
}

eval_res = {
    'ours': ours_result,
    'lr_train_time': lr_train_time,
    'lr_infer_time': lr_infer_time
}
eval_res
{'ours': {24: {'norm': {'MSE': 0.2507473551759355, 'MAE': 0.2857496451016042},
   'raw': {'MSE': 147.02915179062174, 'MAE': 6.919412921689274}}},
 'lr_train_time': {24: 0.5760939121246338},
 'lr_infer_time': {24: 0.0}}

Comparing the results with the above shows they do match. Now here is where I struggle: Using the generate_pred_samples function for the test_data yields the following shapes

test_features, test_labels = generate_pred_samples(test_repr, test_data, pred_len)
print(f'Test Features:  {test_features.shape}')
print(f'Test Labels:  {test_labels.shape}')
Test Features:  (5237, 320)
Test Labels:  (1, 5237, 24, 1)

The shape of the test features makes sense, since 5261 - 24 (pred_len) = 5237. However, for the test labels we also have 5237 timestamps, each with the length of the forecasting horizon. How can that be? Shouldn’t we rather have a shape of (1,1,24,1), since we want to forecast 24 steps ahead? I don’t understand how we can have 5237x24 forecasts.
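
The count of 5237 follows from simple index arithmetic if every timestamp that still has 24 future values is treated as its own forecast origin. The helper below is a hypothetical illustration (`rolling_origins` is not part of the repository) that lists, for each sample, the index of the feature timestamp and the first and last index of its label window.

```python
def rolling_origins(n, pred_len):
    """Return (feature_index, first_label_index, last_label_index) for every sample."""
    return [(j, j + 1, j + pred_len) for j in range(n - pred_len)]

origins = rolling_origins(5261, 24)
print(len(origins))  # 5237
print(origins[0])    # (0, 1, 24)
print(origins[-1])   # (5236, 5237, 5260)
```

Each of the 5237 rows is a separate 24-step forecast, which is why the labels have shape 5237 x 24 rather than 1 x 24.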

Showing some forecasts yields this

plt.plot(test_pred[0,1,:,0], color="green")
plt.plot(test_labels[0,1,:,0], color="red")

plt.plot(test_pred[0,2,:,0], color="green")
plt.plot(test_labels[0,2,:,0], color="red")

It seems as if this is a rolling forecast evaluation consisting of 5237 forecasts, each 24 steps ahead. Indeed, plotting a single step of the 24-step horizon (index 1) across all 5237 samples yields

plt.plot(test_pred[0,:,1,0], color="green")
plt.plot(test_labels[0,:,1,0], color="red")

Another example of the 10-th entry

plt.plot(test_pred[0,:,10,0], color="green")
plt.plot(test_labels[0,:,10,0], color="red")

In fact, I checked the generate_pred_samples function that decides on the shape of the train/valid/test set. What strikes me is this

def generate_pred_samples(features, data, pred_len, drop=0):
    n = data.shape[1]
    features = features[:, :-pred_len]  # remove features from prediction horizon, to avoid leakage
    labels = np.stack([ data[:, i:1+n+i-pred_len] for i in range(pred_len)], axis=2)[:, 1:]  # slice ends: 0: 5238 / 1: 5239 / 2: 5240 / ... / 23: 5261
    features = features[:, drop:]
    labels = labels[:, drop:]
    return features.reshape(-1, features.shape[-1]), \
            labels.reshape(-1, labels.shape[2]*labels.shape[3])

While features[:, :-pred_len] makes sense, since it removes the 24-last features from the test set, labels = np.stack([ data[:, i:1+n+i-pred_len] for i in range(pred_len)], axis=2)[:, 1:] is what I don’t understand. Running the code that generates the labels yields this

for i in range(pred_len):
    print(f'{i}: {1+n+i-pred_len}')
0: 5238
1: 5239
2: 5240
3: 5241
4: 5242
5: 5243
6: 5244
7: 5245
8: 5246
9: 5247
10: 5248
11: 5249
12: 5250
13: 5251
14: 5252
15: 5253
16: 5254
17: 5255
18: 5256
19: 5257
20: 5258
21: 5259
22: 5260
23: 5261

We generate 24 slices of the data, each with length of 5237. Now my question is: since we keep all features, except the last 24, how can we evaluate the model on the first 0:5238 labels since this is exactly overlapping with the set of features? Same goes with 1:5239, …, since this exactly overlaps with the features?

The point is hence: isn’t there information leakage for the evaluation?

How can we have 5237x24 forecasts?

Why do we compare the 0:5238, 1:5239 labels with the actuals if the set of features used to create the forecasts is based on 0:5238?

zhihanyue commented 2 years ago

@StatMixedML

This is a rolling evaluation. For each timestamp, we need to evaluate its 24-step-ahead forecasting performance. 5237 is the number of samples. For each sample, "x" is the 320-length embedding at the previous timestamp and "y" is the next 24 timepoints. Therefore x has a shape of (5237, 320), and y has a shape of (5237, 24). This evaluation protocol is the same as the one used by Informer.

np.stack([ data[:, i:1+n+i-pred_len] for i in range(pred_len)], axis=2)[:, 1:] is to get pred_len ahead timepoints for each timepoint. You can try

data = np.arange(100)[np.newaxis, :]
pred_len = 10
np.stack([ data[:, i:1+100+i-pred_len] for i in range(pred_len)], axis=2)[:, 1:]

Out:

array([[[ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10],  # sample 1: given 0, forecast 1 to 10
        [ 2,  3,  4,  5,  6,  7,  8,  9, 10, 11],  # sample 2: given 0 1, forecast 2 to 11
        [ 3,  4,  5,  6,  7,  8,  9, 10, 11, 12],  # sample 3: given 0 1 2, forecast 3 to 12
        [ 4,  5,  6,  7,  8,  9, 10, 11, 12, 13],  # ......
        [ 5,  6,  7,  8,  9, 10, 11, 12, 13, 14],
        [ 6,  7,  8,  9, 10, 11, 12, 13, 14, 15],
        [ 7,  8,  9, 10, 11, 12, 13, 14, 15, 16],
        [ 8,  9, 10, 11, 12, 13, 14, 15, 16, 17],
        [ 9, 10, 11, 12, 13, 14, 15, 16, 17, 18],
        [10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
        [11, 12, 13, 14, 15, 16, 17, 18, 19, 20],
        [12, 13, 14, 15, 16, 17, 18, 19, 20, 21],
        [13, 14, 15, 16, 17, 18, 19, 20, 21, 22],
        [14, 15, 16, 17, 18, 19, 20, 21, 22, 23],
        [15, 16, 17, 18, 19, 20, 21, 22, 23, 24],
        [16, 17, 18, 19, 20, 21, 22, 23, 24, 25],
......
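
The same label windows can also be written with NumPy's `sliding_window_view`, which may be easier to read than the stacked slices. This is only an equivalent reformulation for the 2-D toy array used above (it requires NumPy 1.20 or later), not a change proposed for the repository.

```python
import numpy as np

data = np.arange(100)[np.newaxis, :]
pred_len = 10
n = data.shape[1]

# Expression from the reply above
stacked = np.stack([data[:, i:1 + n + i - pred_len] for i in range(pred_len)], axis=2)[:, 1:]

# All length-pred_len windows along the time axis, dropping the window that starts at 0
windowed = np.lib.stride_tricks.sliding_window_view(data, pred_len, axis=1)[:, 1:]

print(stacked.shape)                      # (1, 90, 10)
print(np.array_equal(stacked, windowed))  # True
print(stacked[0, -1])                     # last sample: forecast 90 to 99
```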

StatMixedML commented 2 years ago

@yuezhihan Thanks for the detailed answer. Things are getting clearer now.

If I understand you correctly, based on the rolling evaluation, we use

features[0:1,320] to forecast labels[1,24]
features[0:2,320] to forecast labels[2,24]
.
.
.
features[0:5237,320] to forecast labels[5237,24]

Does this mean,

1. Do we increase the length of the features by 1 at each step, or is the set of features always the embedding at t-1, just before the forecast evaluation?
2. What I still don't understand is features = features[:, :-pred_len], where we remove the last 24 (=pred_len) observations from the set of features and train the Ridge model on these. However, we start the evaluation of the forecasts already at an overlapping interval (0:5237) with the set of features, since only the last 24 are removed (5238:5261). Hence, the ridge model sees all the (0:5237) time series embeddings from the evaluation period. Is this correct?
3. Is each of the forecasts a 1-step-ahead forecast, or does your approach allow using the t-1 embedding to directly forecast the full 24-step-ahead sequence, without the need to have all in-between time-series embeddings available?

zhihanyue commented 2 years ago

The 5237 samples are independent. Ridge learns a "320 => 24" mapping.

features[0,0:320] to forecast labels[0,0:24]
features[1,0:320] to forecast labels[1,0:24]
.
.
.
features[5236,0:320] to forecast labels[5236,0:24]

where

features[i] = ts2vec_embed(ori_time_series[:i])
labels[i] = (ori_time_series[i], ori_time_series[i+1], ..., ori_time_series[i+23])
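
The alignment can be checked directly by running the `generate_pred_samples` function quoted earlier in this thread on toy inputs where both the value and the "representation" at timestamp t are simply t. The toy arrays are assumptions made for illustration; a real representation at t would come from a causal encoder that has seen the series up to and including t.

```python
import numpy as np

def generate_pred_samples(features, data, pred_len, drop=0):
    # Copied from the snippet quoted earlier in this thread
    n = data.shape[1]
    features = features[:, :-pred_len]
    labels = np.stack([data[:, i:1+n+i-pred_len] for i in range(pred_len)], axis=2)[:, 1:]
    features = features[:, drop:]
    labels = labels[:, drop:]
    return features.reshape(-1, features.shape[-1]), \
            labels.reshape(-1, labels.shape[2]*labels.shape[3])

n, pred_len = 30, 5
data = np.arange(n, dtype=float).reshape(1, n, 1)      # value at timestamp t is t
features = np.arange(n, dtype=float).reshape(1, n, 1)  # "representation" at t is just t

X, Y = generate_pred_samples(features, data, pred_len)
print(X.shape, Y.shape)  # (25, 1) (25, 5)
print(X[0], Y[0])        # [0.] [1. 2. 3. 4. 5.]
print(X[-1], Y[-1])      # [24.] [25. 26. 27. 28. 29.]

# Every label lies strictly after the timestamp of its feature row
no_overlap = all(Y[j].min() > X[j, 0] for j in range(len(X)))
print(no_overlap)        # True
```

Up to a shift of the index by one, this is the relationship written above: row j uses the representation of the data seen so far and is scored against the following `pred_len` values.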

StatMixedML commented 2 years ago

@yuezhihan Many thanks for your great and detailed explanations, very much appreciated!! I have now a much better understanding.

If I may raise another question: based on your code example

features[0,0:320] to forecast labels[0,0:24]
features[1,0:320] to forecast labels[1,0:24]

you are training the Ridge model on a single timestamp representation only, assuming that the TCN encoder summarised all historical information in this single representation. I was wondering if you have tested also including more timestamp representations for Ridge model training, similar to the context length used in deep-learning models for time series forecasting, something like

features[-10:0,0:320] to forecast labels[0,0:24]
features[-9:1,0:320] to forecast labels[1,0:24]
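
A minimal sketch of what such context-window features could look like, assuming the per-timestamp representations are available as an array of shape (n_timestamps, dims). `stack_context` is a hypothetical helper and not part of TS2Vec; it concatenates the last `context_len` representations into one feature row.

```python
import numpy as np

def stack_context(reps, context_len):
    """Row j is the concatenation of reps[j], ..., reps[j + context_len - 1]."""
    # sliding_window_view appends the window axis last: (n - c + 1, dims, c)
    windows = np.lib.stride_tricks.sliding_window_view(reps, context_len, axis=0)
    return windows.transpose(0, 2, 1).reshape(len(windows), -1)

reps = np.arange(12).reshape(6, 2)  # 6 timestamps, 2-dimensional toy representations
ctx = stack_context(reps, context_len=3)
print(ctx.shape)  # (4, 6)
print(ctx[0])     # [0 1 2 3 4 5]
print(ctx[-1])    # [ 6  7  8  9 10 11]
```

With 320-dimensional representations and a context of 10, each Ridge input would have 3200 features, and the labels for row j would have to start at timestamp j + context_len.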

zhihanyue commented 2 years ago

No, because the proposed hierarchical contrasting forces the TCN to encode features at various scales. Many previous works specify context window sizes. However, different context window lengths may suit different datasets. We want to show that we can achieve better performance adaptively, without hyper-parameter optimization.

StatMixedML commented 2 years ago

@yuezhihan I applied your approach to the M5 forecasting competition data, for the L-10 hierarchy. However, I see some flat forecasts that are not adaptive at all to the series behaviour

Also, a simple LightGBM model for a randomly selected series easily outperforms your approach

where the black line is the original series, the red line is the TS2Vec forecast and the orange line is the LightGBM forecast. It would be great if you could double-check the way I use your approach based on the attached self-contained notebook.

Very much appreciated.

Attachment: TS2Vec_M5_Example.zip

StatMixedML commented 2 years ago

@yuezhihan Kindly asking for an update on this. Thanks

ratheile commented 2 years ago

@StatMixedML Hi, did you experiment further with this model? Since OP did not respond, I would kindly ask you if you could get reasonable results.

StatMixedML commented 2 years ago

I haven't pursued using the approach any further since I haven't received feedback from @yuezhihan.

@ratheile Do you have some experience with using the approach in real out-of-sample evaluations?

ratheile commented 2 years ago

Hi @StatMixedML , I am currently exploring this method as a preprocessor for downstream tasks. My data shows strong periodic (daily / weekly / yearly) trends so I do not expect a complete out-of-sample prediction as the expectation is that most of the data is in-distribution. (I assume you question the generalization of this approach)

I'll report back if the method is useful for this scenario!

kwuking commented 2 years ago

@StatMixedML Did you apply this method to the M5 data and find that it does not perform well? I am currently using this method and also have some doubts about the performance of TS2Vec. I have tried other datasets and found that it does not outperform common time series forecasting algorithms such as LightGBM (mentioned above) or DeepAR.

kwuking commented 2 years ago

@ratheile Are you currently using this method for forecasting, and have you gotten good results? I hope to see your report.

StatMixedML commented 2 years ago

@kwuking Even though I really like the approach of first learning time series representations in a self-supervised manner and then using a simple model for forecasting, I see mixed results when it comes to real-world applications, e.g., using the M5 dataset. Because of this, I haven't used the model any further.

Yet, there are other implementations using a similar approach

All of them report improvements over fully end-to-end trained networks, so I believe there is some benefit in using these models. Yet, I am not sure if all of them are evaluated in a truly out-of-sample manner, since the decoder uses the test data to create the representations. Even though they use causal TCNs, where the future values are masked, this is not fully out-of-sample, since the future is commonly unknown.

kwuking commented 2 years ago

@StatMixedML I double-checked the train and test code to make sure everything is evaluated in a truly out-of-sample manner. There is no doubt that the model's encode step is evaluated correctly, guaranteeing that no future data is used. The real issue is that the downstream forecasting model (such as ridge or a simple MLP) still gets reasonable results even from an untrained TS2Vec model. This suggests that the representation learned by the model does not contain information that helps the downstream task. Furthermore, I analyzed the TS2Vec representations with a scatter plot. There are indeed distinct clusters in the plot, indicating that some differences among time series have been learned, but the information learned so far does not help downstream forecasting. Based on this analysis, one of my thoughts is that the model needs to be fine-tuned on the downstream prediction task to get good results. I'm running a related experiment and will report back once I have the results.

The following is a representation during training, as shown by TensorBoard

StatMixedML commented 1 year ago

@kwuking, @ratheile Have you been able to get some good results using the model?

zifei-yu commented 1 year ago

Hello, I encountered a problem with inconsistent data dimensions when using ETTh1 data for forecasting. Could you tell me how to solve it? Thank you very much.

kwuking commented 1 year ago

@StatMixedML In fact, I tried several improved methods based on TS2Vec, but did not achieve the desired results. I recently looked into recent papers such as TimesNet, PatchTST, DLinear, etc. These papers indicate that the gain from better representations is actually very limited for prediction. In particular, the TimesNet paper ran ablation experiments and found that for prediction tasks the most effective component is the final prediction layer, not the preceding representation extraction layers, so I have given up on this type of method for prediction tasks. As far as forecasting is concerned, end-to-end forecasting models are the right path for now.

kwuking commented 1 year ago

@zifei-yu What exactly is the problem?

StatMixedML commented 1 year ago

@kwuking Thanks for your comments.

I still believe that self-supervised approaches offer some benefits over a fully end-to-end trained model. Since TS2Vec uses a relatively "simple" Ridge model as the "final layer" that creates the forecasts, using a non-linear "prediction layer" that also models interactions, as end-to-end models do, might increase accuracy. Sure, it is not a "layer" as in a deep learning model. Rather, it should be a more realistic model that can capture non-linearities and interactions.

@kwuking You mentioned that you've tried other models. Can you be more specific on this?

The problem is that we cannot easily disentangle the quality of the embeddings from the quality of the forecast model, i.e., Ridge. The only way we can test it is to fix the embeddings and to experiment with different forecasting models.
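
Such an experiment could be organised as below: the embeddings are computed once and frozen, and only the forecasting head changes. This is a sketch under loud assumptions. The "embeddings" are random stand-ins rather than TS2Vec outputs, the targets are synthetic, and the two heads (closed-form ridge and a k-nearest-neighbour regressor, both written in NumPy) are arbitrary examples of a linear and a non-linear model. The scores say nothing about TS2Vec itself.

```python
import numpy as np

def ridge_head(X_tr, Y_tr, X_te, alpha=1.0):
    """Linear head: closed-form ridge regression with intercept."""
    x_mean, y_mean = X_tr.mean(axis=0), Y_tr.mean(axis=0)
    Xc = X_tr - x_mean
    W = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X_tr.shape[1]), Xc.T @ (Y_tr - y_mean))
    return (X_te - x_mean) @ W + y_mean

def knn_head(X_tr, Y_tr, X_te, k=5):
    """Non-linear head: average the targets of the k nearest training embeddings."""
    d2 = ((X_te[:, None, :] - X_tr[None, :, :]) ** 2).sum(axis=-1)
    nearest = np.argsort(d2, axis=1)[:, :k]
    return Y_tr[nearest].mean(axis=1)

def mse(pred, target):
    return float(((pred - target) ** 2).mean())

rng = np.random.default_rng(0)
emb = rng.standard_normal((400, 8))  # stand-in for frozen embeddings
target = np.stack([np.sin(emb[:, 0]) + emb[:, 1] * emb[:, 2], emb[:, 3] ** 2], axis=1)

split = 300  # chronological split: earlier rows train, later rows test
heads = {"ridge": ridge_head, "knn": knn_head}
scores = {
    name: mse(head(emb[:split], target[:split], emb[split:]), target[split:])
    for name, head in heads.items()
}
print(scores)
```

Because the embeddings are identical for every head, any difference in the scores is attributable to the forecasting model alone.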

Other forecasting models, such as DeepAR, enrich the data with time-dependent features, such as month, year, trend, etc. and also include lagged y-features. I know this takes TS2Vec a bit too far. But given that it is mostly trained with the time series themselves, TS2Vec gives a decent forecast accuracy already.

zhihanyue / ts2vec

How to use your approach for downstream forecasting tasks #20
