zalandoresearch / pytorch-ts

PyTorch-based probabilistic time series forecasting framework, built on the GluonTS backend
MIT License

Where to place m5 data? #28

Closed by NielsRogge 3 years ago

NielsRogge commented 3 years ago

When I want to use

from pts.dataset.repository import get_dataset

dataset = get_dataset("m5", regenerate=False)

I get an error saying that the files from Kaggle are not present in the expected directory: RuntimeError: M5 data is available on Kaggle (https://www.kaggle.com/c/m5-forecasting-accuracy/data). You first need to agree to the terms of the competition before being able to download the data. After you have done that, please copy the files into /root/.pytorch/pytorch-ts/datasets/m5.

However, I have no idea where to put these files. I'm working in Google Colab, where the file browser starts at /content. Should I create the .pytorch/pytorch-ts/datasets/m5 directory myself?

NielsRogge commented 3 years ago

This is probably a hidden directory. When I download the solar dataset, it says the following

saving time-series into /root/.pytorch/pytorch-ts/datasets/solar_nips/train/data.json
saving time-series into /root/.pytorch/pytorch-ts/datasets/solar_nips/test/data.json

However, these directories are not shown in the file explorer, nor can they be accessed from the command line.
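
(For reference: the files are written under /root, while Colab's file browser starts at /content. Dot-prefixed directories are merely hidden, so a shell cell should still be able to list them, a minimal check:)

# dot-prefixed paths are hidden from the Colab sidebar, not gone
!ls -a /root
!ls /root/.pytorch/pytorch-ts/datasets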

kashif commented 3 years ago

Right... so one could perhaps use shell commands within Jupyter to create the appropriate directories and download the Kaggle files...
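
For instance, something along these lines in a Colab cell (a minimal sketch, assuming the kaggle CLI is installed and an API token is configured under ~/.kaggle/kaggle.json):

# create the directory the error message asks for
!mkdir -p /root/.pytorch/pytorch-ts/datasets/m5
# download and unpack the competition files (requires accepting the terms on Kaggle first)
!kaggle competitions download -c m5-forecasting-accuracy -p /root/.pytorch/pytorch-ts/datasets/m5
!unzip -o /root/.pytorch/pytorch-ts/datasets/m5/m5-forecasting-accuracy.zip -d /root/.pytorch/pytorch-ts/datasets/m5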

Also note that there was a bug in the m5 generation; I pushed a fix to master but still need to cut a new release... so kindly use git master for now.

NielsRogge commented 3 years ago

Ok, I ran the feature engineering (as defined in _m5.py) myself in a notebook, which created the JSON files. As I want to use TransformerTempFlow on the M5 dataset, I group the data as follows:

from pts.dataset.utils import load_datasets
from pts.dataset import MultivariateGrouper

metadata = "path_to_metadata.json"
train = "path_to_train/data.json"
test = "path_to_test/data.json"

# read in the JSON files
dataset = load_datasets(metadata, train, test, shuffle=False)

# group the series
train_grouper = MultivariateGrouper(max_target_dim=int(len(dataset.train)))

test_grouper = MultivariateGrouper(num_test_dates=1, 
                                   max_target_dim=int(len(dataset.train)))

dataset_train = train_grouper(dataset.train)
dataset_test = test_grouper(dataset.test)

When I run

entry = next(iter(dataset_train))
entry["target"].shape

this prints (30490, 1913), i.e. all 30,490 M5 series grouped into a single multivariate series of length 1913.

So when I then want to train the model as follows:

import torch
from pts.feature.time_feature import DayOfWeek, DayOfMonth, DayOfYear, MonthOfYear, WeekOfYear
from pts.model.transformer_tempflow import TransformerTempFlowEstimator
from pts import Trainer

# time features to be used
time_features = [DayOfWeek(), DayOfMonth(), DayOfYear(), MonthOfYear(), WeekOfYear()]

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# the transformer temp flow model does not currently support additional real/cat features
transformer_estimator = TransformerTempFlowEstimator(
    d_model=16,
    num_heads=4,
    input_size=40,
    target_dim=int(len(dataset.train)),
    prediction_length=dataset.metadata.prediction_length,
    context_length=dataset.metadata.prediction_length*4,
    flow_type='MAF',
    dequantize=True,
    freq=dataset.metadata.freq,
    time_features=time_features,
    trainer=Trainer(
        device=device,
        epochs=14,
        learning_rate=1e-3,
        num_batches_per_epoch=100,
        batch_size=10,
        num_workers=0
    )
)
transformer_predictor = transformer_estimator.train(dataset_train)

I am getting the following error:

RuntimeError                              Traceback (most recent call last)
<ipython-input-9-cc847c57e271> in <module>()
----> 1 transformer_predictor = transformer_estimator.train(dataset_train)

7 frames
/usr/local/lib/python3.6/dist-packages/pts/model/estimator.py in train(self, training_data)
    146 
    147     def train(self, training_data: Dataset) -> Predictor:
--> 148         return self.train_model(training_data).predictor

/usr/local/lib/python3.6/dist-packages/pts/model/estimator.py in train_model(self, training_data)
    134             net=trained_net,
    135             input_names=get_module_forward_input_names(trained_net),
--> 136             data_loader=training_data_loader,
    137         )
    138 

/usr/local/lib/python3.6/dist-packages/pts/trainer.py in __call__(self, net, input_names, data_loader)
     50                     inputs = [data_entry[k].to(self.device) for k in input_names]
     51 
---> 52                     output = net(*inputs)
     53                     if isinstance(output, (list, tuple)):
     54                         loss = output[0]

/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
    720             result = self._slow_forward(*input, **kwargs)
    721         else:
--> 722             result = self.forward(*input, **kwargs)
    723         for hook in itertools.chain(
    724                 _global_forward_hooks.values(),

/usr/local/lib/python3.6/dist-packages/pts/model/transformer_tempflow/transformer_tempflow_network.py in forward(self, target_dimension_indicator, past_time_feat, past_target_cdf, past_observed_values, past_is_pad, future_time_feat, future_target_cdf, future_observed_values)
    350 
    351         enc_out = self.transformer.encoder(
--> 352             self.encoder_input(enc_inputs).permute(1, 0, 2)
    353         )
    354 

/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
    720             result = self._slow_forward(*input, **kwargs)
    721         else:
--> 722             result = self.forward(*input, **kwargs)
    723         for hook in itertools.chain(
    724                 _global_forward_hooks.values(),

/usr/local/lib/python3.6/dist-packages/torch/nn/modules/linear.py in forward(self, input)
     89 
     90     def forward(self, input: Tensor) -> Tensor:
---> 91         return F.linear(input, self.weight, self.bias)
     92 
     93     def extra_repr(self) -> str:

/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py in linear(input, weight, bias)
   1674         ret = torch.addmm(bias, input, weight.t())
   1675     else:
-> 1676         output = input.matmul(weight.t())
   1677         if bias is not None:
   1678             output += bias

RuntimeError: mat1 dim 1 must match mat2 dim 0

kashif commented 3 years ago

@NielsRogge so note that the m5 multivariate vector is going to be ~30K-dimensional per time step, and together with the covariates it becomes even bigger... so it will not be possible, from a memory point of view, to model the temporal dynamics (even though the normalizing flow itself can handle dimensions as big as this...)

Also, did you use the GitHub master to generate the m5 files?

NielsRogge commented 3 years ago

I copied the code from _m5.py from the master branch on GitHub and ran it in a Colab notebook, yes. I saw you updated the code 20 days ago, so that fix was included.

At first I got a CUDA error because memory was full, but I lowered the batch size to 20. I think the error I'm getting now has something to do with input_size.

kashif commented 3 years ago

Right, input_size will be the size of the feature vector composed of the multivariate target together with the covariates, which is of order O(30K), and that is why this will not work for the m5 multivariate use case...

NielsRogge commented 3 years ago

Ok, and there's no way to infer what input_size should be? Should I first aggregate the sales before using TransformerTempFlow?

kashif commented 3 years ago

It's possible to infer the input_size, but it depends on the parameters of the estimator as well as on the data (a different freq, for example, will create different-sized time features), so for now the user has to specify it...
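
To illustrate the freq dependence (a sketch, assuming pts mirrors GluonTS's time_features_from_frequency_str helper; the import path is an assumption and it may live under gluonts.time_feature instead):

from pts.feature import time_features_from_frequency_str  # assumed to mirror the GluonTS helper

# daily and hourly frequencies yield different default time-feature sets,
# which shifts the per-step input dimension the estimator must be told about
print(len(time_features_from_frequency_str("D")))
print(len(time_features_from_frequency_str("H")))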

Yes, so if you can create an aggregated version of M5 with, say, an O(1K) multivariate vector, then these methods should work (from a memory point of view). I talk about this issue in the last sentence of the paper... I don't know what your plans are with this? Do you want to use it for something practical, or are you interested in research?
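
A minimal sketch of such an aggregation with pandas (not part of pytorch-ts; it assumes sales_train_validation.csv from the Kaggle competition files), reducing the 30,490 item-level series to 70 store/department series:

import pandas as pd

# M5 item-level sales: one row per (item, store), with day columns d_1 ... d_1913
sales = pd.read_csv("sales_train_validation.csv")
day_cols = [c for c in sales.columns if c.startswith("d_")]

# sum items within each store/department: 10 stores x 7 departments = 70 series
agg = sales.groupby(["store_id", "dept_id"])[day_cols].sum()
print(agg.shape)  # (70, 1913)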

NielsRogge commented 3 years ago

We want to benchmark several forecasting algorithms, both classical approaches and state of the art, on the same dataset (M5).

kashif commented 3 years ago

I see... but currently with M5 only a univariate (point or probabilistic) forecast is possible, due to the scaling issues described above; for other, smaller datasets, however, multivariate probabilistic forecasting is possible...
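
For reference, a hedged sketch of that univariate route using pts's DeepAREstimator on the ungrouped dataset (the parameter values, in particular input_size, are illustrative and would need to match the model's per-step feature dimension for daily data):

import torch
from pts import Trainer
from pts.model.deepar import DeepAREstimator

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

estimator = DeepAREstimator(
    freq=dataset.metadata.freq,
    prediction_length=dataset.metadata.prediction_length,
    input_size=19,  # hypothetical value; must match the per-step feature dim
    trainer=Trainer(device=device, epochs=10, batch_size=32,
                    num_batches_per_epoch=100),
)

# one shared model, but each of the 30,490 series is forecast individually
# (univariate), which sidesteps the 30K-dimensional multivariate input
predictor = estimator.train(dataset.train)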