yuqinie98 / PatchTST

An official implementation of PatchTST: "A Time Series is Worth 64 Words: Long-term Forecasting with Transformers." (ICLR 2023) https://arxiv.org/abs/2211.14730
Apache License 2.0

Memory required to pretrain on Electricity and Traffic #48

Closed · linfeng-du closed this issue 1 year ago

linfeng-du commented 1 year ago

Hi, could you please share the memory required to pretrain on the Electricity and Traffic datasets? It seems that neither fits on a single 32GB V100 due to the number of variates they have. Did you apply distributed training, or were you able to pretrain on a single A40? Thanks!

yuqinie98 commented 1 year ago

Hi, it depends on the prediction length, batch size, look-back window, etc. But generally speaking, for those two large datasets we often use 4 (up to 8) 3090 or A5000 GPUs, and we also decrease the batch size. For the other datasets, a single 3090 is sufficient.
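For a rough sense of why these two datasets are so much heavier: PatchTST is channel-independent, so each variate is encoded as its own univariate sequence, and the effective number of sequences per optimizer step scales as batch_size × n_vars. A minimal back-of-the-envelope sketch follows; the variate counts (Electricity: 321, Traffic: 862, Weather: 21, ETTh1: 7) come from the standard benchmarks, while the batch sizes are illustrative assumptions rather than the repo's defaults.

```python
# Back-of-the-envelope sketch: how per-step workload scales with batch size
# and the number of variates in channel-independent PatchTST.
# Batch sizes below are illustrative assumptions, not the repo defaults.

def sequences_per_step(batch_size: int, n_vars: int) -> int:
    """Each variate is treated as a separate univariate sequence."""
    return batch_size * n_vars

datasets = {"ETTh1": 7, "Weather": 21, "Electricity": 321, "Traffic": 862}

for name, n_vars in datasets.items():
    for bs in (8, 32, 128):
        seqs = sequences_per_step(bs, n_vars)
        print(f"{name:12s} batch_size={bs:4d} -> {seqs:7d} sequences/step")
```

The same batch size therefore costs roughly 120x more activation memory on Traffic (862 variates) than on ETTh1 (7 variates), which is why reducing the batch size or sharding across GPUs is needed for the two large datasets.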

linfeng-du commented 1 year ago

Thank you for your quick reply! However, I am specifically referring to the pre-training phase, which does not have a prediction length since we're reconstructing masked patches.

Also, I noticed that the default context length for pre-training is 512, which differs from the look-back window length used in the downstream forecasting tasks. I'd just like to confirm whether this is intended.

namctin commented 1 year ago

Hi Linfeng, for these large datasets we trained on A100 GPUs with 80GB memory. These datasets contain a large number of variates, so 32GB is likely not sufficient unless you reduce the number of input tokens.
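On "reduce the number of input tokens": in PatchTST the token count per variate is the number of patches, which follows from the context length, patch length, and stride, and self-attention memory grows roughly quadratically in that count. The sketch below shows the arithmetic; the patch_len/stride combinations are assumptions for illustration and may differ from the settings used in the repo.

```python
# Sketch: number of input tokens (patches) per variate as a function of
# context length, patch length, and stride (no end-padding assumed),
# and the rough quadratic attention cost that comes with it.
# The patch_len/stride values are illustrative assumptions.

def num_patches(context_len: int, patch_len: int, stride: int) -> int:
    """Strided patching of a univariate series into non-overlapping or overlapping patches."""
    return (context_len - patch_len) // stride + 1

for context_len, patch_len, stride in [(512, 12, 12), (512, 16, 8), (336, 12, 12)]:
    n = num_patches(context_len, patch_len, stride)
    print(f"context={context_len:3d} patch_len={patch_len:2d} stride={stride:2d} "
          f"-> {n:3d} tokens (attention cost ~ {n * n:5d})")
```

So a shorter context length or a larger stride shrinks the token count per variate, which, combined with a smaller batch size, is the main lever for fitting the pretraining run into less GPU memory.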