Closed linfeng-du closed 1 year ago
Hi, could you please share the memory required to pretrain on the Electricity and Traffic datasets? Neither seems to fit on a single 32GB V100 due to the number of variates they have. Did you apply distributed training, or were you able to pretrain on a single A40? Thanks!

Hi, it depends on the prediction length, batch size, look-back window, and so on. But generally speaking, for those two large datasets we often use 4 (up to 8) 3090 or A5000 GPUs, and we also decrease the batch size. For the other datasets, one 3090 is sufficient.

Thank you for your quick reply! However, I am specifically referring to the pre-training phase, which has no prediction length since we are reconstructing masked patches.

I also noticed that the default context length for pre-training is 512, which differs from the look-back window length used in the downstream forecasting tasks. I would just like to confirm that this is intended.

Hi Linfeng, for these large datasets we trained on A100 GPUs with 80GB of memory. These datasets contain a large number of variates, so 32GB is not sufficient unless you reduce the number of input tokens.
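For anyone landing on this thread: a rough sketch of why the variate count dominates memory here. In channel-independent patch-based pre-training, each variate is patched into its own token sequence, so the number of input tokens per window grows linearly with the number of variates. The patch length and stride below (12/12) are illustrative assumptions, not necessarily this repository's defaults; the variate counts are the standard ones for these benchmarks (Electricity: 321, Traffic: 862, ETTh1: 7).

```python
# Back-of-envelope input-token count for channel-independent,
# patch-based pre-training. Patch length and stride are assumed
# values for illustration only.

def num_input_tokens(context_len: int, n_variates: int,
                     patch_len: int = 12, stride: int = 12) -> int:
    """Tokens per window: patches per series times variates.

    Each variate is treated as an independent sample, so the
    token count (and hence activation memory) scales linearly
    with the number of variates.
    """
    patches_per_series = (context_len - patch_len) // stride + 1
    return n_variates * patches_per_series

# Standard variate counts for the benchmarks discussed above.
for name, v in [("ETTh1", 7), ("Electricity", 321), ("Traffic", 862)]:
    print(f"{name}: {num_input_tokens(context_len=512, n_variates=v)} tokens")
```

With a 512-step context this gives roughly 42 patches per series, so Traffic produces two orders of magnitude more tokens per window than ETTh1, which is consistent with one 3090 sufficing for the small datasets while Electricity and Traffic need 80GB A100s or a reduced batch size.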