Closed kriskrisliu closed 1 year ago
Hi, my sincere apologies for this bug. Checkpointing was disabled when we tested our migrated code (since the model is large, saving checkpoints takes time and storage). You can modify the code as shown here: https://github.com/Flash-321/ARLDM/blob/f44277744517041ac9a955794c4f3a5f73d59eb9/main.py#L403 With that change, it should save only the last checkpoint.
You can also customize checkpoint saving behavior according to: https://pytorch-lightning.readthedocs.io/en/stable/api/pytorch_lightning.callbacks.ModelCheckpoint.html
Thanks a lot for pointing out this bug!
Fortunately, I only ran a 5-epoch training. Thanks for the reply~ BTW, I'm training the model on 4x A100 and it takes ~6 hours per epoch. Does that sound right? How long does a full training run take (say, 50 epochs on 8x A100)?
That seems a little slow. I trained the model on an 8x A100 node, and it took 2-3 days to finish training. I suggest checking whether your 4 A100 GPUs are on the same node. Also check whether the backward pass takes much longer than the forward pass (usually they should take about the same time); you can easily set up a profiler to measure the time cost of each component (see https://pytorch-lightning.readthedocs.io/en/1.6.4/advanced/profiler.html). And num_workers is set to 16 in your config. For me, that was much slower than 4, so maybe test this setting on your machine.
I ran the training process with the config file below. Everything looked fine during training. However, when training finished, I found no ckpt file in ckpt_dir. Did I miss anything?