zubatyuk / aimnet2

AIMNet2: Fast, accurate and transferable neural network interatomic potential

Providing model weights does not resume from checkpoint #4

Closed JSLJ23 closed 2 months ago

JSLJ23 commented 2 months ago

I am attempting to resume from a checkpoint because my Slurm run timed out: training is taking extremely long, and multi-GPU doesn't seem to offer any speedup.

aimnet train \
    --config /path/to/config.yaml \
    --load /path/to/model_checkpoint_10.pt

But the training just restarts from epoch 1. Is there anything else that needs to be done to support resuming the training from the checkpoint file?

zubatyuk commented 2 months ago

Hi @JSLJ23

training is taking extremely long

Please define "long". How much time per step and per sample? The number of training steps and training samples needed depends on the complexity of the dataset.

multi-GPU doesn't seem to be offering any speedup

Note that in a DDP setting, your effective batch size scales with the number of GPUs.
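As a back-of-envelope sketch (the GPU count below is hypothetical; the batch and dataset sizes are taken loosely from this thread), the per-rank batch multiplies out like this:

```python
# Rough sketch of DDP batch scaling; the GPU count is illustrative,
# not taken from any AIMNet2 config.
per_gpu_batch = 1024              # batch size configured per rank
num_gpus = 4                      # hypothetical DDP world size
effective_batch = per_gpu_batch * num_gpus

dataset_size = 190_000            # ~190K molecules, as reported in this thread
# Each rank processes its own batch, so steps per pass shrink accordingly.
steps_per_pass = -(-dataset_size // effective_batch)  # ceil division
print(effective_batch, steps_per_pass)  # 4096 47
```

With fewer steps per pass but the same per-step cost, wall-clock time per pass drops only if the larger effective batch is actually exploited (e.g. by rescaling the learning rate), which is one reason multi-GPU runs can look "no faster" per epoch.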

But the training just restarts from epoch 1.

You should see `INFO:root:Loading weights from file` in the log. The Ignite trainer state is not saved, just the model weights, so the epoch counter restarts by design. Please adjust the LR scheduler via the config file if you need to.

JSLJ23 commented 2 months ago

Each epoch of about 190K molecules takes 2+ hours with a batch size of 1024.

For the multi-GPU runs, this warning is shown, and the epoch speed is either the same as or sometimes slower than running on a single GPU.

    /scratch/pawsey0799/joshua_soon/mamba_envs/aimnet2_py311/lib/python3.11/site-packages/torch/autograd/graph.py:744: UserWarning: Grad strides do not match bucket view strides. This may indicate grad was not created according to the gradient layout contract, or that the param's strides changed since DDP was constructed.  This is not an error, but may impair performance.
    grad.sizes() = [1, 128], strides() = [1, 1]
    bucket_view.sizes() = [1, 128], strides() = [128, 1] (Triggered internally at ../torch/csrc/distributed/c10d/reducer.cpp:325.)
      return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass

zubatyuk commented 2 months ago

Each epoch of about 190K molecules is taking 2 hours+ with a batch size of 1024.

You're missing the dataset size here. Choosing the right batch size for the hardware and dataset is tricky. For single-GPU runs, a 16k-atom batch works well for datasets with an average molecule size of about 30-40 atoms. In a DDP setting that might result in under-utilization of the GPUs, and very large batch sizes can produce less accurate models.

For the multi-GPU runs, this warning is shown and the epoch speed is either the same or sometimes slower than running on a single GPU.

The reason for this warning is double differentiation, since you're training on forces. I would welcome any ideas for getting rid of this warning or pre-allocating the buffers correctly to make DDP training more efficient. However, at present I do not notice any serious performance degradation in the DDP setting.
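A minimal sketch (toy energy function, not AIMNet2's actual model) of why force training makes a second pass through the autograd graph: forces are -dE/dx, and the force loss is then backpropagated to the parameters, which requires `create_graph=True` on the first gradient call:

```python
# Toy example of double differentiation in force training.
# The "model" here is a made-up quadratic energy, not AIMNet2.
import torch

def energy(coords, weight):
    # toy potential: E = w * sum(x_i^2)
    return weight * (coords ** 2).sum()

coords = torch.tensor([1.0, 2.0], requires_grad=True)
weight = torch.tensor(0.5, requires_grad=True)

E = energy(coords, weight)
# create_graph=True keeps the graph so the force error can be backpropagated
forces = -torch.autograd.grad(E, coords, create_graph=True)[0]

target_forces = torch.tensor([-1.0, -2.0])
loss = ((forces - target_forces) ** 2).sum()
loss.backward()  # second backward pass, through the force computation
```

It is this second backward pass that produces gradients whose strides can disagree with DDP's pre-allocated bucket views, hence the warning.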

zubatyuk commented 2 months ago

Each epoch of about 190K molecules is taking 2 hours+ with a batch size of 1024.

You're saying that ~190 training steps take ~2 hours? What is the molecule size distribution? Are the GPUs actually being utilized? Note that by default the model complexity is O(N^2). It has an O(N) mode, which breaks even with O(N^2) at molecules of about 250-300 atoms.
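To illustrate the break-even point (the cost constants here are made up for the sketch; only the ~250-300 atom crossover comes from the comment above):

```python
# Toy cost model for the two complexity modes; constants are illustrative.
# c1 is chosen so the curves cross in the quoted ~250-300 atom range.
BREAK_EVEN = 275  # atoms, midpoint of the quoted 250-300 range

def cost_quadratic(n_atoms: int) -> float:
    return float(n_atoms ** 2)          # O(N^2) default mode

def cost_linear(n_atoms: int) -> float:
    return float(BREAK_EVEN * n_atoms)  # O(N) mode with a larger constant

# Below the crossover the quadratic mode is cheaper; above it, linear wins.
print(cost_quadratic(50) < cost_linear(50))    # True
print(cost_quadratic(500) > cost_linear(500))  # True
```

For the 30-60 atom molecules in this thread, the default O(N^2) mode is therefore the right choice; the O(N) mode only pays off for much larger systems.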

JSLJ23 commented 2 months ago

I am not sure what "training steps" refers to here, but the dataset size is:

    INFO:root:Randomly train dataset into train and val datasets, sizes 174756 and 19418 10.0%.

Molecules within the dataset range from 30-60 atoms.

And yes, the GPU(s) are in use:

    INFO:root:AMD Instinct MI250X

zubatyuk commented 2 months ago

~200 training steps should take about a minute or less on a decent GPU. It is difficult to say what is going wrong on your side without seeing all the log files.

JSLJ23 commented 2 months ago

slurm-14854324.txt

I've attached the Slurm output showing the epoch time of 2+ hours. Could you clarify what is meant by "training step"? I still don't quite get what it means.

zubatyuk commented 2 months ago

In the config, by default you have `data.samplers.train.kwargs.batches_per_epoch = 10000`. Change it to -1 to have each sample loaded exactly once per epoch, or to any other number to control the frequency of your validation epochs.
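The arithmetic bears this out (the dataset and batch sizes below are taken from the log earlier in this thread):

```python
# Why the epochs were so long: with the default 10000 batches per "epoch",
# one epoch runs ~58x as many steps as a single pass over this dataset.
import math

train_size = 174_756              # from the log earlier in this thread
batch_size = 1024
default_batches_per_epoch = 10_000

# With batches_per_epoch = -1, an "epoch" is one full pass over the data:
one_pass = math.ceil(train_size / batch_size)
ratio = default_batches_per_epoch / one_pass
print(one_pass, round(ratio))  # 171 58
```

A ~58x reduction in steps per epoch is consistent with the drop from 2+ hours to about 3 minutes reported below.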

JSLJ23 commented 2 months ago

Great, thanks for the tip! Setting `data.samplers.train.kwargs.batches_per_epoch` to -1 resulted in each epoch only taking about 3 minutes.