Using CUDA with the device is more time-consuming than using CPU?

yinkaaiwu commented 1 year ago

Here is the relax_log.txt produced when I ran demonstration.ipynb.

Although the number of steps is different, it's clear that when device=CPU, both training time and NN relaxation time are much lower than when device=cuda. I'd like to ask for your insights on the possible reasons for this. Do you have any optimization ideas or relevant demos? Thank you.

CPU:
Step 0: get groud truth data
Step 0: groud truth data calculation done
[2.411, 2.71, 2.411, 2.673, 2.817, 2.713, 2.583, 2.455, 2.356, 2.481]
Step 0: start training
Step 0: training done, time: 65.63558578491211 s
Step 0: start NN relaxation
Step 0: NN relaxation done, time: 31.031465530395508 s

Step 1: get groud truth data
Step 1: groud truth data calculation done
max force for each configuration: [0.82, 0.876, 0.821, 0.884, 1.108, 0.701, 1.04, 0.65, 0.682, 1.071]
Step 1: start training
Step 1: training done, time: 146.79284381866455 s
Step 1: start NN relaxation
Step 1: NN relaxation done, time: 24.376449584960938 s

Step 2: get groud truth data
Step 2: groud truth data calculation done
max force for each configuration: [0.049, 0.265, 0.033, 0.257, 0.313, 0.029, 0.271, 0.034, 0.03, 0.316]
Step 2: start training
Step 2: training done, time: 126.58962988853455 s
Step 2: start NN relaxation
Step 2: NN relaxation done, time: 13.029428720474243 s

Step 3: get groud truth data
Step 3: groud truth data calculation done
max force for each configuration: [0.037, 0.031, 0.032, 0.033, 0.029]

GPU:
Step 0: get groud truth data
Step 0: groud truth data calculation done
max force for each configuration: [2.411, 2.71, 2.411, 2.673, 2.817, 2.713, 2.583, 2.455, 2.356, 2.481]
Step 0: start training
Step 0: training done, time: 191.37967467308044 s
Step 0: start NN relaxation
Step 0: NN relaxation done, time: 47.26152324676514 s

Step 1: get groud truth data
Step 1: groud truth data calculation done
max force for each configuration: [0.419, 0.509, 0.659, 0.638, 0.925, 0.406, 0.962, 0.493, 0.336, 0.711]
Step 1: start training
Step 1: training done, time: 345.57629013061523 s
Step 1: start NN relaxation
Step 1: NN relaxation done, time: 60.44816279411316 s

Step 2: get groud truth data
Step 2: groud truth data calculation done
max force for each configuration: [0.038, 0.031, 0.029, 0.034, 0.033, 0.044, 0.034, 0.033, 0.027, 0.041]
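
To compare the two runs step by step rather than by eye, the timing entries can be pulled out of the log text with a regular expression. The following is a minimal sketch, assuming the log entries keep the "Step N: ... done, time: X s" wording shown above. The names `parse_timings`, `cpu_log` and `gpu_log` are hypothetical helpers for illustration and are not part of the repository; the sample strings are excerpts of the logs above.

```python
import re

# Matches entries such as "Step 1: training done, time: 146.79 s".
TIMING = re.compile(r"Step (\d+): (training|NN relaxation) done, time: ([\d.]+) s")


def parse_timings(log_text):
    """Return {step: {"training": seconds, "NN relaxation": seconds}}."""
    timings = {}
    for step, phase, seconds in TIMING.findall(log_text):
        timings.setdefault(int(step), {})[phase] = float(seconds)
    return timings


# Excerpts of the timing entries from the two logs above.
cpu_log = """
Step 0: training done, time: 65.63558578491211 s
Step 0: NN relaxation done, time: 31.031465530395508 s
Step 1: training done, time: 146.79284381866455 s
Step 1: NN relaxation done, time: 24.376449584960938 s
"""
gpu_log = """
Step 0: training done, time: 191.37967467308044 s
Step 0: NN relaxation done, time: 47.26152324676514 s
Step 1: training done, time: 345.57629013061523 s
Step 1: NN relaxation done, time: 60.44816279411316 s
"""

cpu, gpu = parse_timings(cpu_log), parse_timings(gpu_log)

# Only compare steps that both runs completed.
for step in sorted(cpu.keys() & gpu.keys()):
    for phase in ("training", "NN relaxation"):
        ratio = gpu[step][phase] / cpu[step][phase]
        print(f"step {step} {phase}: GPU/CPU = {ratio:.2f}")
```

Step 0 is the most direct comparison, since both runs report identical max forces there and so start from the same ground truth data; later steps train on different configurations.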

yinkaaiwu commented 1 year ago

I modified your code so that it can use the GPU as the device. Below is the part I changed in 'train_agent.py'; this is the only change I made.

def move_data_to_device(data, device):
    if data is None:
        return None
    for key, value in data.items():
        if isinstance(value, torch.Tensor):
            data[key] = value.to(device)
    return data

class Agent(object):
    def __init__(self, train_data, valid_data, scale_const, model_path, test_data=None, layer_nodes=[10, 10],
                 activation=['tanh', 'tanh'], lr=1, max_iter=20, history_size=100, device=torch.device('cuda:0')):
        """
        scale_const: energy scaling factor
        layer_nodes: list of int, number of nodes in each layer
        activation: list of str, each "tanh", "Sigmoid" or "relu"
        lr, max_iter and history_size: float, int, int, parameters for the LBFGS optimizer in PyTorch
        device: torch.device, cpu or cuda
        """
        n_element = train_data['b_e_mask'].size(2)
        n_fp = train_data['b_fp'].size(2)

        self.train_data = move_data_to_device(train_data, device)
        self.valid_data = move_data_to_device(valid_data, device)
        self.test_data = test_data
        self.scale_const = scale_const

        self.model = BPNN(n_fp, layer_nodes, activation, n_element).to(device)
        self.optimizer = torch.optim.LBFGS(self.model.parameters(), lr=lr, max_iter=max_iter,
                                           history_size=history_size, line_search_fn='strong_wolfe')

        self.model_path = model_path
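
One thing that may be worth checking with a network as small as the default `layer_nodes=[10, 10]` is how long a single forward/backward pass takes on each device, since per-call GPU overhead can outweigh the arithmetic for tiny models. The sketch below is an assumption-laden stand-in, not the repository's `BPNN`: the layer sizes, batch size and the helper name `time_training_steps` are made up for illustration. It calls `torch.cuda.synchronize()` before reading the clock because CUDA kernels run asynchronously.

```python
import time

import torch


def time_training_steps(device, n_steps=50, n_fp=20, hidden=10, batch=64):
    """Average seconds per forward/backward pass of a tiny MLP on `device`."""
    torch.manual_seed(0)
    # Hypothetical stand-in for a small per-element network; not the real BPNN.
    model = torch.nn.Sequential(
        torch.nn.Linear(n_fp, hidden), torch.nn.Tanh(),
        torch.nn.Linear(hidden, hidden), torch.nn.Tanh(),
        torch.nn.Linear(hidden, 1),
    ).to(device)
    x = torch.randn(batch, n_fp, device=device)
    y = torch.randn(batch, 1, device=device)

    def step():
        model.zero_grad()
        loss = torch.nn.functional.mse_loss(model(x), y)
        loss.backward()

    for _ in range(5):  # warm-up so one-time initialization is not timed
        step()
    if device.type == "cuda":
        torch.cuda.synchronize()  # wait for queued kernels before starting the clock
    start = time.perf_counter()
    for _ in range(n_steps):
        step()
    if device.type == "cuda":
        torch.cuda.synchronize()  # wait for queued kernels before stopping the clock
    return (time.perf_counter() - start) / n_steps


devices = [torch.device("cpu")]
if torch.cuda.is_available():
    devices.append(torch.device("cuda:0"))

results = {str(d): time_training_steps(d) for d in devices}
for name, seconds in results.items():
    print(f"{name}: {seconds * 1e3:.3f} ms per step")
```

If the per-step time on `cuda:0` is not lower than on `cpu` for sizes close to the real data, the model is probably too small to benefit from the GPU, whichever optimizer is used.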

jkitchin commented 1 year ago

I am not sure why it would take longer with CUDA. We have moved on to using OCP (https://github.com/Open-Catalyst-Project/ocp) models instead of this. See Musielewicz, J., Wang, X., Tian, T., & Ulissi, Z. (2022). Finetuna: fine-tuning accelerated molecular simulations. Machine Learning: Science and Technology, 3(3), 03LT01. http://dx.doi.org/10.1088/2632-2153/ac8fe0 for the approach we recommend instead.

yinkaaiwu commented 1 year ago

Thank you for your reply. I think using SGD or Adam works better than LBFGS on CUDA. I don't know why, but that is what my tests show.
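
One possible contributor, not confirmed anywhere in this thread, is that `torch.optim.LBFGS` with `line_search_fn='strong_wolfe'` re-evaluates the loss and gradient several times inside a single `step()` call, whereas Adam evaluates them once per step. On a GPU each evaluation of a tiny network launches several small kernels, so the overhead multiplies. The sketch below counts closure evaluations for both optimizers on a made-up toy regression problem. The model, the data and the helper name `count_evaluations` are hypothetical; the LBFGS arguments mirror the defaults in the `Agent` constructor above.

```python
import torch


def count_evaluations(optimizer_name, n_steps=5):
    """Count how often the loss closure runs over `n_steps` optimizer steps."""
    torch.manual_seed(0)
    # Toy model and data, made up for illustration only.
    model = torch.nn.Sequential(
        torch.nn.Linear(4, 10), torch.nn.Tanh(), torch.nn.Linear(10, 1)
    )
    x = torch.randn(32, 4)
    y = torch.randn(32, 1)

    if optimizer_name == "lbfgs":
        optimizer = torch.optim.LBFGS(
            model.parameters(), lr=1, max_iter=20,
            history_size=100, line_search_fn="strong_wolfe",
        )
    else:
        optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

    calls = 0

    def closure():
        # Each call is one full forward and backward pass.
        nonlocal calls
        calls += 1
        optimizer.zero_grad()
        loss = torch.nn.functional.mse_loss(model(x), y)
        loss.backward()
        return loss

    for _ in range(n_steps):
        optimizer.step(closure)
    return calls


adam_calls = count_evaluations("adam")
lbfgs_calls = count_evaluations("lbfgs")
print(f"Adam:  {adam_calls} evaluations in 5 steps")
print(f"LBFGS: {lbfgs_calls} evaluations in 5 steps")
```

This only shows that the two optimizers do different amounts of work per `step()` call; it says nothing about which one reaches a given accuracy sooner, so a fair comparison would time both to the same validation error.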

yilinyang1 / NN-ensemble-relaxer

Using CUDA with the device is more time-consuming than using CPU? #4