Open · jtamir opened this issue 5 years ago
Also having this one
I might need to update that post, but run these demos instead:
https://github.com/williamFalcon/pytorch-lightning/tree/master/examples/multi_node_examples
Could you post your code and error?
I'm kind of looking for a way to evaluate each set of hyperparams on a separate GPU in parallel, not train a single model on multiple GPUs. I've tried this:
```python
def train_one(hparam, gpu_id_set):
    # load data, create model, create logger and checkpoint callback
    trainer = Trainer(logger=tt_logger, checkpoint_callback=checkpoint_callback,
                      gpus=[int(gpu_id_set)], max_nb_epochs=hparam.epochs, weights_summary=None)
    trainer.fit(model)
    trainer.test(model)

hparams.optimize_parallel_gpu(train_one, max_nb_trials=400, gpu_ids=gpu_ids)
```
Here `gpu_ids` is `['0', '1']`.
And here's the output:
```
gpu available: True, used: True
VISIBLE GPUS: 0
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1565272271120/work/aten/src/THC/THCGeneral.cpp line=54 error=3 : initialization error
Caught exception in worker thread cuda runtime error (3) : initialization error at /opt/conda/conda-bld/pytorch_1565272271120/work/aten/src/THC/THCGeneral.cpp:54
Traceback (most recent call last):
File "/home/akonstantinov/.miniconda3/envs/cpn/lib/python3.7/site-packages/test_tube/argparse_hopt.py", line 37, in optimize_parallel_gpu_private
results = train_function(trial_params, gpu_id_set)
File "main.py", line 40, in train_one
trainer.fit(model)
File "/home/akonstantinov/.miniconda3/envs/cpn/lib/python3.7/site-packages/pytorch_lightning-0.5.1.3-py3.7.egg/pytorch_lightning/trainer/trainer.py", line 754, in fit
self.__single_gpu_train(model)
File "/home/akonstantinov/.miniconda3/envs/cpn/lib/python3.7/site-packages/pytorch_lightning-0.5.1.3-py3.7.egg/pytorch_lightning/trainer/trainer.py", line 793, in __singl
e_gpu_train
model.cuda(self.root_gpu)
File "/home/akonstantinov/.miniconda3/envs/cpn/lib/python3.7/site-packages/torch/nn/modules/module.py", line 311, in cuda
return self._apply(lambda t: t.cuda(device))
File "/home/akonstantinov/.miniconda3/envs/cpn/lib/python3.7/site-packages/torch/nn/modules/module.py", line 208, in _apply
module._apply(fn)
File "/home/akonstantinov/.miniconda3/envs/cpn/lib/python3.7/site-packages/torch/nn/modules/module.py", line 208, in _apply
module._apply(fn)
File "/home/akonstantinov/.miniconda3/envs/cpn/lib/python3.7/site-packages/torch/nn/modules/module.py", line 230, in _apply
param_applied = fn(param)
File "/home/akonstantinov/.miniconda3/envs/cpn/lib/python3.7/site-packages/torch/nn/modules/module.py", line 311, in <lambda>
return self._apply(lambda t: t.cuda(device))
File "/home/akonstantinov/.miniconda3/envs/cpn/lib/python3.7/site-packages/torch/cuda/__init__.py", line 179, in _lazy_init
torch._C._cuda_init()
RuntimeError: cuda runtime error (3) : initialization error at /opt/conda/conda-bld/pytorch_1565272271120/work/aten/src/THC/THCGeneral.cpp:54
```
After this, the exact same message appears again but for GPU 1, and then the process seems to hang. When I kill it with Ctrl-C, it also outputs:
```
Traceback (most recent call last):
File "main.py", line 65, in <module>
hparams.optimize_parallel_gpu(train_one, max_nb_trials=400, gpu_ids=gpu_ids)
File "/home/akonstantinov/.miniconda3/envs/cpn/lib/python3.7/site-packages/test_tube/argparse_hopt.py", line 323, in optimize_parallel_gpu
results = self.pool.map(optimize_parallel_gpu_private, self.trials)
File "/home/akonstantinov/.miniconda3/envs/cpn/lib/python3.7/multiprocessing/pool.py", line 268, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "/home/akonstantinov/.miniconda3/envs/cpn/lib/python3.7/multiprocessing/pool.py", line 651, in get
self.wait(timeout)
File "/home/akonstantinov/.miniconda3/envs/cpn/lib/python3.7/multiprocessing/pool.py", line 648, in wait
self._event.wait(timeout)
File "/home/akonstantinov/.miniconda3/envs/cpn/lib/python3.7/threading.py", line 552, in wait
signaled = self._cond.wait(timeout)
File "/home/akonstantinov/.miniconda3/envs/cpn/lib/python3.7/threading.py", line 296, in wait
waiter.acquire()
KeyboardInterrupt
```

Any help?
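For reference, a minimal sketch of this pattern that sidesteps test_tube's pool entirely: one spawn-started process per GPU, each evaluating its share of the hyperparameter sets sequentially, so CUDA is never initialized in a forked child. `build_model` and `hparam_list` are hypothetical placeholders, and the `Trainer` arguments mirror the 0.5.x-era API used above.

```python
# Sketch only (not test_tube's mechanism): one spawn-started process per GPU,
# each evaluating its share of the hyperparameter sets sequentially.
# build_model and hparam_list are hypothetical placeholders.
from multiprocessing import get_context

from pytorch_lightning import Trainer


def run_trials_on_gpu(gpu_id, hparam_subset):
    # All CUDA work happens inside the child process, never in the parent.
    for hparam in hparam_subset:
        model = build_model(hparam)                       # hypothetical helper
        trainer = Trainer(gpus=[gpu_id], max_nb_epochs=hparam.epochs,
                          weights_summary=None)
        trainer.fit(model)
        trainer.test(model)


if __name__ == '__main__':
    gpu_ids = [0, 1]
    hparam_list = []              # placeholder: one hyperparameter set per trial
    ctx = get_context('spawn')    # fresh interpreters, no fork/CUDA clash
    procs = []
    for i, gpu_id in enumerate(gpu_ids):
        subset = hparam_list[i::len(gpu_ids)]             # round-robin split
        p = ctx.Process(target=run_trials_on_gpu, args=(gpu_id, subset))
        p.start()
        procs.append(p)
    for p in procs:
        p.join()
```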
I'm having a similar issue, but without multi-node (I have two GPUs in my PC).
```
Traceback (most recent call last):
File "D:/Users//MFT/MFT/simulation_runner.py", line 40, in <module>
hparams.optimize_parallel_gpu('test', gpu_ids=['0'], max_nb_trials=1)
File "D:\Users\\anaconda3\envs\pytorch\lib\site-packages\test_tube\argparse_hopt.py", line 322, in optimize_parallel_gpu
self.pool = Pool(processes=nb_workers, initializer=init, initargs=(gpu_q,))
File "D:\Users\\anaconda3\envs\pytorch\lib\multiprocessing\context.py", line 119, in Pool
context=self.get_context())
File "D:\Users\\anaconda3\envs\pytorch\lib\multiprocessing\pool.py", line 174, in __init__
self._repopulate_pool()
File "D:\Users\\anaconda3\envs\pytorch\lib\multiprocessing\pool.py", line 239, in _repopulate_pool
w.start()
File "D:\Users\\anaconda3\envs\pytorch\lib\multiprocessing\process.py", line 105, in start
self._popen = self._Popen(self)
File "D:\Users\\anaconda3\envs\pytorch\lib\multiprocessing\context.py", line 322, in _Popen
return Popen(process_obj)
File "D:\Users\\anaconda3\envs\pytorch\lib\multiprocessing\popen_spawn_win32.py", line 65, in __init__
reduction.dump(process_obj, to_child)
File "D:\Users\\anaconda3\envs\pytorch\lib\multiprocessing\reduction.py", line 60, in dump
ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'HyperOptArgumentParser.optimize_parallel_gpu.<locals>.init'
```
and here is the entry point:
```python
def main(hparams):
    early_stopping = EarlyStopping('val_acc', patience=20)
    trainer = Trainer(
        max_nb_epochs=50,
        gpus=[0],
        early_stop_callback=early_stopping,
        train_percent_check=1,
        check_val_every_n_epoch=1,
        val_percent_check=1
    )
    system = ParkinsonDecisionSystem(hparams)
    if hparams.evaluate:
        trainer.run_evaluation()
    else:
        trainer.fit(system)


if __name__ == '__main__':
    parent_parser = HyperOptArgumentParser(strategy='grid_search')
    parent_parser.opt_list('--augmentation', default="None", type=str, tunable=True,
                           options=["Erosion", "Gaussian", "None", "Median"])
    parent_parser.add_argument('-e', '--evaluate', dest='evaluate', action='store_true',
                               help='evaluate model on validation set')
    parent_parser.add_argument("--model_name", metavar="model_name", type=str, default=None,
                               help="Name of model from model_enum")
    hparams = parent_parser.parse_args()
    hparams.optimize_parallel_gpu(main, gpu_ids=['0'], max_nb_trials=1)
```
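Separately from the pickle error, note that test_tube invokes the trial function as `train_function(trial_params, gpu_id_set)` (see the `argparse_hopt.py` frame in the earlier traceback). A hedged sketch of a two-argument entry point that forwards the assigned GPU to the `Trainer` could look like the following; it reuses `EarlyStopping`, `Trainer` and `ParkinsonDecisionSystem` as imported in the snippet above and assumes each trial is handed a single GPU id:

```python
# Sketch only: a worker whose signature matches test_tube's call
# train_function(trial_params, gpu_id_set), reusing EarlyStopping, Trainer and
# ParkinsonDecisionSystem as imported in the snippet above.
def main(hparams, gpu_id_set):
    early_stopping = EarlyStopping('val_acc', patience=20)
    trainer = Trainer(
        max_nb_epochs=50,
        gpus=[int(gpu_id_set)],            # use the GPU assigned to this trial
        early_stop_callback=early_stopping,
    )
    system = ParkinsonDecisionSystem(hparams)
    trainer.fit(system)
```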
I was able to get this working, though I can't remember exactly all the steps. My code is available here: https://github.com/jtamir/deepinpy/blob/master/main.py#L113
Things I remember being important:
@jtamir
Thank you for your quick response. Luckily, I had already solved this myself even sooner.
You mentioned everything except one thing: in test_tube I had to remove the nested functions in `optimize_parallel_gpu` (where the pool creates new processes), because pickle can't handle them. That removed the error `AttributeError: Can't pickle local object 'HyperOptArgumentParser.optimize_parallel_gpu.<locals>.init'`.
After that I followed all your steps and it works!
Maybe I will write a blog post about it or something... We should also make a pull request for test_tube (I will look into it...).
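To make the pickling point concrete, here is a small standalone illustration (not test_tube's actual code) of why a nested pool initializer breaks under the 'spawn' start method while a module-level one works; `init_worker`, `scale`, `broken` and `working` are made-up names for the demo:

```python
# Standalone illustration of the pickling constraint behind the error above;
# none of this is test_tube code. Under the 'spawn' start method the Pool
# initializer is pickled when worker processes are created, so it must be a
# module-level function, not one nested inside another function.
from multiprocessing import get_context


def init_worker(factor):
    # Module-level initializer: picklable under 'spawn'.
    global _factor
    _factor = factor


def scale(x):
    return x * _factor


def broken(inputs):
    def init(factor):                       # nested function: NOT picklable
        global _factor
        _factor = factor
    ctx = get_context('spawn')
    # Fails with: AttributeError: Can't pickle local object 'broken.<locals>.init'
    with ctx.Pool(2, initializer=init, initargs=(10,)) as pool:
        return pool.map(scale, inputs)


def working(inputs):
    ctx = get_context('spawn')
    with ctx.Pool(2, initializer=init_worker, initargs=(10,)) as pool:
        return pool.map(scale, inputs)


if __name__ == '__main__':
    print(working([1, 2, 3]))   # [10, 20, 30]
```

This is the same mechanism behind the Windows traceback above: 'spawn' is the default start method there, so the locally defined `init` inside `optimize_parallel_gpu` cannot be serialized when the pool starts its workers.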
Using torch==1.13.0 and pytorch-lightning==1.0.8, I get this output:
```
File "/home/xiaoyang/python/envs/taming/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1265, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'LightningDistributedDataParallel' object has no attribute '_sync_params'
```
I am following the guide to optimize hyperparameters over multiple GPUs: https://towardsdatascience.com/trivial-multi-node-training-with-pytorch-lightning-ff75dfb809bd
However, when I run the hyperparam opt, I get the following error:
Based on some reading, it seems to be an issue with initializing CUDA and multiprocessing, with the suggested change of adding `multiprocessing.set_start_method('spawn', force=True)`. Looking at `argparse_hopt.py`, I see that that specific line is commented out. When I uncomment it, I get through that error but hit a pickle error:
Looking for suggestions on what to try, thanks!