Open · jtamir opened this issue 5 years ago
Also having this one
I might need to update that post, but run these demos instead:
https://github.com/williamFalcon/pytorch-lightning/tree/master/examples/multi_node_examples
Could you post your code and error?
I'm kind of looking for a way to evaluate each set of hyperparams on a separate GPU in parallel, not train a single model on multiple GPUs. I've tried this:
```python
def train_one(hparam, gpu_id_set):
    # load data, create model, create logger and checkpoint callback
    trainer = Trainer(logger=tt_logger, checkpoint_callback=checkpoint_callback,
                      gpus=[int(gpu_id_set)], max_nb_epochs=hparam.epochs, weights_summary=None)
    trainer.fit(model)
    trainer.test(model)

hparams.optimize_parallel_gpu(train_one, max_nb_trials=400, gpu_ids=gpu_ids)
```
Here `gpu_ids` is `['0', '1']`.
And here's the output:
```
gpu available: True, used: True
VISIBLE GPUS: 0
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1565272271120/work/aten/src/THC/THCGeneral.cpp line=54 error=3 : initialization error
Caught exception in worker thread cuda runtime error (3) : initialization error at /opt/conda/conda-bld/pytorch_1565272271120/work/aten/src/THC/THCGeneral.cpp:54
Traceback (most recent call last):
File "/home/akonstantinov/.miniconda3/envs/cpn/lib/python3.7/site-packages/test_tube/argparse_hopt.py", line 37, in optimize_parallel_gpu_private
results = train_function(trial_params, gpu_id_set)
File "main.py", line 40, in train_one
trainer.fit(model)
File "/home/akonstantinov/.miniconda3/envs/cpn/lib/python3.7/site-packages/pytorch_lightning-0.5.1.3-py3.7.egg/pytorch_lightning/trainer/trainer.py", line 754, in fit
self.__single_gpu_train(model)
File "/home/akonstantinov/.miniconda3/envs/cpn/lib/python3.7/site-packages/pytorch_lightning-0.5.1.3-py3.7.egg/pytorch_lightning/trainer/trainer.py", line 793, in __singl
e_gpu_train
model.cuda(self.root_gpu)
File "/home/akonstantinov/.miniconda3/envs/cpn/lib/python3.7/site-packages/torch/nn/modules/module.py", line 311, in cuda
return self._apply(lambda t: t.cuda(device))
File "/home/akonstantinov/.miniconda3/envs/cpn/lib/python3.7/site-packages/torch/nn/modules/module.py", line 208, in _apply
module._apply(fn)
File "/home/akonstantinov/.miniconda3/envs/cpn/lib/python3.7/site-packages/torch/nn/modules/module.py", line 208, in _apply
module._apply(fn)
File "/home/akonstantinov/.miniconda3/envs/cpn/lib/python3.7/site-packages/torch/nn/modules/module.py", line 230, in _apply
param_applied = fn(param)
File "/home/akonstantinov/.miniconda3/envs/cpn/lib/python3.7/site-packages/torch/nn/modules/module.py", line 311, in <lambda>
return self._apply(lambda t: t.cuda(device))
File "/home/akonstantinov/.miniconda3/envs/cpn/lib/python3.7/site-packages/torch/cuda/__init__.py", line 179, in _lazy_init
torch._C._cuda_init()
RuntimeError: cuda runtime error (3) : initialization error at /opt/conda/conda-bld/pytorch_1565272271120/work/aten/src/THC/THCGeneral.cpp:54
```
After this, the exact same message appears again but for GPU 1, and then the process seems to hang. When I kill it with Ctrl-C, it also outputs:
```
Traceback (most recent call last):
File "main.py", line 65, in <module>
hparams.optimize_parallel_gpu(train_one, max_nb_trials=400, gpu_ids=gpu_ids)
File "/home/akonstantinov/.miniconda3/envs/cpn/lib/python3.7/site-packages/test_tube/argparse_hopt.py", line 323, in optimize_parallel_gpu
results = self.pool.map(optimize_parallel_gpu_private, self.trials)
File "/home/akonstantinov/.miniconda3/envs/cpn/lib/python3.7/multiprocessing/pool.py", line 268, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "/home/akonstantinov/.miniconda3/envs/cpn/lib/python3.7/multiprocessing/pool.py", line 651, in get
self.wait(timeout)
File "/home/akonstantinov/.miniconda3/envs/cpn/lib/python3.7/multiprocessing/pool.py", line 648, in wait
self._event.wait(timeout)
File "/home/akonstantinov/.miniconda3/envs/cpn/lib/python3.7/threading.py", line 552, in wait
signaled = self._cond.wait(timeout)
File "/home/akonstantinov/.miniconda3/envs/cpn/lib/python3.7/threading.py", line 296, in wait
waiter.acquire()
KeyboardInterrupt
```

Any help?
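For reference, a minimal sketch of this pattern that sidesteps test_tube's pool entirely: one spawn-started process per GPU, each evaluating its share of the hyperparameter sets sequentially, so CUDA is never initialized in a forked child. `build_model` and `hparam_list` are hypothetical placeholders, and the `Trainer` arguments mirror the 0.5.x-era API used above.

```python
# Sketch only (not test_tube's mechanism): one spawn-started process per GPU,
# each evaluating its share of the hyperparameter sets sequentially.
# build_model and hparam_list are hypothetical placeholders.
from multiprocessing import get_context

from pytorch_lightning import Trainer


def run_trials_on_gpu(gpu_id, hparam_subset):
    # All CUDA work happens inside the child process, never in the parent.
    for hparam in hparam_subset:
        model = build_model(hparam)                       # hypothetical helper
        trainer = Trainer(gpus=[gpu_id], max_nb_epochs=hparam.epochs,
                          weights_summary=None)
        trainer.fit(model)
        trainer.test(model)


if __name__ == '__main__':
    gpu_ids = [0, 1]
    hparam_list = []              # placeholder: one hyperparameter set per trial
    ctx = get_context('spawn')    # fresh interpreters, no fork/CUDA clash
    procs = []
    for i, gpu_id in enumerate(gpu_ids):
        subset = hparam_list[i::len(gpu_ids)]             # round-robin split
        p = ctx.Process(target=run_trials_on_gpu, args=(gpu_id, subset))
        p.start()
        procs.append(p)
    for p in procs:
        p.join()
```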
I'm having a similar issue, but without multi-node (I have two GPUs in my PC).
```
Traceback (most recent call last):
File "D:/Users//MFT/MFT/simulation_runner.py", line 40, in <module>
hparams.optimize_parallel_gpu('test', gpu_ids=['0'], max_nb_trials=1)
File "D:\Users\\anaconda3\envs\pytorch\lib\site-packages\test_tube\argparse_hopt.py", line 322, in optimize_parallel_gpu
self.pool = Pool(processes=nb_workers, initializer=init, initargs=(gpu_q,))
File "D:\Users\\anaconda3\envs\pytorch\lib\multiprocessing\context.py", line 119, in Pool
context=self.get_context())
File "D:\Users\\anaconda3\envs\pytorch\lib\multiprocessing\pool.py", line 174, in __init__
self._repopulate_pool()
File "D:\Users\\anaconda3\envs\pytorch\lib\multiprocessing\pool.py", line 239, in _repopulate_pool
w.start()
File "D:\Users\\anaconda3\envs\pytorch\lib\multiprocessing\process.py", line 105, in start
self._popen = self._Popen(self)
File "D:\Users\\anaconda3\envs\pytorch\lib\multiprocessing\context.py", line 322, in _Popen
return Popen(process_obj)
File "D:\Users\\anaconda3\envs\pytorch\lib\multiprocessing\popen_spawn_win32.py", line 65, in __init__
reduction.dump(process_obj, to_child)
File "D:\Users\\anaconda3\envs\pytorch\lib\multiprocessing\reduction.py", line 60, in dump
ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'HyperOptArgumentParser.optimize_parallel_gpu.<locals>.init'
```
and here is the entry point:
```python
def main(hparams):
    early_stopping = EarlyStopping('val_acc', patience=20)
    trainer = Trainer(
        max_nb_epochs=50,
        gpus=[0],
        early_stop_callback=early_stopping,
        train_percent_check=1,
        check_val_every_n_epoch=1,
        val_percent_check=1
    )
    system = ParkinsonDecisionSystem(hparams)
    if hparams.evaluate:
        trainer.run_evaluation()
    else:
        trainer.fit(system)


if __name__ == '__main__':
    parent_parser = HyperOptArgumentParser(strategy='grid_search')
    parent_parser.opt_list('--augmentation', default="None", type=str, tunable=True,
                           options=["Erosion", "Gaussian", "None", "Median"])
    parent_parser.add_argument('-e', '--evaluate', dest='evaluate', action='store_true',
                               help='evaluate model on validation set')
    parent_parser.add_argument("--model_name", metavar="model_name", type=str, default=None,
                               help="Name of model from model_enum")
    hparams = parent_parser.parse_args()
    hparams.optimize_parallel_gpu(main, gpu_ids=['0'], max_nb_trials=1)
```
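Separately from the pickle error, note that test_tube invokes the trial function as `train_function(trial_params, gpu_id_set)` (see the `argparse_hopt.py` frame in the earlier traceback). A hedged sketch of a two-argument entry point that forwards the assigned GPU to the `Trainer` could look like the following; it reuses `EarlyStopping`, `Trainer` and `ParkinsonDecisionSystem` as imported in the snippet above and assumes each trial is handed a single GPU id:

```python
# Sketch only: a worker whose signature matches test_tube's call
# train_function(trial_params, gpu_id_set), reusing EarlyStopping, Trainer and
# ParkinsonDecisionSystem as imported in the snippet above.
def main(hparams, gpu_id_set):
    early_stopping = EarlyStopping('val_acc', patience=20)
    trainer = Trainer(
        max_nb_epochs=50,
        gpus=[int(gpu_id_set)],            # use the GPU assigned to this trial
        early_stop_callback=early_stopping,
    )
    system = ParkinsonDecisionSystem(hparams)
    trainer.fit(system)
```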
I was able to get this working, though I can't remember exactly all the steps. My code is available here: https://github.com/jtamir/deepinpy/blob/master/main.py#L113
Things I remember being important:
@jtamir
Thank you for your quick response. Luckily, I had already solved this myself even sooner.
You mentioned everything except one thing: in test_tube I had to remove the nested functions in `optimize_parallel_gpu` (where the pool creates new processes), because pickle can't handle them. That removed the error `AttributeError: Can't pickle local object 'HyperOptArgumentParser.optimize_parallel_gpu.<locals>.init'`.
After that I followed all your steps and it works!
Maybe I will write a blog post about it or something... We should also make a pull request for test_tube (I will look into it...).
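To make the pickling point concrete, here is a small standalone illustration (not test_tube's actual code) of why a nested pool initializer breaks under the 'spawn' start method while a module-level one works; `init_worker`, `scale`, `broken` and `working` are made-up names for the demo:

```python
# Standalone illustration of the pickling constraint behind the error above;
# none of this is test_tube code. Under the 'spawn' start method the Pool
# initializer is pickled when worker processes are created, so it must be a
# module-level function, not one nested inside another function.
from multiprocessing import get_context


def init_worker(factor):
    # Module-level initializer: picklable under 'spawn'.
    global _factor
    _factor = factor


def scale(x):
    return x * _factor


def broken(inputs):
    def init(factor):                       # nested function: NOT picklable
        global _factor
        _factor = factor
    ctx = get_context('spawn')
    # Fails with: AttributeError: Can't pickle local object 'broken.<locals>.init'
    with ctx.Pool(2, initializer=init, initargs=(10,)) as pool:
        return pool.map(scale, inputs)


def working(inputs):
    ctx = get_context('spawn')
    with ctx.Pool(2, initializer=init_worker, initargs=(10,)) as pool:
        return pool.map(scale, inputs)


if __name__ == '__main__':
    print(working([1, 2, 3]))   # [10, 20, 30]
```

This is the same mechanism behind the Windows traceback above: 'spawn' is the default start method there, so the locally defined `init` inside `optimize_parallel_gpu` cannot be serialized when the pool starts its workers.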
Using torch==1.13.0 and pytorch-lightning==1.0.8, I get this output:
```
File "/home/xiaoyang/python/envs/taming/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1265, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'LightningDistributedDataParallel' object has no attribute '_sync_params'
```
I am following the guide to optimize hyperparameters over multiple GPUs: https://towardsdatascience.com/trivial-multi-node-training-with-pytorch-lightning-ff75dfb809bd
However, when I run the hyperparam opt, I get the following error:
Based on some reading, it seems to be an issue with initializing CUDA and multiprocessing, with the suggested change of adding `multiprocessing.set_start_method('spawn', force=True)`. Looking at `argparse_hopt.py`, I see that that specific line is commented out. When I uncomment it, I get through that error but hit a pickle error:
Looking for suggestions on what to try, thanks!