werner-duvaud / muzero-general

MuZero
https://github.com/werner-duvaud/muzero-general/wiki/MuZero-Documentation

Resume after error #64

Closed: MuMaxAI closed this issue 4 years ago

MuMaxAI commented 4 years ago

Hi, after running for a few days, the training suddenly failed.

2020-08-05 07:56:18,288 ERROR worker.py:1049 -- listen_error_messages_raylet: Connection closed by server.
E0805 07:56:18.288650  6428  6052 task_manager.cc:320] Task failed: IOError: 2: Stream removed: Type=ACTOR_TASK, Language=PYTHON, function_descriptor={type=PythonFunctionDescriptor, module_name=self_play, class_name=SelfPlay, function_name=continuous_self_play, function_hash=}, task_id=6170691ebdfaeef67e0a4dfc0100, job_id=0100, num_args=4, num_returns=2, actor_task_spec={actor_id=7e0a4dfc0100, actor_caller_id=ffffffffffffffffffffffff0100, actor_counter=0}
Windows fatal exception: code 1073807366

E0805 07:56:18.398025  6428  6052 task_manager.cc:320] Task failed: IOError: 2: Stream removed: Type=ACTOR_TASK, Language=PYTHON, function_descriptor={type=PythonFunctionDescriptor, module_name=trainer, class_name=Trainer, function_name=continuous_update_weights, function_hash=}, task_id=cd8f5689d0aa5a3945b95b1c0100, job_id=0100, num_args=4, num_returns=2, actor_task_spec={actor_id=45b95b1c0100, actor_caller_id=ffffffffffffffffffffffff0100, actor_counter=0}
Windows fatal exception: code 1073807366

2020-08-05 07:56:18,288 ERROR import_thread.py:93 -- ImportThread: Connection closed by server.
2020-08-05 07:56:18,273 ERROR worker.py:949 -- print_logs: Connection closed by server.
E0805 07:56:18.413650  6428  6052 task_manager.cc:320] Task failed: IOError: 2: Stream removed: Type=ACTOR_TASK, Language=PYTHON, function_descriptor={type=PythonFunctionDescriptor, module_name=self_play, class_name=SelfPlay, function_name=continuous_self_play, function_hash=}, task_id=6f53dca1f451ca9444ee453c0100, job_id=0100, num_args=4, num_returns=2, actor_task_spec={actor_id=44ee453c0100, actor_caller_id=ffffffffffffffffffffffff0100, actor_counter=0}
Windows fatal exception: code 1073807366

E0805 07:56:18.429273  6428  6052 task_manager.cc:320] Task failed: IOError: 2: Stream removed: Type=ACTOR_TASK, Language=PYTHON, function_descriptor={type=PythonFunctionDescriptor, module_name=self_play, class_name=SelfPlay, function_name=continuous_self_play, function_hash=}, task_id=9fc77bf30b43899755c3b2b60100, job_id=0100, num_args=6, num_returns=2, actor_task_spec={actor_id=55c3b2b60100, actor_caller_id=ffffffffffffffffffffffff0100, actor_counter=0}
Windows fatal exception: code 1073807366

E0805 07:56:18.429273  6428  6052 task_manager.cc:320] Task failed: IOError: 14: failed to connect to all addresses: Type=ACTOR_TASK, Language=PYTHON, function_descriptor={type=PythonFunctionDescriptor, module_name=shared_storage, class_name=SharedStorage, function_name=get_info, function_hash=}, task_id=9a5697535e67ba54ef0a6c220100, job_id=0100, num_args=0, num_returns=2, actor_task_spec={actor_id=ef0a6c220100, actor_caller_id=ffffffffffffffffffffffff0100, actor_counter=372967}
Windows fatal exception: code 1073807366

2020-08-05 07:56:18,538 WARNING worker.py:879 -- Could not push exception to redis.
Traceback (most recent call last):
  File "muzero.py", line 312, in <module>
    muzero.train()
  File "muzero.py", line 103, in train
    self._logging_loop(shared_storage_worker, replay_buffer_worker)
  File "muzero.py", line 154, in _logging_loop
    info = ray.get(shared_storage_worker.get_info.remote())
  File "C:\Program Files\Python36\lib\site-packages\ray\worker.py", line 1476, in get
    raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
F0805 07:56:24.171499  6428  6052 service_based_gcs_client.cc:104]  Check failed: num_attempts < RayConfig::instance().gcs_service_connect_retries() No entry found for GcsServerAddress
Windows fatal exception: code 1073807366

*** Check failure stack trace: ***
    @   00007FFE76A94A4C  public: __cdecl google::LogMessage::~LogMessage(void) __ptr64
    @   00007FFE76909914  public: virtual __cdecl google::NullStreamFatal::~NullStreamFatal(void) __ptr64
    @   00007FFE76953317  public: void __cdecl google::NullStreamFatal::`vbase destructor'(void) __ptr64
    @   00007FFE76952A20  public: void __cdecl google::NullStreamFatal::`vbase destructor'(void) __ptr64
    @   00007FFE76958EC1  public: void __cdecl google::NullStreamFatal::`vbase destructor'(void) __ptr64
    @   00007FFE76AD9ACC  void __cdecl google::InstallFailureWriter(void (__cdecl*)(char const * __ptr64,int))
    @   00007FFE76AD98EF  void __cdecl google::InstallFailureWriter(void (__cdecl*)(char const * __ptr64,int))
    @   00007FFE768AD636  public: class google::LogMessageVoidify & __ptr64 __cdecl google::LogMessageVoidify::operator=(class google::LogMessageVoidify const & __ptr64) __ptr64
    @   00007FFE7687E4FD  public: class google::LogMessageVoidify & __ptr64 __cdecl google::LogMessageVoidify::operator=(class google::LogMessageVoidify const & __ptr64) __ptr64
    @   00007FFE76927970  public: void __cdecl google::NullStreamFatal::`vbase destructor'(void) __ptr64
    @   00007FFE76921510  public: void __cdecl google::NullStreamFatal::`vbase destructor'(void) __ptr64
    @   00007FFE7692145B  public: void __cdecl google::NullStreamFatal::`vbase destructor'(void) __ptr64
    @   00007FFE768B3D41  public: class google::LogMessageVoidify & __ptr64 __cdecl google::LogMessageVoidify::operator=(class google::LogMessageVoidify const & __ptr64) __ptr64
    @   00007FFE7687E3F9  public: class google::LogMessageVoidify & __ptr64 __cdecl google::LogMessageVoidify::operator=(class google::LogMessageVoidify const & __ptr64) __ptr64
    @   00007FFE99351FFA  _o_exp
    @   00007FFE9AA87974  BaseThreadInitThunk
    @   00007FFE9C6EA271  RtlUserThreadStart

I still have the model and the events.out.tfevents.XXXXXXXXX.user-PC.XXXX file. Is there any way to resume the training?

ahainaut commented 4 years ago

Hi, it depends on the optimizer you used during training. If you used SGD, choose the "Load pretrained model" option after selecting your environment, then enter the path of your model and, optionally, of your replay_buffer, both saved in the results folder corresponding to your training. You can then choose the "Train" option. If you used Adam, resuming is currently not possible: the current version doesn't save the optimizer's state, and Adam's internal parameters change during training. We are working on adding this feature.
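For context, here is a minimal sketch of how optimizer state can be checkpointed and restored so that an Adam run could be resumed. It uses plain PyTorch with a placeholder model and a hypothetical checkpoint path, not the repository's actual classes or save format:

```python
import torch

# Placeholder model/optimizer pair standing in for the MuZero networks;
# the repository's actual classes and checkpoint layout may differ.
model = torch.nn.Linear(8, 4)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Saving: store the optimizer state alongside the weights. Adam keeps
# per-parameter moment estimates, so without this state a resumed run
# effectively restarts the optimizer from scratch.
torch.save(
    {
        "weights": model.state_dict(),
        "optimizer_state": optimizer.state_dict(),
    },
    "results/checkpoint.pt",  # hypothetical path
)

# Resuming: restore both dictionaries before continuing training.
checkpoint = torch.load("results/checkpoint.pt")
model.load_state_dict(checkpoint["weights"])
optimizer.load_state_dict(checkpoint["optimizer_state"])
```

Plain SGD without momentum carries no per-parameter state beyond the weights themselves, which is why resuming from the model file alone is possible in that case.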

MuMaxAI commented 4 years ago

Unfortunately, the trainer crashed before writing the replay_buffer.
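One way to avoid losing the replay buffer to a crash like this is to snapshot it to disk at a fixed interval during training. Below is a minimal sketch assuming the buffer is an in-memory, picklable Python object; the save_buffer and periodic_checkpoint helpers are illustrative names, not the repository's API:

```python
import copy
import pickle
import time
from pathlib import Path


def save_buffer(replay_buffer, results_dir="results"):
    """Pickle a snapshot of the replay buffer so it survives a crash."""
    path = Path(results_dir) / "replay_buffer.pkl"
    # Copy first so the snapshot stays consistent even if self-play keeps appending.
    snapshot = copy.deepcopy(replay_buffer)
    with open(path, "wb") as f:
        pickle.dump(snapshot, f)
    return path


def periodic_checkpoint(replay_buffer, interval_seconds=1800, stop_flag=lambda: False):
    """Save the buffer every interval_seconds until the caller signals a stop."""
    while not stop_flag():
        save_buffer(replay_buffer)
        time.sleep(interval_seconds)
```

With something like this running alongside training, a crash would cost at most the games collected since the last snapshot rather than the whole buffer.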