werner-duvaud / muzero-general

MuZero
https://github.com/werner-duvaud/muzero-general/wiki/MuZero-Documentation
MIT License

training result cannot be loaded on another machine #180

Open hairinwind opened 2 years ago

hairinwind commented 2 years ago

I don't have a GPU on my local machine, so I rented a Google Cloud VM with a GPU and trained a couple of games there, e.g. cartpole.
I copied the results to my local machine and tried to load the model and then run "4. Play against MuZero" locally, but I got this error when loading the model:

Enter a number to choose a model to load: 0
Traceback (most recent call last):
  File "/home/yao/anaconda3/lib/python3.7/tarfile.py", line 187, in nti
    n = int(s.strip() or "0", 8)
ValueError: invalid literal for int() with base 8: 'rebuild_'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/yao/anaconda3/lib/python3.7/tarfile.py", line 2289, in next
    tarinfo = self.tarinfo.fromtarfile(self)
  File "/home/yao/anaconda3/lib/python3.7/tarfile.py", line 1095, in fromtarfile
    obj = cls.frombuf(buf, tarfile.encoding, tarfile.errors)
  File "/home/yao/anaconda3/lib/python3.7/tarfile.py", line 1037, in frombuf
    chksum = nti(buf[148:156])
  File "/home/yao/anaconda3/lib/python3.7/tarfile.py", line 189, in nti
    raise InvalidHeaderError("invalid header")
tarfile.InvalidHeaderError: invalid header

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/yao/anaconda3/lib/python3.7/site-packages/torch/serialization.py", line 595, in _load
    return legacy_load(f)
  File "/home/yao/anaconda3/lib/python3.7/site-packages/torch/serialization.py", line 506, in legacy_load
    with closing(tarfile.open(fileobj=f, mode='r:', format=tarfile.PAX_FORMAT)) as tar, \
  File "/home/yao/anaconda3/lib/python3.7/tarfile.py", line 1591, in open
    return func(name, filemode, fileobj, **kwargs)
  File "/home/yao/anaconda3/lib/python3.7/tarfile.py", line 1621, in taropen
    return cls(name, mode, fileobj, **kwargs)
  File "/home/yao/anaconda3/lib/python3.7/tarfile.py", line 1484, in __init__
    self.firstmember = self.next()
  File "/home/yao/anaconda3/lib/python3.7/tarfile.py", line 2301, in next
    raise ReadError(str(e))
tarfile.ReadError: invalid header

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "muzero.py", line 691, in <module>
    load_model_menu(muzero, game_name)
  File "muzero.py", line 597, in load_model_menu
    checkpoint_path=checkpoint_path, replay_buffer_path=replay_buffer_path,
  File "muzero.py", line 417, in load_model
    self.checkpoint = torch.load(checkpoint_path)
  File "/home/yao/anaconda3/lib/python3.7/site-packages/torch/serialization.py", line 426, in load
    return _load(f, map_location, pickle_module, **pickle_load_args)
  File "/home/yao/anaconda3/lib/python3.7/site-packages/torch/serialization.py", line 599, in _load
    raise RuntimeError("{} is a zip archive (did you mean to use torch.jit.load()?)".format(f.name))
RuntimeError: results/cartpole/2021-12-30--00-05-02/model.checkpoint is a zip archive (did you mean to use torch.jit.load()?)

Thanks for any help!

ahainaut commented 2 years ago

Hello @hairinwind, I have successfully trained a model on a machine with a GPU and loaded it on a machine without one, so I don't think the GPU is the problem. However, you should make sure that the same version of PyTorch is installed on both the training machine and the loading machine. Hope this helps!
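
To make the version advice concrete: starting with PyTorch 1.6, torch.save writes a zip-based file by default, and older releases cannot read it. That would match the "is a zip archive" message in the traceback above. The sketch below is not part of muzero-general; the helper names (checkpoint_format, parse_version, format_is_loadable) are made up for illustration. It only checks the file format against a version string, using the standard library, and says nothing about whether the checkpoint contents are otherwise compatible.

```python
import zipfile


def checkpoint_format(path):
    """Return "zip" for the zip-based torch.save format, else "legacy"."""
    return "zip" if zipfile.is_zipfile(path) else "legacy"


def parse_version(version):
    """Turn a version string such as "1.5.0+cu101" into (1, 5)."""
    major, minor = version.split("+")[0].split(".")[:2]
    return int(major), int(minor)


def format_is_loadable(path, torch_version):
    """Check only the file format: zip checkpoints need PyTorch >= 1.6.

    Assumption: a legacy-format file is treated as readable by any version.
    """
    if checkpoint_format(path) == "legacy":
        return True
    return parse_version(torch_version) >= (1, 6)
```

On the local machine this could be called as format_is_loadable("results/cartpole/2021-12-30--00-05-02/model.checkpoint", torch.__version__). If it returns False, two options that should work are upgrading the local PyTorch to match the training machine, or re-saving the checkpoint on the training machine with torch.save(checkpoint, path, _use_new_zipfile_serialization=False). Separately, when a checkpoint holds CUDA tensors and the loading machine has no GPU, torch.load(checkpoint_path, map_location="cpu") is the usual way to load it.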