Hi, @a3626a
1) Both action sampling methods are OK, see https://arxiv.org/pdf/1709.02878.pdf. I haven't tried the random sampling method, and it may not work in this project.
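For reference, here is a minimal sketch contrasting the two strategies, assuming `non_spatial_action` is the softmax policy output over all function ids and `valid_actions` holds the currently valid ids (the helper itself is illustrative, not code from this repo):

```python
import numpy as np

def pick_action(non_spatial_action, valid_actions, greedy=True):
    """Illustrative sketch: greedy vs. stochastic action selection."""
    probs = non_spatial_action[valid_actions]  # keep only valid function ids
    if greedy:
        # Deterministic: take the valid action with the highest policy value.
        return valid_actions[np.argmax(probs)]
    # Stochastic: renormalize over valid actions and sample by probability.
    probs = probs / probs.sum()
    return np.random.choice(valid_actions, p=probs)
```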
2) Use `non_spatial_action[valid_actions]` to mask invalid actions. I think this is just a Python/NumPy indexing question, and you can test it in IPython.
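As a quick IPython check of what that fancy indexing does (the numbers below are made up purely for illustration):

```python
In [1]: import numpy as np

In [2]: non_spatial_action = np.array([0.1, 0.4, 0.05, 0.3, 0.15])  # policy over 5 function ids

In [3]: valid_actions = [0, 3, 4]  # only these ids are valid right now

In [4]: non_spatial_action[valid_actions]  # picks out only the valid entries
Out[4]: array([0.1 , 0.3 , 0.15])

In [5]: valid_actions[np.argmax(non_spatial_action[valid_actions])]  # map back to a function id
Out[5]: 3
```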
In `a3c_agent.step`, the action is chosen by `act_id = valid_actions[np.argmax(non_spatial_action[valid_actions])]`.
However, I think actions should be sampled randomly according to their probabilities, because the `non_spatial_action` and `spatial_action` values are the policy outputs (see this post: https://medium.com/emergent-future/simple-reinforcement-learning-with-tensorflow-part-8-asynchronous-actor-critic-agents-a3c-c88f72a5e9f2).

By the way, it's still not clear when to mask invalid actions (before the softmax, or after it?).
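One common way to handle this (a sketch of the general technique, not necessarily what this repo does): mask the invalid logits *before* the softmax by setting them to a large negative value, so the resulting distribution already assigns essentially zero probability to invalid actions, then sample from it. Masking after the softmax also works as long as you renormalize, as in the earlier sketch. The function below is illustrative:

```python
import numpy as np

def sample_valid_action(logits, valid_actions):
    """Mask invalid actions before the softmax, then sample by probability."""
    masked = np.full_like(logits, -1e9)          # invalid actions get ~zero probability
    masked[valid_actions] = logits[valid_actions]
    probs = np.exp(masked - masked.max())        # numerically stable softmax
    probs /= probs.sum()
    return np.random.choice(len(logits), p=probs)
```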