yalidu / liir

Learning Individual Intrinsic Reward in MARL
https://papers.nips.cc/paper/8691-liir-learning-individual-intrinsic-reward-in-multi-agent-reinforcement-learning

liir doesn't learn anything in the Capture Target domain #7

Closed yuchen-x closed 4 years ago

yuchen-x commented 4 years ago

Hi,

I ran liir in the Capture Target domain, where two agents have to capture a moving target simultaneously in a grid world with only a +1 terminal reward. I did a decent amount of hyper-parameter tuning, but it doesn't learn anything.

I found the "mask_alive" (line68 liir_learner.py) made all available actions to be 0, which cause the log_pi_taken (line 99 liir_learner.py) to be 0 in the end. So there was no gradient at all. Is this a bug, or any other suggestion?

Thanks!

yalidu commented 4 years ago

Hi,

The sparse-reward setting was not considered when developing LIIR. I think the poor performance comes from the fact that the learning of the intrinsic reward relies on the extrinsic reward, which is sparse in your setting. If you are interested, a heuristic reward-shaping method such as RND may be helpful for the sparse-reward setting.
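For reference, a minimal sketch of such an RND-style bonus (illustrative only; none of these names exist in this repo) could look like:

```python
# Illustrative RND-style intrinsic bonus (Burda et al.), not part of the LIIR
# codebase: the bonus is the prediction error of a trained predictor against
# a fixed, randomly initialized target network.
import torch
import torch.nn as nn

obs_dim, hidden, embed_dim = 32, 128, 64

target = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(), nn.Linear(hidden, embed_dim))
predictor = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(), nn.Linear(hidden, embed_dim))
for p in target.parameters():
    p.requires_grad_(False)             # the target network stays fixed

opt = torch.optim.Adam(predictor.parameters(), lr=1e-4)

def intrinsic_reward(obs):
    """obs: [batch, obs_dim] -> per-state novelty bonus of shape [batch]."""
    err = (predictor(obs) - target(obs)).pow(2).mean(dim=-1)
    opt.zero_grad()
    err.mean().backward()               # train the predictor on visited states
    opt.step()
    return err.detach()                 # high for rarely visited states
```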

That ''mask_alive'' is used to keep only the alive agents when calculating the losses, so it should not be a problem caused by sparse reward. It's worth checking that variable before it is fed to the replay buffer.

Thanks!

yuchen-x commented 4 years ago

Hi,

Thank you very much for the explanation!

mask_alive = 1.0 - avail_actions1[:, :, 0]

I guess you are using the availability of each agent's first action to check whether the agent is still alive, which works for SMAC but not for general multi-agent envs. In my case, all agents are always alive and all actions are available at every step, so this operation masks out every (alive) agent, which leads to the zero gradient.

After modifying it to

mask_alive = 1.0 * avail_actions1[:, :, 0]

liir achieves reasonable performance.
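A quick check of the two expressions (illustrative only; it assumes an env like Capture Target where every action, including index 0, is always available, unlike SMAC's no-op convention):

```python
import torch

avail_actions1 = torch.ones(3, 2, 5)        # [batch, n_agents, n_actions], all available

original = 1.0 - avail_actions1[:, :, 0]    # all zeros: every agent treated as dead
modified = 1.0 * avail_actions1[:, :, 0]    # all ones:  every agent kept

print(original.unique(), modified.unique()) # tensor([0.]) tensor([1.])
```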

Thanks!