[Week09] Migrate to gymnasium

yandexdataschool / Practical_RL

A course in reinforcement learning in the wild

The Unlicense

[Week09] Migrate to gymnasium #533

Closed laktionov closed 1 year ago

laktionov commented 1 year ago

Support the new interfaces in all files and notebooks.
When iterating over an env, replace done with terminated or truncated.
In EnvRunner, set done to terminated or truncated, since the next state comes from the next episode.
Update the video recording code.
Replace pybullet-gym with gymnasium[mujoco].
Update documentation links.
Remove assert 0 < np.mean(is_dones) < 0.1 in hw-continuous-control_pytorch.ipynb, since is_done now equals terminated only.
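The done handling described above can be sketched as follows. This is a minimal illustration, not the code from the PR: ToyEnv, collect_episode, and td_target are hypothetical names, and ToyEnv only mimics the Gymnasium reset/step return signatures so that the snippet runs without any dependencies.

```python
class ToyEnv:
    """Hypothetical stand-in that mimics the Gymnasium reset/step signatures."""

    def __init__(self, goal=3, time_limit=5):
        self.goal = goal              # reaching this position is a true terminal state
        self.time_limit = time_limit  # step cap, similar to what a time-limit wrapper imposes
        self.pos = 0
        self.t = 0

    def reset(self, seed=None):
        self.pos, self.t = 0, 0
        return self.pos, {}           # Gymnasium-style: (observation, info)

    def step(self, action):
        self.pos += action
        self.t += 1
        terminated = self.pos >= self.goal      # the task itself ended
        truncated = self.t >= self.time_limit   # the episode was cut off by the step cap
        reward = 1.0 if terminated else 0.0
        return self.pos, reward, terminated, truncated, {}


def collect_episode(env, policy):
    """Roll out one episode, storing `terminated` (not `done`) with each transition."""
    transitions = []
    obs, _ = env.reset()
    done = False
    while not done:
        action = policy(obs)
        next_obs, reward, terminated, truncated, _ = env.step(action)
        transitions.append((obs, action, reward, next_obs, terminated))
        done = terminated or truncated  # either flag ends the iteration over the env
        obs = next_obs
    return transitions


def td_target(reward, next_value, terminated, gamma=0.99):
    """One-step TD target: bootstrap from next_value unless the state was terminal."""
    return reward + gamma * (1.0 - float(terminated)) * next_value
```

With a policy that always returns 0, the toy episode ends by truncation after five steps and every stored flag is False, which is why an assertion on the mean of is_dones no longer makes sense once is_done tracks terminated only.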

laktionov commented 1 year ago

Hi, thank you for your question. The main reason for switching is an exception raised when trying to use the old versions of the MuJoCo envs: DeprecatedEnv: Environment version v0 for Ant is deprecated. Please use Ant-v4 instead.

Speaking about the possible issues:

PPO still achieves a total reward of 1500 on HalfCheetah-v4; I haven't noticed any differences in reward or observation design. https://gymnasium.farama.org/environments/mujoco/half_cheetah/
TD3 and SAC achieve total rewards of 2200 and 3800, respectively, on Ant-v4, which differs from Ant-v0 at least in the reward design (contact_cost is excluded on Ant-v4). We should probably adjust the reward threshold. https://gymnasium.farama.org/environments/mujoco/ant/
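To make the reward difference concrete, here is a rough sketch of how the Ant reward terms combine, assuming the decomposition given on the linked Gymnasium page (healthy and forward rewards minus control cost, with the contact cost subtracted only when contact forces are enabled). The function ant_reward and the numbers passed to it are illustrative assumptions, not code from the environment.

```python
def ant_reward(healthy_reward, forward_reward, ctrl_cost, contact_cost,
               use_contact_forces=False):
    """Illustrative combination of the Ant reward terms (not the env's own code)."""
    reward = healthy_reward + forward_reward - ctrl_cost
    if use_contact_forces:
        # Assumed behaviour: the contact penalty applies only when contact forces are on.
        reward -= contact_cost
    return reward


# Made-up per-step values, only to show the effect of dropping the contact term.
without_contact = ant_reward(1.0, 2.0, 0.5, 0.25)
with_contact = ant_reward(1.0, 2.0, 0.5, 0.25, use_contact_forces=True)
```

Because the penalty term is dropped, the same behaviour scores slightly higher per step, so totals on Ant-v4 are not directly comparable with those on Ant-v0.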

dniku commented 1 year ago

I'm somewhat confused as I don't see any thresholds in the notebook -- which thresholds are you referring to?

laktionov commented 1 year ago

Sorry, I meant the thresholds required to fully complete the assignments:

In ppo.ipynb

In one million of interactions it should be possible to achieve the total raw reward of about 1500

In hw-continuous-control_pytorch.ipynb

Your goal is to reach at least 1000 average reward during evaluation after training in this ant environment (since this is a new hometask, this threshold might be updated, so at least just see if your ant learned to walk in the rendered simulation)
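A check against such a threshold might look like the sketch below. It is only an illustration: ConstantRewardEnv and evaluate are hypothetical names, the toy env stands in for the real Ant environment so the snippet runs without MuJoCo, and the notebooks' actual evaluation code may differ.

```python
import statistics


class ConstantRewardEnv:
    """Hypothetical env with Gymnasium-style signatures: fixed reward, fixed horizon."""

    def __init__(self, reward=1.0, horizon=10):
        self.reward = reward
        self.horizon = horizon
        self.t = 0

    def reset(self, seed=None):
        self.t = 0
        return 0, {}

    def step(self, action):
        self.t += 1
        truncated = self.t >= self.horizon  # the episode ends only via the step cap
        return 0, self.reward, False, truncated, {}


def evaluate(env, policy, n_episodes=10):
    """Return the mean undiscounted episode return over n_episodes."""
    returns = []
    for _ in range(n_episodes):
        obs, _ = env.reset()
        total, done = 0.0, False
        while not done:
            obs, reward, terminated, truncated, _ = env.step(policy(obs))
            total += reward
            done = terminated or truncated
        returns.append(total)
    return statistics.mean(returns)


mean_return = evaluate(ConstantRewardEnv(reward=2.0, horizon=5), lambda obs: 0, n_episodes=3)
passed = mean_return >= 1000  # threshold quoted from the assignment text
```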

dniku commented 1 year ago

I'm not sure what the reward was on v0 with TD3/SAC -- however, it's probably fine to submit the notebooks as-is if reward>1000 correlates with behavior that is much better than random.

Is reward>1000 indicative of the agent performing well?

laktionov commented 1 year ago

I've checked an SAC agent that achieves a reward of 1063; it seems to perform well based on the video recording.

dniku commented 1 year ago

Thanks!