yandexdataschool / Practical_RL

A course in reinforcement learning in the wild
The Unlicense

[Week09] Migrate to gymnasium #533

Closed laktionov closed 1 year ago

laktionov commented 1 year ago

Hi, thank you for your question. The main reason for switching is an exception raised when trying to use the old versions of the MuJoCo envs: `DeprecatedEnv: Environment version v0 for Ant is deprecated. Please use Ant-v4 instead.`

Speaking about the possible issues:

  1. PPO still achieves a total reward of 1500 on HalfCheetah-v4; I haven't noticed any differences in reward or observation design. https://gymnasium.farama.org/environments/mujoco/half_cheetah/
  2. TD3 and SAC achieve total rewards of 2200 and 3800 respectively on Ant-v4, which differs from Ant-v0 at least in the reward design (`contact_cost` is excluded in Ant-v4). We should probably adjust the reward threshold. https://gymnasium.farama.org/environments/mujoco/ant/

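
To make the reward-design difference in item 2 concrete, here is a minimal sketch of the per-step Ant reward as I read the linked Gymnasium documentation: the contact term is only subtracted when `use_contact_forces` is enabled, which is off by default in Ant-v4. The function name `ant_reward` and the numeric values are hypothetical and chosen only for illustration; this is not the environment's actual implementation.

```python
def ant_reward(healthy_reward, forward_reward, ctrl_cost, contact_cost,
               use_contact_forces=False):
    """Sketch of the per-step reward described in the Gymnasium Ant docs."""
    reward = healthy_reward + forward_reward - ctrl_cost
    if use_contact_forces:
        # Assumed to match the older behaviour, where the contact term was included.
        reward -= contact_cost
    return reward


# Hypothetical per-step values, not measured from any environment.
terms = dict(healthy_reward=1.0, forward_reward=0.5,
             ctrl_cost=0.25, contact_cost=0.125)

v4_default = ant_reward(**terms)                             # contact term excluded
with_contact = ant_reward(**terms, use_contact_forces=True)  # contact term included
print(v4_default, with_contact)
```

Since the contact term is a penalty, the same behaviour would score at least as high without it, so totals on Ant-v4 are not directly comparable to totals on Ant-v0.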
dniku commented 1 year ago

I'm somewhat confused as I don't see any thresholds in the notebook -- which thresholds are you referring to?

laktionov commented 1 year ago

Sorry, I meant these thresholds to fully complete the assignments:

In ppo.ipynb:

> In one million of interactions it should be possible to achieve the total raw reward of about 1500

In hw-continuous-control_pytorch.ipynb:

> Your goal is to reach at least 1000 average reward during evaluation after training in this ant environment (since this is a new hometask, this threshold might be updated, so at least just see if your ant learned to walk in the rendered simulation)

dniku commented 1 year ago

I'm not sure what the reward was on v0 with TD3/SAC -- however, it's probably fine to submit the notebooks as-is if reward>1000 correlates with behavior that is much better than random.

Is reward>1000 indicative of the agent performing well?

laktionov commented 1 year ago

I've checked a SAC agent that achieves a reward of 1063; it seems to perform well based on the video recording.
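
For reference, a check like this can be sketched as an evaluation loop against the Gymnasium-style API, where `reset()` returns `(obs, info)` and `step()` returns `(obs, reward, terminated, truncated, info)`. `StubEnv`, `evaluate`, and the constant reward are hypothetical stand-ins so the snippet runs without MuJoCo; this is not the code from the notebooks.

```python
class StubEnv:
    """Hypothetical stand-in that follows the Gymnasium API; not a MuJoCo env."""

    def __init__(self, episode_length=5):
        self.episode_length = episode_length
        self.t = 0

    def reset(self, seed=None):
        self.t = 0
        return 0.0, {}  # (observation, info)

    def step(self, action):
        self.t += 1
        reward = 250.0  # made-up constant reward per step
        terminated = False
        truncated = self.t >= self.episode_length  # time-limit style cutoff
        return float(self.t), reward, terminated, truncated, {}


def evaluate(env, policy, n_episodes=10):
    """Average undiscounted episode return of `policy` over `n_episodes`."""
    returns = []
    for _ in range(n_episodes):
        obs, info = env.reset()
        done, total = False, 0.0
        while not done:
            obs, reward, terminated, truncated, info = env.step(policy(obs))
            total += reward
            done = terminated or truncated
        returns.append(total)
    return sum(returns) / len(returns)


mean_return = evaluate(StubEnv(), policy=lambda obs: 0.0, n_episodes=3)
passes_threshold = mean_return >= 1000  # threshold quoted from the assignment
print(mean_return, passes_threshold)
```

With a real environment, `StubEnv()` would be replaced by something like `gymnasium.make("Ant-v4")` and the lambda by the trained agent's action function.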

dniku commented 1 year ago

Thanks!