x35f / unstable_baselines

Re-implementations of SOTA RL algorithms.

OpenAI gym integration #54

Closed Karlheinzniebuhr closed 1 year ago

Karlheinzniebuhr commented 1 year ago

I'd like to do the following but instead of SB3 I'd like to plug in unstable baselines. Is there a quick start guide or documentation somewhere that could help me get started?

import gym
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv

# create the environment and wrap it in a vectorized environment
env = gym.make('MyEnv')
env = DummyVecEnv([lambda: env])

# create the PPO agent and train it on the environment
model = PPO('MlpPolicy', env, verbose=1)
model.learn(total_timesteps=10000)

# test the trained agent
obs = env.reset()
for i in range(100):
    action, _states = model.predict(obs)
    obs, rewards, dones, info = env.step(action)
    env.render()
    if dones:
        break
env.close()
x35f commented 1 year ago

We are currently working on documentation for USB; we apologize for the inconvenience. Since USB is a benchmark framework rather than a plugin for user-defined training tasks, the environments are pre-defined and the algorithms cannot be imported the way SB3 allows. The workflow in your example corresponds to a standard algorithm in USB: each directory under "unstable_baselines/baselines" implements one baseline algorithm, and you can modify its "trainer.py" file to change the training procedure. If you wish to add your own user-defined environment, here is a step-by-step guide.

  1. At the first few lines of env_wrapper.py, define a global variable listing the names of your environments, e.g. MY_ENVS = ['env1', 'env2'].
  2. In the "get_env" function in env_wrapper.py, environments are dispatched by name. Add an "elif" branch to the block, e.g. "elif env_name in MY_ENVS: \n return your_env_create_function()".
  3. In the algorithm folder, create the configs for your envs. For example, in the "unstable_baselines/baselines/sac" directory, create a "config/my_envs" directory, copy default.py from the "config/mujoco" directory into "config/my_envs", then create "config/my_envs/env1.py" and set its content to overwrite_args = { "env_name": "env1" } (a sketch of steps 1-3 follows this list).
  4. In the algorithm you wish to run, the get_env function should now return your envs and the algorithm should run automatically. E.g., in the "unstable_baselines/baselines/sac" directory, run the shell command python main.py configs/my_envs/env1.py --gpu 0.
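
For illustration, a minimal sketch of what steps 1-3 could look like (the helper names and the exact shape of get_env here are assumptions for illustration, not the actual USB source):

import gym

# step 1: near the top of env_wrapper.py, list your environment names
MY_ENVS = ['env1', 'env2']

def make_my_env(env_name, **kwargs):
    # factory for your custom environments; replace with your own constructor
    return gym.make(env_name, **kwargs)

# step 2: inside get_env, add a branch that dispatches on the name
def get_env(env_name, **kwargs):
    # ... existing branches for the pre-defined benchmark environments ...
    if env_name in MY_ENVS:
        return make_my_env(env_name, **kwargs)
    raise NotImplementedError("Unsupported environment: {}".format(env_name))

# step 3: content of unstable_baselines/baselines/sac/config/my_envs/env1.py
# (a separate file, created next to a copy of config/mujoco/default.py)
#   overwrite_args = {"env_name": "env1"}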

Note that USB only supports the latest gym-like environment interface. You can define your own wrapper to make your environment compatible (check out the classes named "*Wrapper" in env_wrapper.py for reference); a sketch follows below.
USB currently only supports single-environment sampling to keep the code simple, so the vecenv from SB3 is not supported. If you have any other problems, please feel free to leave comments and I will reply as soon as possible.
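
As a rough example of such a compatibility wrapper (not part of USB; the old-API-to-new-API conversion shown here is an assumption about what "latest gym-like interface" means), consider:

import gym

class LegacyToNewGymWrapper(gym.Wrapper):
    # Hypothetical adapter: wraps an env that still follows the old gym API
    # (reset() -> obs, step() -> (obs, reward, done, info)) so that it matches
    # the newer interface (reset() -> (obs, info), step() -> 5-tuple).
    def reset(self, **kwargs):
        obs = self.env.reset(**kwargs)
        return obs, {}

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        # the new API splits `done` into `terminated` and `truncated`
        return obs, reward, done, False, info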

Karlheinzniebuhr commented 1 year ago

OK, I almost got it working now but ran into an error. It occurs on line 365 of the networks.py file (error screenshot attached).

I inspected the self.action_dim and hidden_dims[-1] variables and they have these values (screenshot attached):

Any idea what could be the cause? My custom environment has 2 discrete actions in the action_space.

Below is my code:

# Create the environment
env = CryptoEnv(df=gymdf, window_size=window_size, frame_bound=training_frame_bound)

# Create new GYM environment for testing agent
test_env = CryptoEnv(df=gymdf, window_size=window_size, frame_bound=test_frame_bound)

# Instantiate the agent
args = {
  "env_name": "CryptoEnv",
  "buffer":{
    "max_buffer_size": 1000000
  },
  "agent":{
    "gamma": 0.99,
    "reward_scale": 5.0,
    "update_target_network_interval": 1,
    "target_smoothing_tau": 0.005,
    "num_q_networks": 10,
    "num_q_samples": 2,
    "alpha": 0.2,
    "q_network":{
      "network_params": [("mlp", 256), ("mlp", 256)],
      "optimizer_class": "Adam",
      "learning_rate":0.0003,
      "act_fn": "relu",
      "out_act_fn": "identity"
    },
    "policy_network":{
      "network_params": [("mlp", 256), ("mlp", 256)],
      "optimizer_class": "Adam",
      "deterministic": False,
      "learning_rate":0.0003,
      "act_fn": "relu",
      "out_act_fn": "identity",
      "reparameterize": True
    },
    "entropy":{
      "automatic_tuning": True,
      "learning_rate": 0.0003,
      "optimizer_class": "Adam"
    }
  },
  "trainer":{
    "max_env_steps":500000,
    "batch_size": 256,
    "max_trajectory_length":1000,
    "update_policy_interval": 20,
    "eval_interval": 10000,
    "num_eval_trajectories": 10,
    "save_video_demo_interval": -1,
    "warmup_timesteps": 5000,
    "snapshot_interval": 5000,
    "log_interval": 200,
    "utd": 20
  }
}

seed = 0
#set global seed
set_global_seed(seed)

observation_space = env.observation_space
action_space = env.action_space

#initialize buffer
buffer = ReplayBuffer(observation_space, action_space, **args['buffer'])

#initialize agent
agent = REDQAgent(observation_space, action_space, **args['agent'])

#initialize trainer
trainer  = REDQTrainer(
    agent,
    env,
    test_env, 
    buffer,
    **args['trainer']
)

trainer.train()

# Save the agent
model_name = "custom_env_test_redq.zip"
agent.save(model_name)
x35f commented 1 year ago

Currently, only the DQN algorithm in USB supports discrete action spaces, and it does not require a policy network. The action_space property of your environment should be defined as gym.spaces.discrete.Discrete(2). None of the baseline algorithms uses the CategoricalPolicyNetwork, and it is lagging behind the latest code structure; sorry for the bugs you have encountered. There are variants of the baseline algorithms that extend to discrete action spaces, e.g. SAC-discrete and PPO-discrete, and as far as I know PPO-discrete performs better than SAC-discrete. For the REDQ algorithm you are using, the SAC-discrete policy update procedure is the same and can be integrated into the REDQ agent implementation fairly easily. We are also working on adding discrete action space support based on the known algorithm variants, including SAC, REDQ, and PPO, and will update in the next few days.
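
For reference, a minimal skeleton showing how a custom environment (a hypothetical CryptoEnv like the one in your code; the observation shape is just an example) would declare a two-action discrete space:

import gym
import numpy as np

class CryptoEnv(gym.Env):
    # Hypothetical skeleton; only the space definitions matter here.
    def __init__(self, df, window_size, frame_bound):
        super().__init__()
        self.action_space = gym.spaces.Discrete(2)    # two discrete actions
        self.observation_space = gym.spaces.Box(      # window_size rows of 5 features
            low=-np.inf, high=np.inf, shape=(window_size, 5), dtype=np.float32
        )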

Karlheinzniebuhr commented 1 year ago

OK, thanks. Since I'm not sure I'm familiar enough to implement SAC-discrete for REDQ myself, I'd like to leave this issue open and kindly ask you to let me know as soon as discrete action spaces are available for REDQ. I have a feeling that this algorithm could beat PPO in discrete spaces.

x35f commented 1 year ago

Hi, sorry for the late reply. I have implemented discrete action space support for PPO, SAC, and REDQ (more details in the updated README.md). These algorithms have been tested on simple gym environments, and the results are plotted in this figure. The SVG file can be viewed by downloading it and opening it with Chrome. REDQ and Atari are quite time-consuming, so those results are not available yet.

Sadly, SAC-discrete and REDQ-discrete consistently perform much worse than PPO-discrete and DQN, and PPO-discrete is the best among them. The performance of SAC-discrete on CartPole-v1 matches other open-source implementations, so we tend to believe this reflects the actual capability of SAC-like algorithms in discrete action tasks, probably because the discrete variant gives up the re-parameterization trick. Fine-tuning the hyper-parameters might help to some extent. For your discrete action environment, I would suggest trying PPO-discrete and DQN, and adjusting the network size according to the observation and action spaces (see the config sketch below).
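
To make the last suggestion concrete, here is a hedged sketch of a per-environment config override, following the overwrite_args pattern from the steps earlier in this thread (the file path and the exact keys the DQN baseline accepts are assumptions):

# e.g. unstable_baselines/baselines/dqn/config/my_envs/env1.py (path assumed)
# Smaller MLPs for a small observation/action space; adjust sizes as needed.
overwrite_args = {
    "env_name": "env1",
    "agent": {
        "q_network": {
            "network_params": [("mlp", 64), ("mlp", 64)]
        }
    }
}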

Karlheinzniebuhr commented 1 year ago

Thank you so much! Just one more quick question: I'm getting an issue with my env.observation_space, whose shape is (10, 5). The assertion inside get_network() fails. Am I making a mistake by passing a gym.spaces.box.Box object as the observation_space for REDQAgent?

(screenshot of the failing assertion attached)

My current code:

# Create the environment
env = CryptoEnv(df=gymdf, window_size=window_size, frame_bound=training_frame_bound)

# Create new GYM environment for testing agent
test_env = CryptoEnv(df=gymdf, window_size=window_size, frame_bound=test_frame_bound)

# Instantiate the agent
args = {
  "env_name": "",
  "env":{

  },
  "buffer":{
    "max_buffer_size": 100000
  },
  "agent":{
    "gamma": 0.99,
    "reward_scale": 5.0,
    "update_target_network_interval": 1,
    "target_smoothing_tau": 0.005,
    "num_q_networks": 10,
    "num_q_samples": 2,
    "alpha": 0.2,
    "q_network":{
      "network_params": [("mlp", 64), ("mlp", 64)],
      "optimizer_class": "Adam",
      "learning_rate":0.0003,
      "act_fn": "relu",
      "out_act_fn": "identity"
    },
    "policy_network":{
      "network_params": [("mlp", 64), ("mlp", 64)],
      "optimizer_class": "Adam",
      "deterministic": False,
      "learning_rate":0.0003,
      "act_fn": "relu",
      "out_act_fn": "identity",
      "reparameterize": True,
      "stablelize_log_prob": True,
    },
    "entropy":{
      "automatic_tuning": True,
      "learning_rate": 0.0003,
      "optimizer_class": "Adam"
    }
  },
  "trainer":{
    "max_env_steps":200000,
    "batch_size": 256,
    "max_trajectory_length":1000,
    "update_policy_interval": 20,
    "eval_interval": 2000,
    "num_eval_trajectories": 10,
    "save_video_demo_interval": -1,
    "warmup_timesteps": 1000,
    "snapshot_interval": 10000,
    "log_interval": 100,
    "utd": 20
  }
}

seed = 0
#set global seed
set_global_seed(seed)

observation_space = env.observation_space
action_space = env.action_space

#initialize buffer
buffer = ReplayBuffer(observation_space, action_space, **args['buffer'])

#initialize agent
agent = REDQAgent(observation_space, action_space, **args['agent'])

#initialize trainer
trainer  = REDQTrainer(
    agent,
    env,
    test_env, 
    buffer,
    **args['trainer']
)

trainer.train()
x35f commented 1 year ago

USB (and most RL frameworks, because of the nature of neural networks) only supports two kinds of observation input: a one-dimensional vector of shape (size,) and an image of shape (channel, width, height), where channel is 3 for RGB images. If your observation is image-like, you can expand it to 3 channels and take advantage of a convolutional network's receptive field. Otherwise, you can modify the environment by flattening the observation and use an MLP network to process the input. Both can be done with a gym wrapper around the environment. The "PyTorchFrame" class in "unstable_baselines/common/env_wrapper.py" gives an example of converting the observation space between the (width, height, channel) and (channel, width, height) layouts. For your environment, you can try either of the following wrapping solutions (a sketch of the first option appears at the end of this comment):

  1. Set the observation space to shape (50,) in the "__init__" function, and return obs.flatten() in the "observation" function.
  2. Set the observation space to shape (3, 10, 5), and either simply return np.repeat([obs], 3, axis=0) in the "observation" function, or retrieve some real RGB color information if possible.

BTW, the target entropy for SAC and REDQ in discrete action environments is very, very tricky; I have been tuning this hyper-parameter for the last few days and made some progress in improving the performance. If the default configuration fails for your environment, you can try tuning the target_entropy by modifying "self.target_entropy = -np.log(1.0 / action_dim) * a_tricky_parameter", where "a_tricky_parameter" is a hyper-parameter between 0 and 1.
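
A minimal sketch of the first option, assuming CryptoEnv exposes a Box observation space of shape (10, 5) as in your code (gym.ObservationWrapper is standard gym; the wrapper name and the usage line are illustrative):

import gym
import numpy as np

class FlattenObsWrapper(gym.ObservationWrapper):
    # Flatten a (10, 5) window observation into a (50,) vector for the MLP networks.
    def __init__(self, env):
        super().__init__(env)
        low = env.observation_space.low.flatten()
        high = env.observation_space.high.flatten()
        self.observation_space = gym.spaces.Box(low=low, high=high, dtype=np.float32)

    def observation(self, obs):
        return np.asarray(obs, dtype=np.float32).flatten()

# usage with the code above:
# env = FlattenObsWrapper(CryptoEnv(df=gymdf, window_size=window_size, frame_bound=training_frame_bound))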