It learns the value of taking an action in a particular state. The algorithm stores a value for each state-action pair in a table and updates that table from the reward the agent receives. The goal of Q-learning is to find the optimal policy for the agent in its environment.
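A minimal sketch of the tabular update, assuming the table is a NumPy array indexed by (state, action); the names alpha and gamma are illustrative, not from this repo:

import numpy as np

def q_learning_update(q_table, state, action, reward, next_state,
                      alpha=0.1, gamma=0.99):
    # One tabular step: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
    td_target = reward + gamma * np.max(q_table[next_state])
    q_table[state, action] += alpha * (td_target - q_table[state, action])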
[X] Cross Entropy
It is a Monte Carlo method used to optimize the policy of an agent: whole episodes are sampled, only the best-scoring ("elite") episodes are kept, and the policy is trained to minimize the cross-entropy between its predicted action distribution and the actions actually taken in those elite episodes.
More generally, the cross-entropy method is a (Monte Carlo) stochastic optimization algorithm that can be used to solve optimization problems where the objective function is difficult to evaluate directly.
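A hedged sketch of the generic CEM loop (function and parameter names here are illustrative): sample candidates from a Gaussian, keep the elite fraction, refit the Gaussian to them, and repeat.

import numpy as np

def cem_optimize(score_fn, dim, n_iters=50, n_samples=64, elite_frac=0.2):
    # Iteratively refit a Gaussian search distribution to the elite samples.
    mean, std = np.zeros(dim), np.ones(dim)
    n_elite = max(1, int(n_samples * elite_frac))
    for _ in range(n_iters):
        samples = mean + std * np.random.randn(n_samples, dim)
        scores = np.array([score_fn(s) for s in samples])
        elite = samples[np.argsort(scores)[-n_elite:]]  # top-scoring candidates
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mean

For policy search, score_fn would load a candidate parameter vector into the policy and return the total reward of one or more episodes.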
[ ] Cross-Entropy Guided Policy (CGP) learning
It is a general Q-function and policy training method that can be combined with most deep Q-learning methods. It demonstrates improved training stability across runs, hyperparameter combinations, and tasks, while avoiding the computational expense of a sample-based policy at inference time.
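The core idea, sketched under the assumption that cem_action_fn is a hypothetical helper running a CEM loop (like the one above) over actions to approximately maximize the learned Q-function:

import torch

def cgp_policy_loss(policy, q_net, states, cem_action_fn):
    # Train the policy to imitate the CEM-found action, so inference
    # needs a single cheap policy forward pass instead of CEM sampling.
    with torch.no_grad():
        target_actions = cem_action_fn(q_net, states)
    return torch.nn.functional.mse_loss(policy(states), target_actions)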
"""
The base of RL algorithms
:param policy: The policy model to use (MlpPolicy, CnnPolicy, ...)
:param env: The environment to learn from
(if registered in Gym, can be str. Can be None for loading trained models)
:param learning_rate: learning rate for the optimizer,
it can be a function of the current progress remaining (from 1 to 0)
:param policy_kwargs: Additional arguments to be passed to the policy on creation
:param stats_window_size: Window size for the rollout logging, specifying the number of episodes to average
the reported success rate, mean episode length, and mean reward over
:param tensorboard_log: the log location for tensorboard (if None, no logging)
:param verbose: Verbosity level: 0 for no output, 1 for info messages (such as device or wrappers used), 2 for
debug messages
:param device: Device on which the code should run.
By default, it will try to use a CUDA-compatible device and fall back to CPU
if that is not possible.
:param support_multi_env: Whether the algorithm supports training
with multiple environments (as in A2C)
:param monitor_wrapper: When creating an environment, whether to wrap it
or not in a Monitor wrapper.
:param seed: Seed for the pseudo random generators
:param use_sde: Whether to use generalized State Dependent Exploration (gSDE)
instead of action noise exploration (default: False)
:param sde_sample_freq: Sample a new noise matrix every n steps when using gSDE
Default: -1 (only sample at the beginning of the rollout)
:param supported_action_spaces: The action spaces supported by the algorithm.
"""
Current status:
RL Algorithms
Model Free (TorchSharp)
[X] QLearning
[X] Cross Entropy
[ ] Cross-Entropy Guided Policy (CGP) learning
Also look into
BaseAlgorithm
The base class upon which the state-of-the-art RL algorithms depend.
RL Algorithms
These algorithms are classified into TWO groups:
on-policy and off-policy, both inheriting from BaseAlgorithm
off_policy_algorithm.py
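A sketch of that two-way split (stable-baselines3 keeps the on-policy counterpart in on_policy_algorithm.py; the method names below are illustrative):

class OnPolicyAlgorithm(BaseAlgorithm):
    # Learns from fresh rollouts collected by the current policy (e.g. A2C, PPO).
    def collect_rollouts(self, env): ...

class OffPolicyAlgorithm(BaseAlgorithm):
    # Learns from a replay buffer of past transitions (e.g. DQN, SAC, TD3).
    def store_transition(self, transition): ...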