It learns the value of taking an action in a particular state. The algorithm stores a value for each state-action pair in a table and updates that table from the reward the agent receives. The goal of Q-learning is to find the optimal policy for the agent in its environment.
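A minimal sketch of the tabular update, assuming the table is a NumPy array indexed by (state, action); the names alpha and gamma are illustrative, not from this repo:

import numpy as np

def q_learning_update(q_table, state, action, reward, next_state,
                      alpha=0.1, gamma=0.99):
    # One tabular step: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
    td_target = reward + gamma * np.max(q_table[next_state])
    q_table[state, action] += alpha * (td_target - q_table[state, action])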
[X] Cross Entropy
It is a Monte Carlo method used to optimize the policy of an agent: whole episodes are sampled, only the best-scoring ("elite") episodes are kept, and the policy is trained to minimize the cross-entropy between its predicted action distribution and the actions actually taken in those elite episodes.
More generally, the cross-entropy method is a (Monte Carlo) stochastic optimization algorithm that can be used to solve optimization problems where the objective function is difficult to evaluate directly.
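A hedged sketch of the generic CEM loop (function and parameter names here are illustrative): sample candidates from a Gaussian, keep the elite fraction, refit the Gaussian to them, and repeat.

import numpy as np

def cem_optimize(score_fn, dim, n_iters=50, n_samples=64, elite_frac=0.2):
    # Iteratively refit a Gaussian search distribution to the elite samples.
    mean, std = np.zeros(dim), np.ones(dim)
    n_elite = max(1, int(n_samples * elite_frac))
    for _ in range(n_iters):
        samples = mean + std * np.random.randn(n_samples, dim)
        scores = np.array([score_fn(s) for s in samples])
        elite = samples[np.argsort(scores)[-n_elite:]]  # top-scoring candidates
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mean

For policy search, score_fn would load a candidate parameter vector into the policy and return the total reward of one or more episodes.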
[ ] Cross-Entropy Guided Policy (CGP) learning
It is a general Q-function and policy training method that can be combined with most deep Q-learning methods. It demonstrates improved training stability across runs, hyperparameter combinations, and tasks, while avoiding the computational expense of a sample-based policy at inference time.
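The core idea, sketched under the assumption that cem_action_fn is a hypothetical helper running a CEM loop (like the one above) over actions to approximately maximize the learned Q-function:

import torch

def cgp_policy_loss(policy, q_net, states, cem_action_fn):
    # Train the policy to imitate the CEM-found action, so inference
    # needs a single cheap policy forward pass instead of CEM sampling.
    with torch.no_grad():
        target_actions = cem_action_fn(q_net, states)
    return torch.nn.functional.mse_loss(policy(states), target_actions)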
"""
The base of RL algorithms
:param policy: The policy model to use (MlpPolicy, CnnPolicy, ...)
:param env: The environment to learn from
(if registered in Gym, can be str. Can be None for loading trained models)
:param learning_rate: learning rate for the optimizer,
it can be a function of the current progress remaining (from 1 to 0)
:param policy_kwargs: Additional arguments to be passed to the policy on creation
:param stats_window_size: Window size for the rollout logging, specifying the number of episodes to average
the reported success rate, mean episode length, and mean reward over
:param tensorboard_log: the log location for tensorboard (if None, no logging)
:param verbose: Verbosity level: 0 for no output, 1 for info messages (such as device or wrappers used), 2 for
debug messages
:param device: Device on which the code should run.
By default, it will try to use a CUDA-compatible device and fall back to CPU
if that is not possible.
:param support_multi_env: Whether the algorithm supports training
with multiple environments (as in A2C)
:param monitor_wrapper: When creating an environment, whether to wrap it
or not in a Monitor wrapper.
:param seed: Seed for the pseudo random generators
:param use_sde: Whether to use generalized State Dependent Exploration (gSDE)
instead of action noise exploration (default: False)
:param sde_sample_freq: Sample a new noise matrix every n steps when using gSDE
Default: -1 (only sample at the beginning of the rollout)
:param supported_action_spaces: The action spaces supported by the algorithm.
"""
Current status:
RL Algorithms
Model Free (TorchSharp)
[X] QLearning
[X] Cross Entropy
[ ] Cross-Entropy Guided Policy (CGP) learning
Also look into
BaseAlgorithm
The base class upon which the state-of-the-art RL algorithms depend.
RL Algorithms
These algorithms are classified into TWO groups:
on-policy and off-policy, both inheriting from BaseAlgorithm
off_policy_algorithm.py
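A sketch of that two-way split (stable-baselines3 keeps the on-policy counterpart in on_policy_algorithm.py; the method names below are illustrative):

class OnPolicyAlgorithm(BaseAlgorithm):
    # Learns from fresh rollouts collected by the current policy (e.g. A2C, PPO).
    def collect_rollouts(self, env): ...

class OffPolicyAlgorithm(BaseAlgorithm):
    # Learns from a replay buffer of past transitions (e.g. DQN, SAC, TD3).
    def store_transition(self, transition): ...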