https://xiang578.com/post/reinforce-learnning-basic-actor-critic.html
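My notes index: Policy Gradient, PPO: Proximal Policy Optimization, Q-Learning, Actor Critic, Sparse Reward, Imitation Learning. Actor Critic: in policy gradient, the policy gives the probability of taking a given action in a given state. The role of the baseline b is to ensure that the reward ...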
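The excerpt mentions a policy that assigns probabilities to actions and a baseline b applied to the reward. A minimal sketch of how those two pieces fit together is below, assuming a softmax policy over three discrete actions in a single state and a REINFORCE-style gradient estimate. The function names (`softmax`, `grad_log_prob`, `policy_gradient`) and the sample data are hypothetical, chosen for illustration, and are not taken from the linked notes.

```python
import math


def softmax(logits):
    """Turn raw action scores into action probabilities."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def grad_log_prob(logits, action):
    """Gradient of log pi(action) w.r.t. the logits: one_hot(action) - probs."""
    probs = softmax(logits)
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]


def policy_gradient(logits, samples, baseline=0.0):
    """Average of (reward - baseline) * grad log pi(action) over the samples.

    samples is a list of (action, reward) pairs (illustrative data only).
    """
    grad = [0.0] * len(logits)
    for action, reward in samples:
        g = grad_log_prob(logits, action)
        for i in range(len(logits)):
            grad[i] += (reward - baseline) * g[i] / len(samples)
    return grad


logits = [0.0, 0.0, 0.0]           # uniform policy over three actions
samples = [(0, 3.0), (1, 1.0)]     # action 2 was never sampled
mean_reward = sum(r for _, r in samples) / len(samples)

grad_plain = policy_gradient(logits, samples)
grad_with_b = policy_gradient(logits, samples, baseline=mean_reward)
print(grad_plain)
print(grad_with_b)
```

With all rewards positive and no baseline, every sampled action gets a positive weight, so the unsampled action 2 is pushed down only because it was not drawn. Using the mean reward as b makes the weight `reward - b` negative for the below-average action 1, and in this example the gradient on action 2 becomes zero.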