Open xysun opened 5 years ago
RND achieves better-than-human performance on Montezuma's Revenge, with no demonstrations and no access to the underlying state
Previous work:
Random Network Distillation (RND) to avoid the noisy-TV problem:
Implementation
Need to understand more on:
theoretically, how does RND avoid the noisy-TV problem?
The intuition is that predictive models have low error in states similar to the ones they have been trained on. In particular the agent’s predictions of the output of a randomly initialized neural network will be less accurate in novel states than in states the agent visited frequently. The advantage of using a synthetic prediction problem is that we can have it be deterministic (bypassing Factor 2) and inside the class of functions the predictor can represent (bypassing Factor 3) by choosing the predictor to be of the same architecture as the target network. These choices make RND immune to the noisy-TV problem.
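The idea above can be sketched in a few lines: a fixed, randomly initialized target network, a predictor of the same architecture trained to match it, and the prediction error used as the novelty signal. This is a minimal NumPy sketch (linear "networks", a single familiar state, made-up dimensions and learning rate), not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
obs_dim, feat_dim = 8, 16

# Fixed, randomly initialized target network (never trained).
W_target = rng.normal(size=(obs_dim, feat_dim))

# Predictor of the same architecture, trained to match the target.
W_pred = rng.normal(size=(obs_dim, feat_dim))

def intrinsic_reward(obs):
    """MSE between predictor and fixed random target features."""
    err = obs @ W_pred - obs @ W_target
    return float(np.mean(err ** 2))

def train_predictor(obs, lr=0.01):
    """One gradient step on the predictor's MSE for this observation."""
    global W_pred
    err = obs @ W_pred - obs @ W_target       # (feat_dim,)
    grad = np.outer(obs, err) * 2 / feat_dim  # d(MSE)/d(W_pred)
    W_pred -= lr * grad

# A frequently visited state: prediction error shrinks with training.
familiar = rng.normal(size=obs_dim)
before = intrinsic_reward(familiar)
for _ in range(500):
    train_predictor(familiar)
after = intrinsic_reward(familiar)

# A novel state: the predictor was never trained on it, so the
# error (and hence the exploration bonus) stays high.
novel = rng.normal(size=obs_dim)
```

Because the target is a deterministic function of the observation and the predictor has the same architecture, the error can in principle be driven to zero on visited states, which is exactly why a stochastic noisy TV doesn't keep the bonus high forever.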
follow up readings:
things to try
Train a curious agent on many different environments without reward and investigate the transfer to target environments with rewards.
Goal:
Proposed approach: