yamatokataoka / learning-from-human-preferences

Replication of Deep Reinforcement Learning from Human Preferences (Christiano et al., 2017).
MIT License

initial research #7

Open yamatokataoka opened 2 years ago

yamatokataoka commented 2 years ago

implementations

yamatokataoka commented 2 years ago

The final training code would look something like this, using rl-replicas:

class SamplerWithHumanPrefs:
    def __init__(self):
        pass

    def sample(self, env: Env, steps: int, reward_predictor, human_pref_collector):
        for step in range(steps):
            # roll out the policy in env to collect observations (omitted here)
            human_pref_collector.push(observations)
            prefs = human_pref_collector.collect()
            reward_predictor.train(prefs)
            reward = reward_predictor.reward(observations)
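To make the `reward_predictor.train(prefs)` step concrete, here is a minimal, self-contained sketch of the core idea from Christiano et al. (2017): fit a reward predictor on pairwise segment preferences with the Bradley-Terry / logistic loss. This is not the rl-replicas API; the class name, the `(segment_a, segment_b, mu)` preference format, and the linear reward model are all illustrative assumptions.

```python
import numpy as np

class LinearRewardPredictor:
    """Illustrative linear reward model trained on pairwise preferences."""

    def __init__(self, obs_dim: int, lr: float = 0.1):
        self.w = np.zeros(obs_dim)
        self.lr = lr

    def reward(self, obs: np.ndarray) -> np.ndarray:
        # Predicted per-step reward: a linear function of the observation.
        return obs @ self.w

    def train(self, prefs) -> None:
        # Each preference is (seg_a, seg_b, mu), where mu = 1.0 means the
        # human preferred segment a and mu = 0.0 means segment b.
        for seg_a, seg_b, mu in prefs:
            r_a = self.reward(seg_a).sum()  # predicted return of segment a
            r_b = self.reward(seg_b).sum()  # predicted return of segment b
            # P(a preferred over b) under the Bradley-Terry model.
            p_a = 1.0 / (1.0 + np.exp(r_b - r_a))
            # Gradient of the cross-entropy loss with respect to the weights.
            grad = (p_a - mu) * (seg_a.sum(axis=0) - seg_b.sum(axis=0))
            self.w -= self.lr * grad

rng = np.random.default_rng(0)
predictor = LinearRewardPredictor(obs_dim=3)

# Synthetic preferences: the "human" prefers segments whose first
# observation feature sums to a larger value.
prefs = []
for _ in range(200):
    a = rng.normal(size=(5, 3))
    b = rng.normal(size=(5, 3))
    mu = 1.0 if a[:, 0].sum() > b[:, 0].sum() else 0.0
    prefs.append((a, b, mu))

for _ in range(50):
    predictor.train(prefs)

print(predictor.w)  # the first weight should dominate after training
```

In the paper the sampler would then use `predictor.reward(observations)` in place of the environment reward when computing returns for the policy update.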
yamatokataoka commented 1 year ago

When do we collect preferences?