yamatokataoka opened 2 years ago
The final training code would look something like this, using rl-replicas:
```python
from gym import Env

class SamplerWithHumanPrefs:
    def __init__(self):
        pass  # nothing to configure yet

    def sample(self, env: Env, steps: int, reward_predictor, human_pref_collector):
        observations = env.reset()
        for _ in range(steps):
            # (action selection / env.step elided in this sketch)
            human_pref_collector.push(observations)         # queue obs for human labeling
            prefs = human_pref_collector.collect()          # gather labels returned so far
            reward_predictor.train(prefs)                   # fit reward model on preferences
            reward = reward_predictor.reward(observations)  # learned reward replaces env reward
```
When do we collect preferences?
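For context: the loop above pushes, collects, and retrains on every single step. In Christiano et al. (2017)-style setups, humans are instead queried only periodically, with labels trickling back asynchronously while training continues. Below is a minimal sketch of one such schedule; `sample_with_scheduled_prefs` and `query_interval` are hypothetical names for illustration, not part of rl-replicas:

```python
from gym import Env

def sample_with_scheduled_prefs(env: Env, steps: int, reward_predictor,
                                human_pref_collector, query_interval: int = 100):
    # Same loop as above, except humans are only queried every
    # `query_interval` steps rather than on every step (assumed schedule).
    observations = env.reset()
    for step in range(steps):
        if step % query_interval == 0:
            human_pref_collector.push(observations)  # request labels occasionally
            prefs = human_pref_collector.collect()   # labels may lag behind requests
            if prefs:
                reward_predictor.train(prefs)        # train only when labels arrived
        # The learned reward is still used at every step.
        reward = reward_predictor.reward(observations)
```

Whether this scheduling belongs inside the sampler at all, or in a separate asynchronous process, seems to be the crux of the question.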