yamatokataoka / learning-from-human-preferences

Replication of Deep Reinforcement Learning from Human Preferences (Christiano et al., 2017).
MIT License

High-level Design #12

Closed yamatokataoka closed 6 months ago

yamatokataoka commented 11 months ago

Design Document for RLHF

Introduction

This project replicates the research paper *Deep Reinforcement Learning from Human Preferences* (Christiano et al., 2017). We aim to:

Implementation

To achieve these goals, we will rely on the following tech stack:

(attached image: IMG_1761)

High-level design

Data Flow:

  1. The RL agent interacts with the MuJoCo environment.
  2. The agent's actions and resulting states are visualized in the web application.
  3. Users provide feedback based on the observed behavior.
  4. Feedback data is collected and stored in Redis.
  5. The reward model is trained using the feedback data.
  6. The learned reward model provides guidance for the RL agent's policy updates.
  7. The improved policy leads to better performance in the MuJoCo environment, showcased in the visualization.
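Steps 3–4 above collect human preferences and store them in Redis. As a rough sketch of what one feedback record might look like, assuming a redis-py-style client and a hypothetical schema (the field names `left_segment_id`, `right_segment_id`, `preference`, and the list key `feedback` are illustrative, not from the repo):

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class FeedbackRecord:
    # IDs of the two trajectory segments shown to the user (hypothetical schema)
    left_segment_id: str
    right_segment_id: str
    # 0.0 = left preferred, 1.0 = right preferred, 0.5 = no preference
    preference: float


def push_feedback(client, record: FeedbackRecord, key: str = "feedback") -> None:
    """Append one feedback record to a Redis list as JSON.

    `client` is any object exposing rpush(key, value), e.g. a
    redis-py Redis instance.
    """
    client.rpush(key, json.dumps(asdict(record)))
```

Serializing records as JSON in a Redis list keeps the web application decoupled from the reward-model trainer, which can consume the list at its own pace.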
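Step 5 trains the reward model from pairwise preferences. The paper fits the reward predictor with a Bradley–Terry-style cross-entropy loss, where the probability of preferring one segment follows from the exponentiated sums of predicted per-step rewards. The repo's actual training code may differ; this is a minimal NumPy sketch of that loss, with `preference` denoting the probability mass the human placed on the *right* segment:

```python
import numpy as np


def preference_loss(r_left, r_right, preference):
    """Cross-entropy loss over a preference between two segments.

    P(left > right) = exp(sum r_left) / (exp(sum r_left) + exp(sum r_right)),
    following Christiano et al. (2017).

    r_left, r_right: predicted per-step rewards for each segment.
    preference: 0.0 (left preferred), 1.0 (right preferred), or 0.5 (tie).
    """
    z_left = np.sum(r_left)
    z_right = np.sum(r_right)
    # log-sum-exp via np.logaddexp for numerical stability
    log_denom = np.logaddexp(z_left, z_right)
    log_p_left = z_left - log_denom
    log_p_right = z_right - log_denom
    return -((1.0 - preference) * log_p_left + preference * log_p_right)
```

Minimizing this loss over stored feedback records yields the learned reward function that guides the policy updates in step 6.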