x-tu / GGF-wcMDP


Fixing DQN policies to LP #10

Open x-tu opened 1 year ago

x-tu commented 1 year ago

Current issue: The model training process is packed inside the learn function, which loops over time steps and calls the collect_rollouts function and then the train function. To connect to an LP solver, returning the policy network at every time step would not be very efficient; it is better to modify the relevant function inside the SB3 library.

Solution (steps to go): at each iteration, call the policy function and solve the LP.
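
For comparison, a callback can also call the policy at a fixed interval without modifying SB3 internals. A minimal sketch, assuming stable-baselines3 and gymnasium are installed; solve_lp is a hypothetical stand-in for the actual policy-to-LP step, while DQN, BaseCallback, num_timesteps, and policy.q_net are standard SB3 API:

```python
from stable_baselines3 import DQN
from stable_baselines3.common.callbacks import BaseCallback


def solve_lp(q_net):
    """Placeholder: convert the Q-network into a policy and solve the LP."""
    pass


class LPPolicyCallback(BaseCallback):
    """Call the LP step every `freq` environment steps during learn()."""

    def __init__(self, freq: int = 50, verbose: int = 0):
        super().__init__(verbose)
        self.freq = freq

    def _on_step(self) -> bool:
        if self.num_timesteps % self.freq == 0:
            solve_lp(self.model.policy.q_net)  # current (online) Q-network
        return True  # returning False would abort training


model = DQN("MlpPolicy", "CartPole-v1", verbose=0)
model.learn(total_timesteps=5_000, callback=LPPolicyCallback(freq=50))
```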

x-tu commented 1 year ago

Description of the calling sequence: DQN.learn (inherited from the learn function of the base class OffPolicyAlgorithm) -> OffPolicyAlgorithm.learn (contains a time-step loop that calls collect_rollouts to collect data and make predictions, while train performs the gradient updates of the policy network) -> DQN.train (customized after class inheritance).

We do not need to investigate how the policy-network update itself is implemented; we just call the policy_convertor function after the train function is called.
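
A minimal sketch of that hook, assuming SB3's DQN (whose train signature is train(gradient_steps, batch_size)); the policy_convertor body here is only a placeholder for the repo's actual conversion logic:

```python
from stable_baselines3 import DQN


class DQNToLP(DQN):
    def train(self, gradient_steps: int, batch_size: int = 100) -> None:
        # Run the standard DQN gradient update(s) first ...
        super().train(gradient_steps=gradient_steps, batch_size=batch_size)
        # ... then convert the freshly updated Q-network for the LP solver.
        self.policy_convertor()

    def policy_convertor(self) -> None:
        q_net = self.policy.q_net  # online Q-network (not the target network)
        # Placeholder: derive the policy from q_net and hand it to the LP.
        pass


model = DQNToLP("MlpPolicy", "CartPole-v1", train_freq=4, learning_starts=100)
model.learn(total_timesteps=5_000)
```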

Current Behavior

  1. The policy_convertor function is called every $M = 50$ time steps (batch training? the target policy network is not updated after a fixed number of time steps).

    Q1: What controls the fixed $M$ time steps? A: it is not explicitly specified; where is the value assigned? Q2: When is the time step updated? (Not after learn/train; see the inspection sketch below this list.)

    Guess: inside the learn function, before the loop begins

  2. (TBC)
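
For reference, if the 50-step cadence comes from SB3's own scheduling rather than from code in this repo, the relevant constructor arguments would be train_freq (environment steps collected between train() calls), learning_starts, and target_update_interval; num_timesteps itself is incremented inside collect_rollouts, i.e. within the loop in learn. A quick inspection sketch, assuming the SB3 API (the train_freq=50 value here is only an assumption, not necessarily what the repo sets):

```python
from stable_baselines3 import DQN

model = DQN("MlpPolicy", "CartPole-v1", train_freq=50, learning_starts=0)
print(model.train_freq)              # env steps collected between train() calls
print(model.learning_starts)         # steps gathered before training starts
print(model.target_update_interval)  # steps between target-network copies
```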