Description
Calling sequence:
`DQN.learn` (inherited from the `learn` method of the base class `OffPolicyAlgorithm`) -> `OffPolicyAlgorithm.learn` (contains a time-step loop that calls the `collect_rollouts` function to collect data and make predictions, and the `train` function to control/update the gradients of the policy network) -> `DQN.train` (customized after class inheritance).
We do not need to investigate how the policy update is implemented; we only need to call the `policy_convertor` function after the `train` function is called.
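A minimal sketch of that hook, assuming stable-baselines3's `DQN` is subclassed; `policy_convertor` is this project's own converter, so its body below is only a placeholder:

```python
from stable_baselines3 import DQN


def policy_convertor(policy):
    """Placeholder for the project's converter: extract whatever the
    downstream consumer needs from the current Q-network."""
    return policy.q_net.state_dict()


class ConvertingDQN(DQN):
    """DQN whose `train` additionally converts the updated policy."""

    def train(self, gradient_steps: int, batch_size: int = 100) -> None:
        # Run the standard DQN gradient update first ...
        super().train(gradient_steps, batch_size)
        # ... then convert the freshly updated policy network.
        self.converted_policy = policy_convertor(self.policy)
```

`ConvertingDQN` is trained exactly like `DQN`; every time the learn loop calls `train`, the conversion runs immediately afterwards.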
Current Behavior
The `policy_convertor` function is called every $M = 50$ time steps.
(Batch training? The target policy network is only updated after a fixed number of time steps.)
Q1: What controls the fixed $M$ time steps? A: It is not explicitly specified, so where is the value assigned?
Q2: When is the time step counter updated? (Not after `learn`/`train`.)
Guess: inside the `learn` function, before the loop begins. (TBC)
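For reference, the SB3 `DQN` constructor does expose two arguments that govern these intervals, which may be where $M$ comes from; a sketch (the value 50 is illustrative, chosen to match the observed $M$, and is not the library default):

```python
from stable_baselines3 import DQN

model = DQN(
    "MlpPolicy",
    "CartPole-v1",
    train_freq=4,               # how often `train` is called (every 4 env steps)
    target_update_interval=50,  # illustrative: copy q_net -> q_net_target every 50 steps
)
```

Regarding Q2: SB3 increments its `num_timesteps` counter inside `collect_rollouts`, i.e. within the loop body rather than before the loop begins.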
Current issue: The model training process is packed inside the `learn` function (which sequentially calls the `collect_rollouts` function and then the `train` function). To connect to an LP solver, it might not be very efficient to return the policy network at every time step; it would be better to modify the function inside the SB3 library.
Solution (Steps to go)
At each iteration, call the policy function and solve the LP:
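A minimal sketch of that step, assuming SB3's callback hook is used so the library's `learn` loop itself does not have to be edited; `solve_lp` is a hypothetical stand-in for the actual LP-solver interface:

```python
from stable_baselines3 import DQN
from stable_baselines3.common.callbacks import BaseCallback


def solve_lp(q_network):
    """Hypothetical: build and solve the LP from the current Q-network."""
    ...


class LPCallback(BaseCallback):
    """Hand the current policy to the LP solver every `solve_every` steps."""

    def __init__(self, solve_every: int = 50, verbose: int = 0):
        super().__init__(verbose)
        self.solve_every = solve_every

    def _on_step(self) -> bool:
        # `num_timesteps` is maintained by SB3's learn loop.
        if self.num_timesteps % self.solve_every == 0:
            solve_lp(self.model.policy.q_net)
        return True  # returning False would abort training


model = DQN("MlpPolicy", "CartPole-v1")
model.learn(total_timesteps=1_000, callback=LPCallback(solve_every=50))
```

A callback keeps the LP step decoupled from the training internals; if finer control is needed (e.g. feeding the LP solution back into the update), overriding `train` as in the earlier sketch is the alternative.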