x-tu / GGF-wcMDP


Fixing DQN policies to LP #10

Open x-tu opened 1 year ago

x-tu commented 1 year ago

Current issue: The model training process is packed inside the learn function, which loops over time steps and calls the collect_rollouts function and then the train function. To connect to an LP solver, returning the policy network at every time step would not be very efficient; it is better to modify the relevant function inside the SB3 library.

Solution (steps to go): at each iteration, call the policy function and solve the LP.
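
For comparison, a callback can also call the policy at a fixed interval without modifying SB3 internals. A minimal sketch, assuming stable-baselines3 and gymnasium are installed; solve_lp is a hypothetical stand-in for the actual policy-to-LP step, while DQN, BaseCallback, num_timesteps, and policy.q_net are standard SB3 API:

```python
from stable_baselines3 import DQN
from stable_baselines3.common.callbacks import BaseCallback


def solve_lp(q_net):
    """Placeholder: convert the Q-network into a policy and solve the LP."""
    pass


class LPPolicyCallback(BaseCallback):
    """Call the LP step every `freq` environment steps during learn()."""

    def __init__(self, freq: int = 50, verbose: int = 0):
        super().__init__(verbose)
        self.freq = freq

    def _on_step(self) -> bool:
        if self.num_timesteps % self.freq == 0:
            solve_lp(self.model.policy.q_net)  # current (online) Q-network
        return True  # returning False would abort training


model = DQN("MlpPolicy", "CartPole-v1", verbose=0)
model.learn(total_timesteps=5_000, callback=LPPolicyCallback(freq=50))
```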

x-tu commented 1 year ago

Description of the calling sequence: DQN.learn (inherited from the learn function of the base class OffPolicyAlgorithm) -> OffPolicyAlgorithm.learn (contains a time-step loop that calls collect_rollouts to collect data and make predictions, while train performs the gradient updates of the policy network) -> DQN.train (customized after class inheritance).

We do not need to investigate how the policy-network update itself is implemented; we just call the policy_convertor function after the train function is called.
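
A minimal sketch of that hook, assuming SB3's DQN (whose train signature is train(gradient_steps, batch_size)); the policy_convertor body here is only a placeholder for the repo's actual conversion logic:

```python
from stable_baselines3 import DQN


class DQNToLP(DQN):
    def train(self, gradient_steps: int, batch_size: int = 100) -> None:
        # Run the standard DQN gradient update(s) first ...
        super().train(gradient_steps=gradient_steps, batch_size=batch_size)
        # ... then convert the freshly updated Q-network for the LP solver.
        self.policy_convertor()

    def policy_convertor(self) -> None:
        q_net = self.policy.q_net  # online Q-network (not the target network)
        # Placeholder: derive the policy from q_net and hand it to the LP.
        pass


model = DQNToLP("MlpPolicy", "CartPole-v1", train_freq=4, learning_starts=100)
model.learn(total_timesteps=5_000)
```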

Current Behavior

  1. The policy_convertor function is called every $M = 50$ time steps (batch training? the target policy network is not updated after a fixed number of time steps).

    Q1: What controls the fixed $M$ time steps? A: it is not explicitly specified; where is the value assigned? Q2: When is the time step updated? (Not after learn/train; see the inspection sketch below this list.)

    Guess: inside the learn function, before the loop begins

  2. (TBC)
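
For reference, if the 50-step cadence comes from SB3's own scheduling rather than from code in this repo, the relevant constructor arguments would be train_freq (environment steps collected between train() calls), learning_starts, and target_update_interval; num_timesteps itself is incremented inside collect_rollouts, i.e. within the loop in learn. A quick inspection sketch, assuming the SB3 API (the train_freq=50 value here is only an assumption, not necessarily what the repo sets):

```python
from stable_baselines3 import DQN

model = DQN("MlpPolicy", "CartPole-v1", train_freq=50, learning_starts=0)
print(model.train_freq)              # env steps collected between train() calls
print(model.learning_starts)         # steps gathered before training starts
print(model.target_update_interval)  # steps between target-network copies
```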