yandexdataschool / AgentNet

Deep Reinforcement Learning library for humans
http://agentnet.rtfd.org/

policy_estimators param is weird #93

Closed pshvechikov closed 7 years ago

pshvechikov commented 7 years ago

policy_estimators is really weird in both documentation and usage.

The documentation is very unclear about it (especially given that action_layers is a separate parameter). All it says is: "policy_estimators (lasagne.Layer child instance (e.g. Q-values) or a tuple of such instances (e.g. state value + action probabilities for a2c)) – whatever determines agent policy". This explains almost nothing about how to use it or what should be passed in.
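
For example, here is my best guess at what the a2c case from that docstring is supposed to look like; everything except policy_estimators and action_layers (the import paths, the other keyword names, the resolver) is a guess on my part, not something the docs actually spell out:

    from lasagne.layers import InputLayer, DenseLayer
    from lasagne.nonlinearities import softmax
    from agentnet.agent import Agent                      # import path guessed
    from agentnet.resolver import ProbabilisticResolver   # import path guessed

    # simple feedforward agent: observation -> hidden -> (V(s), pi(a|s))
    observation = InputLayer((None, 8), name="observation")
    hidden = DenseLayer(observation, 64, name="hidden")

    state_value = DenseLayer(hidden, 1, nonlinearity=None, name="V(s)")
    action_probs = DenseLayer(hidden, 4, nonlinearity=softmax, name="pi(a|s)")
    action = ProbabilisticResolver(action_probs, name="action picker")

    agent = Agent(observation_layers=observation,                  # keyword name guessed
                  policy_estimators=(state_value, action_probs),   # "state value + action probabilities for a2c"
                  action_layers=action)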

The code is even stranger. The most important part of Agent is get_react_function. It uses self.policy (the saved policy_estimators instance), returning the symbolic computation of the self.policy layer as a third return value (along with a duplicated self.action_layers; I am not sure why action_layers needs to be duplicated). However, get_react_function is used only once across AgentNet's codebase, and this exclusive usage of self.get_agent_reaction completely discards the computed self.policy (named applier_policy in that scope).

So the evolution of policy_estimators is as follows:

  1. policy_estimators becomes self.policy
  2. self.policy is symbolically computed and assigned to new_outputs
  3. new_outputs, returned from get_agent_reaction, becomes applier_policy
  4. applier_policy is completely ignored

Thus the only places that truly use self.policy are the parameter getters:

    def get_all_params(self, **kwargs):
        """calls lasagne.layers.get_all_params(all_agent_layers,**kwargs)"""
        layers = list(self.agent_states) + self.policy + self.action_layers
        return lasagne.layers.get_all_params(layers, **kwargs)

    def get_all_param_values(self, **kwargs):
        """calls lasagne.layers.get_all_param_values(all_agent_layers,**kwargs)"""
        layers = list(self.agent_states) + self.policy + self.action_layers
        return lasagne.layers.get_all_param_values(layers, **kwargs)

    def set_all_param_values(self, values, **kwargs):
        """calls lasagne.layers.set_all_param_values(all_agent_layers,values,**kwargs)"""
        layers = list(self.agent_states) + self.policy + self.action_layers
        return lasagne.layers.set_all_param_values(layers, values, **kwargs)
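
In practice that seems to matter only when you snapshot or restore the agent's weights, e.g. (assuming an agent built as in my guess above):

    import pickle

    # save all weights, including those of the policy_estimators layers
    with open("agent_weights.pkl", "wb") as f:
        pickle.dump(agent.get_all_param_values(), f)

    # ... and restore them later
    with open("agent_weights.pkl", "rb") as f:
        agent.set_all_param_values(pickle.load(f))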

But the rationale for keeping an additional, unclear parameter on Agent, which arguably increases clutter, is questionable given that it is used mainly to return its parameters and for tracking purposes in self.as_recurrence. Do we really need it?

justheuristic commented 7 years ago

Sorry for such a delay. First, thanks for reporting the difficulties with reading the code. I've added a small tip (https://github.com/yandexdataschool/AgentNet/blob/master/agentnet/agent/mdp_agent.py#L35-L36) that should help with reading the code, as well as several bigger clarifications.

The simpler explanation of what policy_estimators is: "if you need something when you write the loss function, you throw it into policy_estimators".
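
For example, roughly, in plain Theano (the variables below are stand-ins for what a session rollout would give you from policy_estimators; names and shapes are illustrative, not the library's actual API):

    import theano
    import theano.tensor as T

    # stand-ins for what a rollout would yield from policy_estimators
    qvalues_seq = T.tensor3("qvalues")   # [batch, time, n_actions]
    actions_seq = T.imatrix("actions")   # [batch, time], integer action ids
    rewards_seq = T.matrix("rewards")    # [batch, time]
    n_actions, gamma = 4, 0.99

    # Q(s_t, a_t) for the actions actually taken, picked out with a one-hot mask
    actions_onehot = T.extra_ops.to_one_hot(actions_seq.ravel(), n_actions)
    actions_onehot = actions_onehot.reshape(
        (qvalues_seq.shape[0], qvalues_seq.shape[1], n_actions))
    taken_qvalues = (qvalues_seq * actions_onehot).sum(axis=-1)

    # one-step TD target r_t + gamma * max_a Q(s_{t+1}, a), with no gradient through it
    target = rewards_seq[:, :-1] + gamma * T.max(qvalues_seq[:, 1:], axis=-1)
    loss = T.mean((taken_qvalues[:, :-1]
                   - theano.gradient.disconnected_grad(target)) ** 2)

The point is just that the Q-values have to come out of the rollout symbolically for such a loss to be expressible, and that is what policy_estimators is for.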

It is indeed thrown away in get_react_function, since you don't need to return Q-values when you merely want to take an action. The get_agent_reaction method is indeed called in only one place, which is exercised every time you actually want to interact with any env. The catch is that it is required every time you want to do something outside gym.

The real usage for

By the way, what do you think would be a more intuitive name for the parameter?

pshvechikov commented 7 years ago

If this parameter is used for getting the outputs of some (not necessarily policy-related) layers, maybe it should be called layers_to_track or something similar.

Also, I don't really think the tip to study recurrence.py first is that helpful.

Sorry for any inconvenience.


justheuristic commented 7 years ago

I added some clarifications in both Agent and Recurrence. In Recurrence the parameter is called tracked_outputs. I will migrate Agent to that name through the regular deprecation procedure.
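
Roughly, off the top of my head (don't hold me to the exact keyword names besides tracked_outputs), it reads like this on the Recurrence side:

    from lasagne.layers import InputLayer, DenseLayer
    from agentnet.agent import Recurrence   # import path from memory

    # one-step graph: previous memory -> new memory -> a value head we want to track
    prev_memory = InputLayer((None, 32), name="previous memory")
    new_memory = DenseLayer(prev_memory, 32, name="new memory")
    state_value = DenseLayer(new_memory, 1, nonlinearity=None, name="V(s)")

    rec = Recurrence(
        state_variables={new_memory: prev_memory},   # keyword name from memory
        tracked_outputs=[state_value],               # the renamed policy_estimators
        n_steps=10,                                  # keyword name from memory
    )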