yandexdataschool / Practical_RL

A course in reinforcement learning in the wild

Week2 Seminar-VI frozen lake problem #496

Closed Dikshuy closed 2 years ago

Dikshuy commented 2 years ago

In the frozen lake problem, I am getting an assertion error about an invalid action while computing the state values. I'm using the mdp.py file to get the next states and possible actions from a given state. The error AssertionError: cannot do action left from state (0, 0) implies the agent can't take action left from the state (0, 0), yet when I print mdp.get_possible_actions((0, 0)) it returns ('left', 'down', 'right', 'up'). The transition probability is zero for this transition, but I can't use that as a condition either, because checking it raises the same assertion error. My other functions seem correct, since they give the right results for the MDP above. Can anyone please tell me what needs to be done, or is it an issue in the frozen lake problem itself? I have attached the log as well:

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-49-2df897c74778> in <module>()
----> 1 state_values = value_iteration(mdp)

4 frames
<ipython-input-41-b1cacaad4da4> in value_iteration(mdp, state_values, gamma, num_iter, min_difference)
      7     new_state_values = {}
      8     for state in mdp.get_all_states():
----> 9       new_state_values[state] = get_new_state_value(mdp, state_values, state, gamma)
     10 
     11       assert isinstance(new_state_values, dict)

<ipython-input-48-7fb51cc228af> in get_new_state_value(mdp, state_values, state, gamma)
      5   state_value = -np.inf
      6   for action in mdp.get_possible_actions(state):
----> 7     state_value = max(state_value, get_action_value(mdp, state_values, state, action, gamma))
      8   print(state_value)
      9   return state_value

<ipython-input-46-4570b47c3b7c> in get_action_value(mdp, state_values, state, action, gamma)
      5   qsa = 0
      6   for s in mdp.get_next_states(state, action):
----> 7     print(mdp.get_transition_prob(str(state), action, str(s)))
      8     if mdp.get_transition_prob(str(state), action, str(s)):
      9       qsa += mdp.get_transition_prob(str(state), action, str(s)) * (mdp.get_reward(str(state), action, str(s)) + gamma*state_values[s])

/content/mdp.py in get_transition_prob(self, state, action, next_state)
     76     def get_transition_prob(self, state, action, next_state):
     77         """ return P(next_state | state, action) """
---> 78         return self.get_next_states(state, action).get(next_state, 0.0)
     79 
     80     def get_reward(self, state, action, next_state):

/content/mdp.py in get_next_states(self, state, action)
     71     def get_next_states(self, state, action):
     72         """ return a dictionary of {next_state1 : P(next_state1 | state, action), next_state2: ...} """
---> 73         assert action in self.get_possible_actions(state), "cannot do action %s from state %s" % (action, state)
     74         return self._transition_probs[state][action]
     75 

AssertionError: cannot do action left from state (0, 0)
Dikshuy commented 2 years ago

Never mind, I was able to resolve this issue! I was converting the states to strings, i.e. passing str(state), action, str(s) to the MDP methods, which was causing the trouble.
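For anyone hitting the same assertion: the MDP keeps its transition table keyed by the original state objects (tuples like (0, 0) for frozen lake), so wrapping them in str() makes every lookup miss and get_possible_actions comes back empty for the stringified state. Below is a minimal sketch of get_action_value without the str() calls, following the signature and the mdp.py interface visible in the traceback above (a sketch, not an official solution):

```python
def get_action_value(mdp, state_values, state, action, gamma):
    """Compute Q(s, a) = sum over s' of P(s' | s, a) * (r(s, a, s') + gamma * V(s')).

    `state` and the next states must be passed to the MDP methods unchanged
    (tuples like (0, 0) for frozen lake), not converted with str().
    """
    qsa = 0.0
    # get_next_states returns {next_state: P(next_state | state, action)}
    for next_state, prob in mdp.get_next_states(state, action).items():
        reward = mdp.get_reward(state, action, next_state)
        qsa += prob * (reward + gamma * state_values[next_state])
    return qsa
```

Iterating over mdp.get_next_states(state, action).items() also avoids the separate get_transition_prob call, since that same dictionary already holds the probabilities.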