worldmodels / worldmodels.github.io

World Models

Question #2

Open jpeg729 opened 6 years ago

jpeg729 commented 6 years ago

According to my understanding, the controller C at time t has two inputs, zt and ht, where ht is the prediction produced by M from zt-1, ht-1 and at-1.

So basically, the controller makes its decisions based on...

  1. zt: the current state of the world,
  2. M(zt-1, at-1, ht-1): i.e. what it thought the current state of the world would be given the previous state of the world + its previous chosen action + the previous hidden state of M.

It seems counter-intuitive to me that the current state of the world together with the expectation of the current state of the world should be a sufficient basis for a strong controller. It doesn't seem to make full use of the predictor's capabilities.

Is this correct?

hardmaru commented 6 years ago

Hi @jpeg729,

That is a great question!

You are correct: the controller's calculation of at is based on zt and ht, and it doesn't use ht+1, since ht+1 needs at to be calculated first, which is a kind of chicken-and-egg problem.

In that sense, one can view ht as a compressed representation of all of the zi and ai for i ∈ {0 ... t-1}. Thus in addition to the current observation zt, the controller's decision will be based on this compressed representation of the entire history up to the point at which it has to decide on the action at.
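
To make that concrete, in the paper C is just a single linear layer acting on the concatenation of zt and ht. A minimal sketch of that computation (the parameter names W_c, b_c and the tanh squashing are illustrative, not taken from the released code):

# minimal sketch of the linear controller, acting on [zt, ht]
import numpy as np

def controller_action(z_t, h_t, W_c, b_c):
  # concatenate the VAE latent z_t with the RNN hidden state h_t,
  # then apply a single linear map to get the action a_t
  x = np.concatenate([z_t, h_t])
  a_t = W_c @ x + b_c
  return np.tanh(a_t)  # squashing to a valid action range is an assumption here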

This is something I thought about when constructing the setup and the algorithm, since ideally we would want to use the current h. For the experiments I tried, this seemed to be good enough, though the fact that a more complicated controller (with an extra hidden layer) improves the results quite a bit in the Car Racing setup suggests that there is more we can do here.

One thing I have thought about trying, but haven't gotten to yet, is to have the controller calculate a temporary ā = controller.action([z, h]), roll forward using this temporary ā to arrive at a temporary h̄ = rnn.forward([ā, z, h]), and see if we can get a policy with this rolled-forward temporary hidden state. It doesn't look as elegant as the current approach though. Other methods of rolling forward to do planning might also help, at the expense of complexity and elegance.
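
In rough pseudocode, assuming the same interfaces as the rollout example below (controller.action, rnn.forward), that two-pass idea would look something like this. It is only a sketch of the untried idea above, not something in the released code:

# sketch of the (untried) two-pass idea: act once to roll the RNN
# forward, then act again using the rolled-forward hidden state
def two_pass_action(z, h, rnn, controller):
  a_temp = controller.action([z, h])     # temporary action ā from the current h
  h_temp = rnn.forward([a_temp, z, h])   # temporary rolled-forward hidden state h̄
  return controller.action([z, h_temp])  # final action uses z and the temporary h̄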

Alternatively, we can modify the RNN's roll-forward operation to depend only on z, and not on a, but have the MDN layer's prediction depend on a instead, so that h can be rolled forward and used in the same time step to make the prediction. This might be the best option, I feel, if we really want to use the forward hidden state. While this allows the use of a more current h, the MDN (currently just a linear layer) will need to have more capacity to compensate for the extra processing needed there.

# modification to use the forward h:
def rollout(controller):
  ''' env, rnn, vae are global variables '''
  obs = env.reset()
  h = rnn.initial_state()
  done = False
  cumulative_reward = 0
  while not done:
    z = vae.encode(obs)
    h = rnn.forward([z, h])
    # (Note: MDN-layer modified to rely on a and h, not just h)
    a = controller.action([z, h])
    obs, reward, done = env.step(a)
    cumulative_reward += reward
  return cumulative_reward

If you come up with a more elegant way to calculate the forward state, feel free to share!

Best.

jpeg729 commented 6 years ago

I must revise my opinion of the usefulness of the RNN+MDN to the controller. If we conceptually separate the RNN and the MDN, then we can simplistically consider zt to be an encoding of object positions, and RNN(zt, ht-1, at-1) to be an encoding of object velocities and accelerations.

One remark: if the RNN has no knowledge of previous actions, then it will be confused by any changes that result directly from the player's action. It would seem more logical to do RNN(zt, at-1, ht-1), since that would allow the RNN to more accurately calculate the velocity and acceleration of the player's glyph.

I have been digging into the demo source code to verify certain details, and I noticed this...

In an adversarial setting, it may make sense to provide the controller with zt and expected_zt. This sort of information could be valuable since it allows the controller to measure the unexpectedness of the opponent's actions: "I thought he was going to do this, but he actually did that."
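
To make the idea concrete, a hypothetical sketch (the names expected_z, mdn_pi, mdn_mu are mine and are not in the paper or the demo code) could look like this, where the MDN's mixture-weighted mean serves as expected_zt and the difference measures the unexpectedness:

# hypothetical sketch: measure how far the observed z_t is from the
# MDN's expected z_t, and expose that to the controller as an extra input
import numpy as np

def expected_z(mdn_pi, mdn_mu):
  # mdn_pi, mdn_mu: (num_mixture, z_dim) mixture weights and means
  # predicted by M from (z_{t-1}, a_{t-1}, h_{t-1})
  return np.sum(mdn_pi * mdn_mu, axis=0)

def surprise(z_t, mdn_pi, mdn_mu):
  # could be concatenated with [z_t, h_t] before the controller
  return z_t - expected_z(mdn_pi, mdn_mu)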

hardmaru commented 6 years ago

Hi @jpeg729

Thanks for your comments and reply.

I just want to clarify one thing with you. The controller receives the sampled zt, rather than the expected zt.

The sampling is achieved in two parts. For example, in the DoomRNN code:

(1) sample the individual z's inside each of the 5 mixtures, in line 622:

var zs = math.add(mu, math.multiply(std, epsilon)) // 5 possible z's

(doing this inside GPU ops instead was faster, which is why the sampling in line 682 was commented out)

(2) sample which mixture we should choose, in lines 673-679:

idx = sample_softmax(normalize(sub_p));
if (idx < 0) { // sampling error (due to bad precision in gpu mode)
  idx = randi(0, num_mixture);
}

k = num_mixture*i+idx;
next_z[i] = zs[k]; // + std[k] * epsilon[i]; (no need to sample here, already done inside deeplearn.js op)

I originally intended to sample the idx (which mixture we use) inside GPU mode as well for efficiency, but at the time of development there was a weird bug in deeplearn.js where it would only work in Chrome but not in iOS/Safari, so I had to resort to sampling in normal JS outside of GPU/deeplearn.js. Unfortunately, I think older laptops without WebGL (v1) support will not be able to run these demos, as pure JS on the raw CPU didn't seem fast enough. The deeplearn.js (now tensorflow.js) engineers know about this bug and the workaround I did for iOS/Safari, and it should be solved in the (near) future.
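
Putting the two steps together, here is a rough Python sketch of the same sampling logic (illustrative only, not the demo code; in the demo, step 1 runs inside deeplearn.js GPU ops, and temperature handling is omitted here):

# rough sketch of the two-step MDN sampling described above
import numpy as np

def sample_next_z(mixture_logits, mu, std):
  # mixture_logits, mu, std: arrays of shape (z_dim, num_mixture)
  z_dim = mu.shape[0]
  next_z = np.zeros(z_dim)
  for i in range(z_dim):
    # step 1: sample a candidate z from every mixture component
    zs = mu[i] + std[i] * np.random.randn(mu.shape[1])
    # step 2: sample which mixture to use for this dimension
    pi = np.exp(mixture_logits[i] - mixture_logits[i].max())
    pi /= pi.sum()
    idx = np.random.choice(mu.shape[1], p=pi)
    next_z[i] = zs[idx]
  return next_z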

Let me know if any of this is unclear, or if you have any further insights and comments!

Best.

(btw, please don't "close" this issue since I want your comments and this discussion to be clearly visible to other readers in the future)

AliBaheri commented 5 years ago

Hi @hardmaru, in your first response to the first question in this thread there is a statement which has confused me:

In that sense, one can view ht as a compressed representation of all of the zi and ai for i ∈ {0 ... t-1}. Thus in addition to the current observation zt, the controller's decision will be based on this compressed representation of the entire history up to the point at which it has to decide on the action at.

If ht is just a compressed representation of what has been done in the past, and zt is the current observation coming from V, then which component has the role of computing the future prediction?