zt95 / infinite-horizon-off-policy-estimation


Negative loss #1

Closed clvoloshin closed 5 years ago

clvoloshin commented 5 years ago

Hi! I have a question about how the loss is defined.

In the paper, the loss takes the form D(w) = L^2 = E[ d(w, s, a, s') d(w, s1, a1, s1') k(s', s1') ]. In other words, it has the form E[x^T K y] for x = d(w, s, a, s') and y = d(w, s1, a1, s1'). This suggests x^T K y should always be positive (since E[x^T K y] = L^2 > 0). However, empirically, when running the sumo code, I'm seeing negative values for loss_xx. I'm very confused by this. Is this a bug, or is a negative loss allowed?

[screenshot of training output, 2019-03-08]

The screenshot shows loss_xx and self.loss for a few epochs of training; notice that the loss is negative in some cases.
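
For concreteness, here is a minimal numpy sketch (names are illustrative, not the repo's actual code) of the cross-sample form I believe loss_xx is computing; on a finite batch its sign is not constrained:

```python
import numpy as np

rng = np.random.default_rng(0)

def rbf_kernel(a, b, sigma=1.0):
    # Gaussian kernel matrix K[i, j] = k(a_i, b_j)
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

n, dim = 32, 4
s_next_1 = rng.normal(size=(n, dim))  # next states from subsample 1
s_next_2 = rng.normal(size=(n, dim))  # next states from subsample 2
x = rng.normal(size=n)                # stand-in for d(w, s, a, s') on subsample 1
y = rng.normal(size=n)                # stand-in for d(w, s1, a1, s1') on subsample 2

K = rbf_kernel(s_next_1, s_next_2)
loss_xy = x @ K @ y / n ** 2          # cross-sample x^T K y: may be negative on a given draw
print(loss_xy)
```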

clvoloshin commented 5 years ago

On second thought, Algorithm 1 in the appendix actually takes the form D(w) = x^T K x, but that isn't what the code is doing: the code computes self.loss_xx = D(w) = x^T K y. Please advise.

zt95 commented 5 years ago

Good point, and sorry for the confusion. Yes, you're right: we implemented a more general framework here so that we can also try double-sampling methods, which may result in a negative loss estimate. Typically (and this is also what we report in the paper) we use the V-statistic, i.e. the same batch of samples on both sides, to estimate the quadratic loss. You can change the train function to feed the same subsample in for both arguments.
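
A minimal sketch of the V-statistic version, assuming a Gaussian kernel and illustrative names (not the actual train code): with the same subsample on both sides, the estimate is a quadratic form in a positive semidefinite kernel matrix, hence nonnegative.

```python
import numpy as np

rng = np.random.default_rng(0)

def rbf_kernel(a, b, sigma=1.0):
    # Gaussian kernel matrix K[i, j] = k(a_i, b_j)
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

n, dim = 32, 4
s_next = rng.normal(size=(n, dim))    # next states from a single subsample
x = rng.normal(size=n)                # stand-in for d(w, s, a, s') on that subsample

K = rbf_kernel(s_next, s_next)        # PSD kernel matrix over the same next states
loss_xx = x @ K @ x / n ** 2          # V-statistic x^T K x >= 0
print(loss_xx)
```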

clvoloshin commented 5 years ago

Right -- so I just change loss_xx to x^T K x, where K = [k(s'_i, s'_j)] is built from the next states of the same batch. Great, thank you!