zt95 / infinite-horizon-off-policy-estimation


Confusion about Algorithm 1 output #2

Open clvoloshin opened 5 years ago

clvoloshin commented 5 years ago

Hi!

Slightly confused about the output of Algorithm 1. It says that the off-policy estimate of Pi1 is given by v^T r / sum(v). Consider the case where the reward r is identically -1 for every transition (s, a, r, s'). Then the OPE estimate evaluates to -sum(v)/sum(v) = -1, which is the correct average reward.
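For concreteness, a tiny numeric check of that algebra (the weights below are random stand-ins, not output of the actual algorithm):

```python
import numpy as np

# With r identically -1, the normalized estimate v^T r / sum(v) is -1
# for any positive weights v. The weights here are random stand-ins.
rng = np.random.default_rng(0)
v = rng.uniform(0.1, 2.0, size=1000)   # hypothetical importance weights
r = -np.ones_like(v)                   # reward of -1 on every transition

print(v @ r / v.sum())                 # -1.0 (up to floating point), regardless of v
```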

However, if we're considering a (long) variable-length finite horizon with gamma = 1, then this algorithm clearly will not work. Do you suggest using the second algorithm instead, with a gamma very close to 1?

Thanks!!!

clvoloshin commented 5 years ago

In the case that I should use the second algorithm, what is f(s_0) in the RKHS case?

zt95 commented 5 years ago

Thank you again for your interest.

1. For the average-reward case, our algorithm can still work, though it is not theoretically sound there. But in practice (as in the results shown in the experimental part) it can still estimate the average reward.

2. For the discounted case, you can slightly change the loss function without changing the whole framework: feed in a placeholder called 'isStart' indicating whether the transition pair is the initial transition of its trajectory, then redefine the \delta in the paper (x in the code) as

x = (1 - self.isStart) * w * self.policy_ratio + self.isStart * norm_w - w_next

and sample each transition (s_i, a_i, s_i') with probability proportional to \gamma^i, for discount factor \gamma, when feeding in the data.
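If it helps, a rough sketch of those two changes in plain NumPy (not the actual code in this repo; the function names and the sampling helper are just illustrative):

```python
import numpy as np

def delta_discounted(w, w_next, policy_ratio, is_start, norm_w):
    """Modified delta (the `x` in the code) as described above.

    w, w_next    -- estimated density ratios at s and s'
    policy_ratio -- pi1(a|s) / pi0(a|s)
    is_start     -- 1.0 if (s, a, s') is the first transition of its trajectory, else 0.0
    norm_w       -- normalization term that replaces the constant 1
    """
    return (1.0 - is_start) * w * policy_ratio + is_start * norm_w - w_next

def sample_step_indices(traj_len, gamma, size, seed=0):
    """Sample within-trajectory step indices i with probability proportional to gamma**i."""
    probs = gamma ** np.arange(traj_len)
    probs /= probs.sum()
    return np.random.default_rng(seed).choice(traj_len, size=size, p=probs)
```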

I may upload the discounted part when I have time, but you can definitely make this small revision yourself without much effort.

clvoloshin commented 5 years ago

Great! Awesome.

What is a quick justification for x = (1 - self.isStart) * w * self.policy_ratio + self.isStart * norm_w - w_next? Maybe it's obvious, but I don't quite see it.

zt95 commented 5 years ago

In equation (15) of the paper, with probability (1 - \gamma) we let \delta = 1 - w_next, and since we need to normalize, the 1 becomes norm_w instead. (I think there may be a typo in Algorithm 2, where it should read (1 - w_0) f_0.)
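Written out, the intuition is roughly this (a sketch, assuming the discounted occupancy satisfies the standard fixed-point equation):

```latex
d_{\pi,\gamma}(s') \;=\; (1-\gamma)\, d_0(s') \;+\; \gamma \sum_{s,a} d_{\pi,\gamma}(s)\,\pi(a\mid s)\,P(s'\mid s,a),
\qquad
\delta \;=\;
\begin{cases}
1 - w_{\text{next}} & \text{with prob. } 1-\gamma \ (\text{initial transition}),\\
w \cdot \rho - w_{\text{next}} & \text{with prob. } \gamma \ (\text{regular transition}),
\end{cases}
```

and after self-normalization the constant 1 is replaced by norm_w, which gives the expression for x above.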

zt95 commented 5 years ago

and self.isStart is a boolean, 1 or 0

clvoloshin commented 5 years ago

Ah, I see. Will the discounting still work even if gamma = 1? In other words, what if I care about the (non-discounted) sum of rewards over a finite horizon?

clvoloshin commented 5 years ago

With this new definition of \delta (or x), I'm still getting the average reward. The reason is as I stated earlier: if r is identically -1, then the output is -\sum(v) / \sum(v) = -1, the average reward (regardless of discounting). So I'd have to change this somehow.

Suppose the trajectory length is (on average) 100 under PI0, but (on average) 150 under PI1. Then what I'd like is OPE(PI1) ~= -150 (assuming a reward of -1 at every step). However, -\sum(v)/\sum(v) = -1. Hm...

Should the output be v^T r / N, where N is the number of trajectories?

zt95 commented 5 years ago

I'm not sure I understand your question correctly. So you are proposing an empirical estimator where we divide by N, the number of trajectories.

What we are doing is dividing by the sum of all the importance weights, which gives a normalized (weighted importance sampling) style of estimator. In both practice and theory it leads to much lower variance.
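A minimal illustration of the two normalizations on synthetic numbers (the weights and rewards below are made up, only to show the formulas):

```python
import numpy as np

rng = np.random.default_rng(0)
v = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)  # made-up importance weights
r = rng.normal(loc=-1.0, scale=0.5, size=10_000)     # made-up per-transition rewards

per_count  = v @ r / len(v)   # divide by the number of samples
normalized = v @ r / v.sum()  # divide by the sum of the weights (weighted IS style)
print(per_count, normalized)
```

The normalized form trades a small bias for lower variance, which is the point of the weighted importance sampling reference below.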

zt95 commented 5 years ago

For your reference:

https://papers.nips.cc/paper/5249-weighted-importance-sampling-for-off-policy-learning-with-linear-function-approximation.pdf