Closed: fyqqyf closed this issue 6 months ago
The return is not calculated using two steps' rewards.
Decision Diffuser's implementation computes the return from `start` to the end of the current trajectory. Here, the return is computed similarly, but with the last `self.horizon - 1` steps removed. The reason is that I pad a sequence of that length to the end of each trajectory, and since the padded values are not always zero (e.g., repetitions of the last value), they should not be used when calculating the returns.
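For concreteness, here is a minimal sketch of that scheme, assuming a standard discounted return and trajectories padded with `horizon - 1` repeated steps at the end; the names (`discount`, `horizon`, `max_path_length`) are illustrative, not the repository's verbatim code:

```python
import numpy as np

discount = 0.99
horizon = 32
max_path_length = 1000

# Per-step discount factors, precomputed once.
discounts = discount ** np.arange(max_path_length)

def compute_return(rewards, start):
    """Discounted return from `start` to the end of the *real* trajectory.

    `rewards` is assumed to end with `horizon - 1` padded entries
    (repetitions of the last real value), which are dropped so they
    do not contaminate the return.
    """
    real_rewards = rewards[start : len(rewards) - (horizon - 1)]
    return (discounts[: len(real_rewards)] * real_rewards).sum()
```

With padding removed entirely (as in the follow-up comment below), the slice would simply become `rewards[start:]`.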
Thank you for your response! I have deprecated the use of padding in the implementation, since it was what led to the misunderstanding. I apologize for any confusion this may have caused :)
In `sequence.py`, I think the code calculates the return using only two steps' rewards, but in Decision Diffuser they use `self.fields.rewards[path_ind, start : ]` for the return. Is there a specific reason for this setup? Please let me know ;)
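To make the contrast concrete, here is a hypothetical side-by-side of the two variants; the exact two-step expression in `sequence.py` is not quoted in this thread, so part (a) is only a guess at its shape:

```python
import numpy as np

rewards = np.linspace(1.0, 0.0, num=100)  # toy rewards for one trajectory
discount = 0.99
start = 10

# (a) A two-step return, the pattern the question attributes to sequence.py
#     (hypothetical reconstruction, not the actual code).
two_step_return = rewards[start] + discount * rewards[start + 1]

# (b) The return over the whole remaining trajectory, as in Decision
#     Diffuser's self.fields.rewards[path_ind, start:].
tail = rewards[start:]
full_return = (discount ** np.arange(len(tail)) * tail).sum()
```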