zbzhu99 / madiff

Implementation of "MADiff: Offline Multi-agent Learning with Diffusion Models"
MIT License

Issue about the `return` in `sequence.py` #3

Closed · fyqqyf closed 6 months ago

fyqqyf commented 6 months ago

In `sequence.py`, I think the code only calculates the return over two steps:

```python
if self.include_returns:
    rewards = self.fields.rewards[path_ind, start : -self.horizon + 1]  # start and start+1
    discounts = self.discounts[: len(rewards)]
    returns = (discounts * rewards).sum(axis=0).squeeze(-1)
    returns = np.array([returns / self.returns_scale], dtype=np.float32)
    batch["returns"] = returns
```

But in Decision Diffuser, they use `self.fields.rewards[path_ind, start:]` for the return.

Is there a specific reason for this setup? Please let me know ;)
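
To make the comparison concrete, here is a toy sketch (the reward values, `horizon`, and `start` are made up for illustration, not taken from the repo) showing what each slice selects on a padded rewards buffer:

```python
import numpy as np

# Toy padded rewards for one episode (made-up values): 5 real steps
# followed by horizon - 1 = 3 padded steps repeating the last reward.
horizon = 4
rewards = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 5.0, 5.0, 5.0])
start = 1

print(rewards[start:])                # Decision Diffuser-style slice: [2. 3. 4. 5. 5. 5. 5.]
print(rewards[start : -horizon + 1])  # slice used in sequence.py:     [2. 3. 4. 5.]
```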

zbzhu99 commented 6 months ago

The return is not calculated using only two steps' rewards.

Decision Diffuser's implementation computes the return from `start` to the end of the current trajectory. Here, the return is computed similarly, but with the last `self.horizon - 1` steps removed. The reason is that I pad a sequence of that length to the end of each trajectory:

https://github.com/zbzhu99/madiff/blob/3e8362dccd92f72ffc9937e40e24d86eed6d7a03/diffuser/datasets/sequence.py#L137-L172

And since the padded values are not always zero (e.g., repetitions of the last value), they should not be used when calculating the returns.
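
For illustration, here is a minimal sketch (toy reward values; it assumes the last `horizon - 1` entries of the buffer are repeated-last-value padding) of how including the padded rewards would distort the discounted return:

```python
import numpy as np

horizon = 4
discount = 0.99

# Toy trajectory: 6 real steps, then horizon - 1 = 3 padded steps that
# repeat the final reward (values are made up for illustration).
rewards = np.array([1.0, 0.5, 0.0, 0.0, 0.0, 10.0, 10.0, 10.0, 10.0])
start = 0

def discounted_return(r, discount):
    discounts = discount ** np.arange(len(r))
    return (discounts * r).sum()

# Excluding the pad, as sequence.py does with `start : -horizon + 1`:
print(discounted_return(rewards[start : -horizon + 1], discount))  # ~11.0

# Including the pad (slicing from `start` to the end of the padded buffer):
print(discounted_return(rewards[start:], discount))  # ~39.0, inflated by the repeated last reward
```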

fyqqyf commented 6 months ago

Thank you for your response! I had removed the padding in my own implementation, which led to the misunderstanding. I apologize for any confusion this may have caused :)