zaiyan-x / RFQI

Implementation of Robust Reinforcement Learning using Offline Data [NeurIPS'22]
MIT License

Reproduce in mix dataset of Hopper-v3 #3

Closed linhlpv closed 3 months ago

linhlpv commented 3 months ago

Hi @zaiyan-x ,

Thanks for your work.

I'm trying to reproduce your code on the mixed dataset of the Hopper-v3 environment. I began by running generate_offline_data.py to generate the mixed dataset of Hopper-v3, then ran train_rfqi.py to train the agent. However, around 80k iterations the critic loss rises to a high value and max_eta goes to zero.

[screenshots: critic loss and max_eta training curves]

I am quite confused about this behavior. Did you face the same behavior while training on Hopper-v3? Thank you so much and have a nice day. Best, Linh

zaiyan-x commented 3 months ago

Hi Linh,

I did not run into this before. It seems that the ETA network just gave up. The asynchronous updates between the ETA network and the rest of the networks could be the reason. You can notice that once the ETA network gives up, the critic loss becomes high, i.e., your value network no longer estimates the robust value correctly.

My suggestion is to tune the ETA network hyper-parameters a bit. Hope this helps.

Regards,

ZX

linhlpv commented 3 months ago

Thanks @zaiyan-x, let me try this. One more question is about the choice of data-generation method. In the paper, you said that you trained SAC with the model parameter actuator_ctrlrange set to [−0.85, 0.85] and that this leads to a more diverse dataset. I'm just curious about this specific choice. Could you please explain the intuition and reasoning behind training with actuator_ctrlrange set to [−0.85, 0.85]?

Thank you so much and have a good weekend :D . Best, Linh

linhlpv commented 3 months ago

Hi @zaiyan-x ,

I have tried to tune the ETA params a bit, but it didn't work. I realized that in Hopper-v3 the agent can be terminated before the end of the episode (1000 steps), so I incorporated the not_done signal (because the current code on GitHub doesn't have it).

Current version:

```python
target_Q = reward - gamma * torch.maximum(etas - target_Q, etas.new_tensor(0)) + (1 - rho) * etas * gamma
```

Version with not_done (the bootstrapped terms are masked out on terminal transitions):

```python
target_Q = reward + not_done * (-gamma * torch.maximum(etas - target_Q, etas.new_tensor(0)) + (1 - rho) * etas * gamma)
self.loss_log['robust_target_Q'] = target_Q.mean().item()
```
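For anyone else hitting this: the effect of the mask can be checked on toy scalar values. This is a hypothetical pure-Python stand-in for the batched torch line above; the reward, gamma, rho, and eta values here are made up.

```python
def robust_target(reward, target_q, eta, gamma=0.99, rho=0.5, not_done=1.0):
    # Scalar stand-in for the batched torch expression: on terminal
    # transitions (not_done = 0) the bootstrapped eta terms are zeroed out.
    bootstrap = -gamma * max(eta - target_q, 0.0) + (1.0 - rho) * eta * gamma
    return reward + not_done * bootstrap

# Non-terminal transition: the eta terms contribute to the target.
print(robust_target(1.0, target_q=1.5, eta=2.0))                # approx. 1.495
# Terminal transition: only the immediate reward survives.
print(robust_target(1.0, target_q=1.5, eta=2.0, not_done=0.0))  # 1.0
```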

I found that the critic loss no longer grows too large, but the eval reward is smaller than with the current version.

[screenshot: eval reward curves] The red and orange lines are the version with the not_done signal, and the green one is the current version.

I'm quite confused about this result. Do you think the missing not_done signal is the reason for the critic-loss problem? And what could cause the lower eval reward?

Thank you so much :D. Best, Linh

zaiyan-x commented 3 months ago


Hi Linh,

The reason we used a perturbed actuator_ctrlrange is that we wanted to see whether, if we let FQI foresee the perturbation (i.e., during training rather than only at test time), RFQI can still outperform FQI. In other words, the diversity of the training dataset is meant to "help" FQI.
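If it helps intuition: Hopper-v3's actuators have a default ctrlrange of [-1, 1], so narrowing it to [-0.85, 0.85] saturates the behavior policy's extreme actions. A rough numpy illustration of the effect (a sketch only; in practice the range is changed in the MuJoCo model, not by clipping, and the action values below are made up):

```python
import numpy as np

# Actions a SAC policy might emit in the default [-1, 1] range.
actions = np.array([-1.0, -0.9, 0.0, 0.9, 1.0])

# With actuator_ctrlrange = [-0.85, 0.85], the simulator saturates
# anything outside that interval.
applied = np.clip(actions, -0.85, 0.85)
print(applied)  # [-0.85 -0.85  0.    0.85  0.85]
```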

Regards,

Zaiyan

zaiyan-x commented 3 months ago


It could be. I think we had a discussion on this before haha ;) I am glad you found this issue. Yes, I recommend you fix it this way. As for whether this fixes the whole issue, I am not sure; I still think there is something wrong with the training (not on your end, it's just that this algorithm is very difficult to materialize empirically). One thing I am certain of is that max_eta should not decrease to zero. In my implementation, max_eta usually fluctuated within a reasonable range (which made sense to me). You can use this as a signal for whether the training has gone catastrophic. I apologize that I can't give you a definitive suggestion for how to fix this.
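Building on that heuristic, one could flag a collapsed run by checking whether the logged max_eta has stayed near zero for a while. This is a hypothetical helper, not part of the repo; the window and floor values are illustrative.

```python
def eta_collapsed(max_eta_history, window=10, floor=1e-3):
    """Return True when max_eta has stayed below `floor` for the last
    `window` logged values, a sign the ETA network has given up."""
    recent = max_eta_history[-window:]
    return len(recent) == window and max(recent) < floor

# Made-up logs: fluctuating within a reasonable range vs. decayed to zero.
healthy = [0.8, 1.1, 0.9, 1.0, 0.7, 0.95, 1.2, 0.85, 1.05, 0.9]
collapsed = [0.9, 0.5, 0.1] + [0.0] * 10

print(eta_collapsed(healthy))    # False
print(eta_collapsed(collapsed))  # True
```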

linhlpv commented 3 months ago


Thank you for your suggestion. For me, right now it seems that using the not_done signal during training makes the etas stable and keeps them in a reasonable range.

linhlpv commented 3 months ago


Ohh, I understand. Just one follow-up question to make it clearer for me (of course :D ). I see in the paper that you used epsilon-greedy during the data-generation process. Did you add the random actions to make the dataset more diverse, or is there another reason for this choice?

zaiyan-x commented 3 months ago


Yes, it is for making the dataset more diverse. :D
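For reference, the epsilon-greedy collection step can be sketched like this (hypothetical names; `eps` and the action bounds are illustrative, not the values used for the paper's datasets):

```python
import numpy as np

rng = np.random.default_rng(0)

def collect_action(policy_action, low, high, eps=0.3):
    # With probability eps, take a uniform random action instead of the
    # SAC policy's action, so the offline dataset covers more of the
    # state-action space.
    if rng.random() < eps:
        return rng.uniform(low, high, size=np.shape(policy_action))
    return np.asarray(policy_action)

a = collect_action([0.1, -0.2], low=-1.0, high=1.0)
print(a.shape)  # (2,)
```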

Hope this helps,

Zaiyan

linhlpv commented 3 months ago

Yup. Thank you so much 👍

Linh