zaiyan-x / RFQI

Implementation of Robust Reinforcement Learning using Offline Data [NeurIPS'22]
MIT License

Training time #2

Closed linhlpv closed 7 months ago

linhlpv commented 7 months ago

Hi @zaiyan-x ,

I'm running your code and trying to reproduce it for my project, but training takes quite a long time. I checked the log file and found that most of the time goes into the eta optimization, which takes around 3 to 5 seconds per agent training iteration. In your experiments, how long did it take you to train one agent (Hopper, HalfCheetah)? And do you have any suggestions for reducing the training time?

Thank you a lot and have a nice day. Best, Linh

linhlpv commented 7 months ago

Hi @zaiyan-x,

I have added logging of the time spent optimizing the eta network.

[screenshot: timing log for the eta network optimization]

I have tried several tricks to reduce the eta training time in each iteration, the main idea being to reduce the total number of training steps required to optimize the eta network at each iteration, but I have no clue whether they work or not. I'm trying to reproduce your results so I can use RFQI as a strong method in my research. Could you help me with this issue?
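For reference, the timing I log is just a wall-clock wrapper around the eta update; the method names below (`train_eta`, `train_q`) are placeholders for the actual ones in the repo:

```python
import time

def train_step_with_timing(agent, batch):
    """One RFQI training iteration, logging how long the eta update takes.

    `agent.train_eta` and `agent.train_q` are placeholder names for the
    inner eta optimization and the Q-function regression step.
    """
    t0 = time.perf_counter()
    eta_loss = agent.train_eta(batch)   # inner dual-variable (eta) optimization
    eta_time = time.perf_counter() - t0

    t1 = time.perf_counter()
    q_loss = agent.train_q(batch)       # fitted-Q regression step
    q_time = time.perf_counter() - t1

    print(f"eta: {eta_time:.3f}s  q: {q_time:.3f}s")
    return eta_loss, q_loss
```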

Thank you for your time and hope to hear from you soon. Best, Linh

zaiyan-x commented 7 months ago

Hi Linh,

I apologize for the delayed response. Due to ICML duties, I will come back to this later. Thank you for your understanding.

Zaiyan

linhlpv commented 7 months ago

Hi @zaiyan-x ,

Yub. Thank you for your response :3 and have a good day.

Best, Linh

linhlpv commented 7 months ago

Hi @zaiyan-x ,

I apologize for interrupting you. If you have finished your work, could you please help me with this problem? Thank you so much.

Best, Linh

zaiyan-x commented 7 months ago

Dear Linh,

I apologize for the late reply, and thank you for your interest in our work.

The eta optimization did take a long time even when we were training RFQI for the NeurIPS submission. Currently, the scalability of value-based distributionally robust RL algorithms is still an open problem, and unfortunately there is no way to really circumvent this long training. Before landing on the route where we use g(s,a) as a surrogate function for all the etas, we also tried directly solving the dual problem for every (s,a). You can revive this approach if you have time. I apologize that I cannot give you any substantial help with this.
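Roughly, the direct route solves a one-dimensional dual problem separately for every (s,a) in the batch, for example with a scalar optimizer. The `neg_dual_objective` below is only a placeholder, not the exact dual from the paper, and `v_next`, `rho`, and `eta_max` are illustrative names:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def solve_eta_direct(v_next, rho, eta_max):
    """Solve the scalar dual problem for a single (s, a).

    v_next  : array of next-state values V(s') sampled under the nominal model
    rho     : radius of the uncertainty set
    eta_max : upper bound on the dual variable
    """
    def neg_dual_objective(eta):
        # Placeholder: the actual dual is a maximization over eta, so in
        # practice one minimizes its negative. Substitute the paper's expression.
        return np.mean(np.maximum(eta - v_next, 0.0)) + rho * eta - eta

    res = minimize_scalar(neg_dual_objective, bounds=(0.0, eta_max), method="bounded")
    return res.x

# The g(s,a) surrogate replaces this per-sample solve with a single network
# trained to output (approximately) the optimal eta for every (s,a) at once.
```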

Zaiyan

zaiyan-x commented 7 months ago

One thing I would bring up is that you can do early stopping in the eta network training. In many deep learning algorithms with two gradient-update loops, the inner-loop approximation can be on the looser side. But of course, if you want the optimal performance of RFQI, I suggest you prolong the training of the eta network.
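A tolerance-based stopping rule is one simple way to do this; the loop below is only a sketch with placeholder names (`eta_net`, `eta_loss_fn`), not the exact code in the repo:

```python
import torch

def train_eta_early_stop(eta_net, optimizer, eta_loss_fn, batch,
                         max_steps=500, tol=1e-3):
    """Inner eta optimization with early stopping on loss improvement.

    Stops when the loss improves by less than `tol`, instead of always
    running the full `max_steps` gradient updates.
    """
    prev_loss = float("inf")
    for step in range(max_steps):
        optimizer.zero_grad()
        loss = eta_loss_fn(eta_net, batch)
        loss.backward()
        optimizer.step()

        if abs(prev_loss - loss.item()) < tol:
            break                      # accept a looser inner-loop approximation
        prev_loss = loss.item()
    return loss.item(), step
```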

Hope this helps,

Zaiyan

linhlpv commented 7 months ago

Thank you so much. I have tried early stopping. One thing I noticed is that you create a new eta network at each training iteration. I think it could be better to initialize the eta network globally, once, in the init function of the RFQI class, because we can hope that eta converges faster on a new batch. What do you think about this, and have you tried it before? A sketch of what I mean is below.
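The class and method names here are just placeholders for the actual ones in the repo:

```python
import torch
import torch.nn as nn

class RFQIAgent:
    """Sketch: create the eta network once and warm-start it every iteration."""

    def __init__(self, state_dim, action_dim, eta_lr=1e-4):
        # Global eta network, created once instead of re-initialized per iteration.
        self.eta_net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )
        self.eta_optimizer = torch.optim.Adam(self.eta_net.parameters(), lr=eta_lr)

    def train_eta(self, batch):
        ...  # inner eta optimization, warm-started from the previous iteration

    def train_q(self, batch):
        ...  # fitted-Q regression step

    def train_iteration(self, batch):
        # The eta network keeps its weights across iterations, so on a new
        # batch it should need far fewer inner gradient steps to converge.
        self.train_eta(batch)
        self.train_q(batch)
```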

Thank you again for your time. Best, Linh

linhlpv commented 7 months ago

I read your paper and have a question related to Lemma 7.

[screenshot: statement of Lemma 7 from the paper]

I was trying to derive the maximum values of y_i and T_g f, but I encountered some difficulties. Could you share your approach for bounding both quantities?

Thank you so much. Best, Linh

zaiyan-x commented 7 months ago

Dear Linh,

Global initialization of the eta network seems like a promising improvement. My implementation is strongly based on the intuition that every (s,a) needs its own (independent) inner optimization. But I think your way can potentially save a lot of time, and hopefully the training is stable enough to lead the eta network to a global optimum.

Regarding Lemma 7, this is due to the reward being bounded in [0,1]. You can see it by putting the sup norm on everything in y_i and T_g f.
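To spell out the kind of sup-norm argument I mean, using only that rewards lie in [0,1] and the discount factor is gamma (the exact constants in the paper may differ):

```latex
% Rewards in [0,1] imply a sup-norm bound on any value function:
\|V\|_\infty \;\le\; \sum_{t \ge 0} \gamma^t \cdot 1 \;=\; \frac{1}{1-\gamma}.
% Plugging this bound into the definitions of y_i and T_g f, which are built
% from r and V via expectations and maxima (neither of which increases the
% sup norm), gives the constants stated in the lemma.
```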

Thank you,

Zaiyan

linhlpv commented 7 months ago

Yub. I have trained RFQI with the global eta network, and I found that it reduces the number of iterations required for the inner optimization by 2 to 5 times. I also increased the tol parameter and reduced the learning rate of the eta network. With the smaller learning rate, and tol=5e-3 or 1e-2 instead of 1e-3, I can reduce the time per iteration from roughly 3 s to 0.1 s. I will update with the evaluation results on the robust benchmarks.

linhlpv commented 7 months ago

One thing I am curious about is the choice of offline RL algorithm underlying RFQI. Have you tried other offline RL algorithms such as CQL, IQL, SPOT, etc.?

Thank you so much for your time. Have a good day. Linh

zaiyan-x commented 7 months ago

Our algorithm has its roots in Fitted Q-Iteration, which formulates the Q-function estimation problem as a regression problem. I believe the algorithms you mentioned are not of this nature.
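For reference, a generic (discrete-action) sketch of the FQI regression step looks like the following; RFQI keeps the same regression structure but replaces the max-based target with its robust, dual-form counterpart (all names below are illustrative, not the repo's code):

```python
import torch
import torch.nn.functional as F

def fqi_update(q_net, q_target, optimizer, batch, gamma=0.99):
    """One Fitted Q-Iteration step: regress Q(s,a) onto the Bellman target."""
    s, a, r, s_next, done = batch
    with torch.no_grad():
        # Standard FQI target; RFQI swaps this for the robust dual-form target.
        target = r + gamma * (1.0 - done) * q_target(s_next).max(dim=1).values
    q_sa = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
    loss = F.mse_loss(q_sa, target)     # Q-estimation as least-squares regression
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```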

Best regards,

Zaiyan

linhlpv commented 7 months ago

Yub. Thank you for your answer. I totally agree with it. And do you think developing offline robust actor-critic methods is feasible?

Best, Linh

linhlpv commented 7 months ago

I also want to say thanks a lot for your time, answers, and suggestions. 😄

zaiyan-x commented 7 months ago

Robust actor-critic methods are very challenging to design and analyze. Since we only have data, whether offline or online, that comes from the nominal model, the gradient will always be biased with respect to the worst-case model.

I recommend our colleague's work as a starting point: Natural Actor-Critic for Robust Reinforcement Learning with Function Approximation.

You are welcome and good luck on your robust endeavor,

Zaiyan

linhlpv commented 7 months ago

Thank you for your suggestion, I'll read this paper.

Best, Linh