dljzx opened this issue 2 years ago

Thanks for your work on CQL. It works well in many environments, but in the Antmaze environments it performs badly. Can you figure out why?
Use the following hyperparameters for Antmaze:
python -m SimpleSAC.conservative_sac_main \
--env 'antmaze-medium-diverse-v2' \
--cql.cql_min_q_weight=5.0 \
--cql.cql_max_target_backup=True \
--cql.cql_target_action_gap=0.2 \
--orthogonal_init=True \
--cql.cql_lagrange=True \
--cql.cql_temp=1.0 \
--cql.policy_lr=1e-4 \
--cql.qf_lr=3e-4 \
--cql.cql_clip_diff_min=-200 \
--reward_scale=10.0 \
--reward_bias=-5.0 \
--policy_arch='256-256' \
--qf_arch='256-256-256' \
--policy_log_std_multiplier=0.0 \
--eval_period=50 \
--eval_n_trajs=100 \
--n_epochs=1200 \
--bc_epochs=40 \
--logging.output_dir './experiment_output'
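In case it helps to read the flags: `--reward_scale` and `--reward_bias` just apply an affine transform to the sparse Antmaze rewards before training, so 0/1 becomes -5/+5. A minimal sketch of that preprocessing (the dict layout follows the d4rl convention; the function name and exact field access are illustrative, not the repo's code):

```python
import numpy as np

def rescale_antmaze_rewards(dataset, reward_scale=10.0, reward_bias=-5.0):
    """Affine reward transform: r' = reward_scale * r + reward_bias.

    For the sparse 0/1 Antmaze rewards this maps failure to -5 and
    goal-reaching to +5, which keeps the Q-targets on a reasonable scale.
    """
    dataset = dict(dataset)  # shallow copy so the original dict is untouched
    dataset['rewards'] = np.asarray(dataset['rewards']) * reward_scale + reward_bias
    return dataset

# Hypothetical usage with a d4rl-style dataset:
# dataset = rescale_antmaze_rewards(d4rl.qlearning_dataset(env))
```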
Thanks for your code update. It worked. By the way, in your code behavior cloning is used for the first 40 epochs, but this trick is not mentioned in the paper. Why is BC so important in the Antmaze environments? What happens if we do not use it?
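For readers who want the gist of the BC warm-up: during the first `--bc_epochs` epochs the actor is trained to imitate the dataset actions (maximizing their log-likelihood, plus the usual entropy term) instead of maximizing Q, so the policy starts close to the behavior policy while the conservative Q-function is still being fit. A hedged sketch of that switch, with illustrative method names (`sample`, `log_prob`) rather than the repo's exact API:

```python
import torch

def actor_loss(policy, qf1, qf2, batch, log_alpha, epoch, bc_epochs=40):
    """SAC actor loss with a behavior-cloning warm-up (illustrative sketch).

    For epoch < bc_epochs the policy maximizes the log-likelihood of the
    dataset actions; afterwards it switches to the usual SAC objective of
    maximizing the (minimum) Q-value of its own sampled actions.
    """
    obs, dataset_actions = batch['observations'], batch['actions']
    new_actions, log_pi = policy.sample(obs)   # reparameterized policy sample
    alpha = log_alpha.exp().detach()           # entropy temperature

    if epoch < bc_epochs:
        # Behavior-cloning phase: ignore Q entirely and imitate the data.
        return (alpha * log_pi - policy.log_prob(obs, dataset_actions)).mean()

    # Standard phase: maximize the (minimum) Q-estimate of sampled actions.
    q_new = torch.min(qf1(obs, new_actions), qf2(obs, new_actions))
    return (alpha * log_pi - q_new).mean()
```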