young-geng / CQL

Conservative Q Learning on top of SAC
MIT License

Antmaze results #2

Open · dljzx opened this issue 2 years ago

dljzx commented 2 years ago

Thanks for your work on CQL. It works well in many environments, but it performs badly in the Antmaze environments. Can you figure out why?

young-geng commented 2 years ago

Use the following hyperparameters for Antmaze:

python -m SimpleSAC.conservative_sac_main \
    --env 'antmaze-medium-diverse-v2' \
    --cql.cql_min_q_weight=5.0 \
    --cql.cql_max_target_backup=True \
    --cql.cql_target_action_gap=0.2 \
    --orthogonal_init=True \
    --cql.cql_lagrange=True \
    --cql.cql_temp=1.0 \
    --cql.policy_lr=1e-4 \
    --cql.qf_lr=3e-4 \
    --cql.cql_clip_diff_min=-200 \
    --reward_scale=10.0 \
    --reward_bias=-5.0 \
    --policy_arch='256-256' \
    --qf_arch='256-256-256' \
    --policy_log_std_multiplier=0.0 \
    --eval_period=50 \
    --eval_n_trajs=100 \
    --n_epochs=1200 \
    --bc_epochs=40 \
    --logging.output_dir './experiment_output'
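
For context on the Lagrange flags above: with --cql.cql_lagrange=True, the conservative penalty is scaled by a learned dual multiplier that is tuned so the gap between Q-values on out-of-distribution actions and Q-values on dataset actions stays near --cql.cql_target_action_gap. Below is a minimal PyTorch-style sketch of that mechanism; the variable and function names are illustrative, not SimpleSAC's exact ones.

import torch

# Illustrative sketch of the CQL Lagrange penalty (not SimpleSAC's exact code).
# tau corresponds to --cql.cql_target_action_gap; min_q_weight to
# --cql.cql_min_q_weight. log_alpha_prime is the learned dual variable.
log_alpha_prime = torch.zeros(1, requires_grad=True)
alpha_prime_opt = torch.optim.Adam([log_alpha_prime], lr=3e-4)

def cql_lagrange_penalty(q_ood, q_data, tau=0.2, min_q_weight=5.0):
    # Conservative gap: how much the Q-function over-values
    # out-of-distribution actions relative to dataset actions.
    gap = (q_ood - q_data).mean()
    alpha_prime = log_alpha_prime.exp().clamp(0.0, 1e6)

    # Dual update: alpha_prime grows while the gap exceeds tau
    # and shrinks once the gap falls below it.
    alpha_prime_loss = -alpha_prime * min_q_weight * (gap.detach() - tau)
    alpha_prime_opt.zero_grad()
    alpha_prime_loss.backward()
    alpha_prime_opt.step()

    # Penalty added to the Q-function loss; gradients here flow
    # into the Q-network only, not into the dual variable.
    return alpha_prime.detach() * min_q_weight * (gap - tau)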
dljzx commented 2 years ago

Thanks for the code update; it worked. By the way, in your code behavior cloning is used for the first 40 epochs, but this trick is not mentioned in the paper. Why is BC so important in the Antmaze environment? What happens if we do not use it?
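
For reference, the BC warm-start the question refers to typically looks like the sketch below: for the first bc_epochs the policy is trained to imitate dataset actions (a maximum-likelihood loss plus the usual entropy term), and only afterwards switches to the SAC-style objective of maximizing the conservative Q-values. The names here are illustrative, not SimpleSAC's exact code.

import torch

def policy_loss(policy, qf, batch, epoch, bc_epochs=40, alpha=0.2):
    # Sketch of a BC warm-start (illustrative names, not SimpleSAC's exact code).
    obs, data_actions = batch["observations"], batch["actions"]
    new_actions, log_pi = policy.sample(obs)  # reparameterized sample + log-prob

    if epoch < bc_epochs:
        # Behavior cloning phase: maximize the log-likelihood of dataset
        # actions so the early policy stays close to the data distribution,
        # where the conservative Q-function is most reliable.
        log_prob = policy.log_prob(obs, data_actions)
        return (alpha * log_pi - log_prob).mean()

    # After the warm-start: standard SAC policy objective.
    return (alpha * log_pi - qf(obs, new_actions)).mean()

A common motivation for this pattern: on the sparse-reward Antmaze tasks the Q-function is mostly uninformative early in training, so anchoring the policy to the dataset first tends to stabilize the subsequent Q-maximization phase.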