[Mini-project] Improve on DQN

wilsonteng97 / AI-Planning-Decision-Making

Assignments submitted for CS4246/CS5446 AI Planning & Decision Making

[Mini-project] Improve on DQN #22

Open Derek-Hardy opened 3 years ago

Derek-Hardy commented 3 years ago

Problem:

Sparse reward in the environment
Most transitions stored in the replay buffer have no informative reward signal
DQN suffers from poor sample efficiency
Choosing random actions during exploration is inefficient

Direction:

Improve the performance of exploration
Make use of prior knowledge of the goal position (always top left)
Distance / subgoal based reward
Prioritised experience replay
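
A minimal sketch of the prioritised experience replay idea listed above, assuming proportional prioritisation with a plain list and O(n) sampling rather than a sum-tree. The class name, the `alpha`, `beta` and `eps` defaults, and the transition tuple layout are illustrative assumptions, not code from this repo.

```python
import random


class PrioritisedReplayBuffer:
    """Proportional prioritised replay (hypothetical helper, O(n) sampling)."""

    def __init__(self, capacity, alpha=0.6, eps=1e-3):
        self.capacity = capacity
        self.alpha = alpha          # how strongly priorities skew sampling
        self.eps = eps              # keeps every priority strictly positive
        self.transitions = []
        self.priorities = []
        self.next_idx = 0           # slot to overwrite once the buffer is full

    def push(self, transition):
        # New transitions get the current maximum priority so they are
        # likely to be replayed before their TD error is known.
        priority = max(self.priorities, default=1.0)
        if len(self.transitions) < self.capacity:
            self.transitions.append(transition)
            self.priorities.append(priority)
        else:
            self.transitions[self.next_idx] = transition
            self.priorities[self.next_idx] = priority
        self.next_idx = (self.next_idx + 1) % self.capacity

    def sample(self, batch_size, beta=0.4):
        n = len(self.transitions)
        scaled = [p ** self.alpha for p in self.priorities]
        total = sum(scaled)
        # Sample indices with probability proportional to priority ** alpha.
        indices = random.choices(range(n), weights=scaled, k=batch_size)
        # Importance-sampling weights correct the bias from non-uniform sampling.
        is_weights = [(n * scaled[i] / total) ** (-beta) for i in indices]
        max_weight = max(is_weights)
        is_weights = [w / max_weight for w in is_weights]
        batch = [self.transitions[i] for i in indices]
        return batch, indices, is_weights

    def update_priorities(self, indices, td_errors):
        # Larger absolute TD error means the transition is replayed more often.
        for i, err in zip(indices, td_errors):
            self.priorities[i] = abs(err) + self.eps
```

In a DQN training loop, the returned importance-sampling weights would typically multiply the per-sample loss, and `update_priorities` would be called with the new TD errors after each learning step.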

Derek-Hardy commented 3 years ago

Optimisation:

Negative distance reward shaping + AtariDQN + weighted sampling
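
Negative distance reward shaping could look roughly like the sketch below. It assumes grid positions given as `(row, col)`, Manhattan distance, and the goal at the top-left cell `(0, 0)`; the function name and the `scale` value are hypothetical and are not taken from the submitted agent.

```python
def negative_distance_reward(env_reward, agent_pos, goal_pos=(0, 0), scale=0.01):
    """Subtract a penalty proportional to the Manhattan distance to the goal.

    Hypothetical helper: `scale` is an assumed tuning constant that should stay
    small enough not to drown out the environment's own reward.
    """
    row, col = agent_pos
    goal_row, goal_col = goal_pos
    distance = abs(row - goal_row) + abs(col - goal_col)
    return env_reward - scale * distance


# Every step now carries a signal: cells nearer the goal are penalised less.
far = negative_distance_reward(0.0, (3, 4))
near = negative_distance_reward(0.0, (1, 0))
print(far, near)
```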

Testing:

Average score: 1.216

TODO:

A reasonable agent should score around 8.0
step(self, state, *args, **kwargs) in __init__.py may need improvement (e.g. MCTS)

Derek-Hardy commented 3 years ago

Optimisation:

Positive reward for moving in the right direction
Penalise moving in the wrong direction
Penalise time taken ⭐
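
The three shaping terms above can be sketched as a single function. This is an illustration under assumptions: positions are `(row, col)` pairs, the goal is the top-left cell, "right direction" means the Manhattan distance to the goal decreased, and the `bonus`, `penalty` and `step_cost` values are made up rather than the ones used for the results below.

```python
def direction_shaped_reward(env_reward, prev_pos, new_pos, goal_pos=(0, 0),
                            bonus=0.1, penalty=0.1, step_cost=0.01):
    """Hypothetical shaping: reward progress, punish regress, charge per step."""

    def distance(pos):
        return abs(pos[0] - goal_pos[0]) + abs(pos[1] - goal_pos[1])

    reward = env_reward - step_cost        # time penalty applied on every step
    progress = distance(prev_pos) - distance(new_pos)
    if progress > 0:
        reward += bonus                    # moved towards the goal
    elif progress < 0:
        reward -= penalty                  # moved away from the goal
    return reward


closer = direction_shaped_reward(0.0, (2, 2), (1, 2))
same = direction_shaped_reward(0.0, (2, 2), (2, 2))
away = direction_shaped_reward(0.0, (2, 2), (3, 2))
print(closer, same, away)
```

Because the step cost is charged even when the agent stands still, shorter paths to the goal accumulate a higher shaped return than longer ones.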

Testing:

[t2_tmax50] 300 run(s) avg rewards : 6.0
[t2_tmax40] 300 run(s) avg rewards : 5.1
Point: 5.550000000000001
Local runtime: 228.86762881278992 seconds --- fast
WARNING: do note that this might not reflect the runtime on the server.

BIG PROGRESS ❗

Derek-Hardy commented 3 years ago

Optimisation:

Adjusted the positive reward for moving in the right direction (x axis / y axis / overall)
Adjusted the penalty for moving in the wrong direction
Penalise time taken

Testing:

[t2_tmax50] 300 run(s) avg rewards : 6.7
[t2_tmax40] 300 run(s) avg rewards : 6.9
Point: 6.833333333333334
Local runtime: 261.7694010734558 seconds --- safe
WARNING: do note that this might not reflect the runtime on the server.

More improvement is needed; let's aim for >= 8.5

Derek-Hardy commented 3 years ago

Optimisation:

Increased training episodes from 2000 to 2500

Testing:

[t2_tmax50] 300 run(s) avg rewards : 7.3
[t2_tmax40] 300 run(s) avg rewards : 6.9
Point: 7.1
Local runtime: 280.41088366508484 seconds --- safe
WARNING: do note that this might not reflect the runtime on the server.

Derek-Hardy commented 3 years ago

Optimisation:

Increased training episodes from 2000 to 4000

Testing:

[t2_tmax50] 300 run(s) avg rewards : 5.4
[t2_tmax40] 300 run(s) avg rewards : 5.3
Point: 5.35
Local runtime: 256.49775671958923 seconds --- safe

❗ Over-fitting
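
One common way to guard against this kind of drop, not something this thread says was done, is to evaluate periodically during training and keep the best-scoring snapshot instead of the final one. The sketch below assumes hypothetical `train_fn(agent, n_episodes)` and `eval_fn(agent)` callables and an agent object that can be deep-copied.

```python
import copy


def train_with_best_checkpoint(agent, train_fn, eval_fn,
                               total_episodes=4000, eval_every=500):
    """Train in chunks and return the snapshot with the best evaluation score.

    `train_fn` and `eval_fn` are placeholders for the project's own training
    and testing routines; their signatures are assumptions.
    """
    best_score = float("-inf")
    best_agent = None
    best_episode = 0
    for episodes_done in range(eval_every, total_episodes + 1, eval_every):
        train_fn(agent, eval_every)        # continue training for one chunk
        score = eval_fn(agent)             # e.g. average reward over test runs
        if score > best_score:
            best_score = score
            best_agent = copy.deepcopy(agent)
            best_episode = episodes_done
    return best_agent, best_score, best_episode
```

With this in place the episode count becomes an upper bound rather than a value to tune by hand between submissions.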

wilsonteng97 commented 3 years ago

Optimisation:

Increased training episodes to 3500

Testing:

[t2_tmax50] 300 run(s) avg rewards : 7.3
[t2_tmax40] 300 run(s) avg rewards : 6.9
Point: 7.1
Local runtime: 280.41088366508484 seconds --- safe
WARNING: do note that this might not reflect the runtime on the server.

AVLE score decreased from 6.816 to 6.683 (-0.133).