yaringal / DropoutUncertaintyDemos

What My Deep Model Doesn't Know...

Positive reward with 4 walls #1

mryellow opened this issue 9 years ago (status: Open)

mryellow commented 9 years ago

Looking at a wall with 4 eyes while walking into it resulted in a positive reward:

http://mlg.eng.cam.ac.uk/yarin/blog_3d801aa532c1ce.html

That's the > 0.75 threshold for forward reward. With a few eyes missing the walls, the overall proximity doesn't drop as far as it does in most cases, and if the agent can get a little bonus for going forward at that stage, it will take it.

https://github.com/yaringal/DropoutUncertaintyDemos/blob/14fa4689bcf29e280bf3bb5c967f8bf10e530178/convnetjs/rldemo_comparison.js#L368

Generally I've found the threshold still works OK; it takes tweaking, but in the end it amounts to "this is a doorway you'll accept" vs "that's a little too risky". I'm thinking the best bet would be to remove it and punish walls harder some other way, so the forward bonus can't win out against walls when it's multiplied by those last few decimal points of the proximity being fed in.
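
For anyone skimming, here's a rough sketch of the shaping being discussed, assuming the structure around the linked line (proximity_reward is an average over the eyes in [0, 1], with 1.0 meaning no wall in sight; the exact code in rldemo_comparison.js may differ slightly):

var forward_reward = 0.0;
// the forward bonus only kicks in once the averaged proximity clears the threshold
if (this.actionix === 0 && proximity_reward > 0.75) {
    forward_reward = 0.1 * proximity_reward;
}
// e.g. with 4 eyes on a nearby wall and the rest seeing nothing, the average
// can still sit above 0.75, so the agent collects the bonus while walking into it.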

mryellow commented 9 years ago

This might work a little better, falling off quickly at the low end, instead of giving the forward reward the instant walls are considered "clear":

if(this.actionix === 0 && proximity_reward > 0.2) forward_reward = 0.1 * Math.sqrt(proximity_reward-0.2);

edit: Actually it probably behaves better the other way around; the sqrt version will squeeze through some pretty small gaps though:

if(this.actionix === 0) forward_reward = 0.1 * Math.pow(proximity_reward, 2);
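
To get a feel for the difference, here's a quick check using the two expressions above (the 0.2 cutoff and 0.1 scale are the ones from the snippets):

// compare the two shapings at a few proximity values
[0.25, 0.5, 0.75, 1.0].forEach(function(p) {
    var sqrtBonus = p > 0.2 ? 0.1 * Math.sqrt(p - 0.2) : 0.0;
    var powBonus = 0.1 * Math.pow(p, 2);
    console.log(p, sqrtBonus.toFixed(3), powBonus.toFixed(3));
});
// sqrt still pays a decent bonus at middling proximity (hence squeezing through
// small gaps), while the squared version shrinks quickly as walls get close.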

mryellow commented 9 years ago

I'm finding generally that dropout (regardless of whether uncertainty is implemented or not) will become obsessed with any conditional reward which jumps up or down out of nowhere.

For instance, halving the forward reward for forward turns:

if (this.actionix === 0 || this.actionix === 1 || this.actionix === 2) {
    forward_reward = 0.1; // some base forward reward ("whatever number")
    if (this.actionix === 1 || this.actionix === 2) {
        // halve the bonus for the forward-turn actions
        forward_reward = forward_reward / 2;
    }
}

With dropout, the agent will find itself hard up against a wall, looking along it, exploiting what it can from the halved forward reward. Smoothly distributed rewards, on the other hand, get exploited without so much unexpected behavior.
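
One way to make that concrete (a sketch building on the squared shaping above, not code from the repo; the 0.05 turn scale is just illustrative):

// no hard threshold: the bonus scales continuously with proximity,
// and forward turns simply get a smaller scale rather than a separate rule
var scale = 0.0;
if (this.actionix === 0) scale = 0.1;                               // straight forward
else if (this.actionix === 1 || this.actionix === 2) scale = 0.05;  // forward turns
var forward_reward = scale * Math.pow(proximity_reward, 2);
// the bonus decays towards zero as walls close in, so there's no
// plateau worth camping on while pressed against a wall.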