yandexdataschool / Practical_RL

A course in reinforcement learning in the wild
The Unlicense
5.92k stars 1.7k forks source link

Fix Value Targets formula and description in A2C task in week6 #476

Closed q0o0p closed 3 years ago

q0o0p commented 3 years ago

Some time ago, in PR #405 formula for Value Targets in A2C task in week6 was broken.

Why it happened? Because actually there is some ambiguity in definitions of T.

1. T is partial trajectory while partial trajectory is trajectory input argument of ComputeValueTargets.__call__.

2. T is still partial trajectory while partial trajectory for given moment t is part of input trajectory from moment t to the end (i.e. it's different at each iteration in the algorithm).

And for different definitions different variants of formula can be created. But all these variants would be equivalent. Then how we can consider that formula is broken? E.g. because it doesn't pass sanity check: valid sequence of gamma powers.

So, the original formula was for definition (2). And PR #405 looks like an attempt to implement (1).

So, let's revert it to previous version (2), i.e. change upper index of summation T-1-t back to T-1.

What's also important, apart from formula fix, is update description for this formula. Because currently, as per student chat conversation, some students are confused because they don't know how to parse this phrase:

"... list of interactions with the environment of specified length T — the size of partial trajectory."

Either:

"... list of interactions with the (environment of specified length T) — the size of partial trajectory."

Or:

"... (list of interactions with the environment) of specified length T — the size of partial trajectory."

So, to make it less ambiguous, let's change it to:

"... list of interactions with the environment. This list has length T that is the size of partial trajectory. Partial trajectory for given moment "t" is part of "ComputeValueTargets.call" input argument "trajectory" from moment "t" to the end (i.e. different at each iteration in the algorithm)."

review-notebook-app[bot] commented 3 years ago

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB