Problem it solves: "internal covariate shift": the distribution of each layer's activations changes during training because the parameters of the earlier layers keep being updated.
For the activation layer you want to normalize (this is before the nonlinear activation function), normalize the layer H into H' = (H - mu) / sigma, where mu and sigma are the mean and standard deviation of this layer computed over the minibatch. A learnable scale and shift gamma * H' + beta is then applied, so the layer can still recover the identity transform if that is optimal.
The resulting network is still differentiable (BN essentially inserts a differentiable affine transformation layer).
During inference, use population estimates of the mean and standard deviation (e.g. running averages collected during training), so there is no stochasticity at inference time.
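A minimal NumPy sketch of the forward pass (my own illustration, not the paper's code), assuming 2-D activations of shape (batch, features); momentum-style running averages are one common way to estimate the population statistics:

```python
import numpy as np

def batch_norm(H, gamma, beta, running_mu, running_var,
               training=True, momentum=0.1, eps=1e-5):
    """Normalize a (batch, features) activation matrix H."""
    if training:
        mu = H.mean(axis=0)                 # per-feature minibatch mean
        var = H.var(axis=0)                 # per-feature minibatch variance
        # keep running estimates of the population statistics for inference
        running_mu[:] = (1 - momentum) * running_mu + momentum * mu
        running_var[:] = (1 - momentum) * running_var + momentum * var
    else:
        mu, var = running_mu, running_var   # fixed stats: no stochasticity
    H_hat = (H - mu) / np.sqrt(var + eps)
    return gamma * H_hat + beta             # learnable scale and shift
```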
Main benefit is accelerated training: you can now use a higher learning rate (no gradient explosion) and saturating activation functions like sigmoid. BN also acts as a regularizer: each example is forced to be seen together with other randomly selected examples in its minibatch. It can replace dropout to some extent.
BN mainly enables larger learning rates, which implicitly regularize (a larger learning rate increases the noise in each SGD step) and hence improve generalization.
An unnormalized network can have exploding activations and hence large (heavy-tailed) gradients in deeper layers, so it is forced to use a small learning rate.
Also shows that while the exploding activations are input-dependent, they are largely a natural consequence of random initialization, analyzed with ideas from random matrix theory.
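A quick way to see the initialization effect (a toy sketch of my own, not the paper's experiment): with a Gaussian init whose gain is slightly above the ReLU-stable value, activation norms grow geometrically with depth, before any data-dependent effect enters:

```python
import numpy as np

rng = np.random.default_rng(0)
depth, width = 50, 256
h = rng.standard_normal(width)

for layer in range(1, depth + 1):
    # gain 1.5 is slightly above the ReLU-stable sqrt(2) (He init),
    # so the expected squared norm grows by a constant factor per layer
    W = rng.standard_normal((width, width)) * (1.5 / np.sqrt(width))
    h = np.maximum(W @ h, 0.0)  # ReLU
    if layer % 10 == 0:
        print(f"layer {layer:2d}: ||h|| = {np.linalg.norm(h):.3e}")
# a BN layer after each matmul would rescale h every step,
# removing this geometric growth regardless of the init scale
```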
Shows that careful experiment design and good visualization are important, although I personally find the heatmaps confusing...
DQN: the first deep RL algorithm I'm attempting to reproduce (update: my code for DQN can be found here, though without the target network)
Preprocessing: input (210x160 RGB) -> grayscale -> downsample to 110x84 -> crop to 84x84 -> stack last 4 frames
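A sketch of that pipeline (assuming OpenCV for the image ops; the exact crop offset for the playing area is my guess, the paper only says it crops an 84x84 region):

```python
import numpy as np
import cv2

def preprocess(rgb_frame):
    """210x160x3 RGB Atari frame -> 84x84 grayscale image."""
    gray = cv2.cvtColor(rgb_frame, cv2.COLOR_RGB2GRAY)  # 210x160
    small = cv2.resize(gray, (84, 110))                 # cv2 dsize is (width, height)
    return small[18:102, :]                             # 84x84 playing area (offset is a guess)

def phi(last4):
    """Stack the last 4 preprocessed frames into the network input."""
    return np.stack(last4, axis=0)                      # shape (4, 84, 84)
```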
Experience replay: experience e_t = (phi_t, a_t, r_t, phi_{t+1}), where phi_t is the preprocessed s_t; the replay buffer D = {e_1, e_2, ..., e_N} keeps the last N experiences.
SGD over minibatches sampled uniformly from the replay buffer; this decorrelates the data.
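A minimal sketch of such a buffer (class and method names are mine):

```python
import random
from collections import deque

class ReplayBuffer:
    """Keeps the last N experiences; old ones drop off automatically."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def push(self, phi_t, a_t, r_t, phi_next, done):
        self.buffer.append((phi_t, a_t, r_t, phi_next, done))

    def sample(self, batch_size):
        # uniform sampling breaks the temporal correlation
        # between consecutive frames of the same episode
        return random.sample(self.buffer, batch_size)
```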
Assumes an episodic setting with discounting (see the Nature paper for parameter values)
Behaviour policy = epsilon-greedy with decaying epsilon (see the screenshot below)
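A sketch of the annealing (the 1.0 -> 0.1 schedule over the first million frames is from the paper; the function name is mine):

```python
import random
import numpy as np

def epsilon_greedy(q_values, step,
                   eps_start=1.0, eps_end=0.1, anneal_steps=1_000_000):
    # linearly anneal epsilon from 1.0 to 0.1 over the first million frames,
    # then keep it fixed at 0.1 (values from the paper)
    eps = max(eps_end, eps_start - (eps_start - eps_end) * step / anneal_steps)
    if random.random() < eps:
        return random.randrange(len(q_values))   # explore: random action
    return int(np.argmax(q_values))              # exploit: greedy action
```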
Frame skipping: the agent selects an action only every k = 4 frames, repeating it on the skipped frames
Reward clipping: all positive rewards clipped to 1 and all negative rewards to -1 (0 left unchanged)
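Both tricks in one place (a sketch assuming the classic gym-style step API; summing the skipped-frame rewards before clipping is the common wrapper behavior, a simplification of the paper's per-step clipping):

```python
import numpy as np

def skip_and_clip_step(env, action, k=4):
    """Repeat the action for k frames, sum the rewards, clip to {-1, 0, +1}."""
    total_reward, done, frame = 0.0, False, None
    for _ in range(k):
        frame, reward, done, _ = env.step(action)  # classic gym API assumed
        total_reward += reward
        if done:
            break
    return frame, float(np.sign(total_reward)), done
```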
Hyperparameters kept constant across games
Algorithm:
Network architecture and training details:
A more stable metric during training: the average predicted action-value Q on a fixed held-out set of states (per-episode reward is much noisier):
The Nature paper explains the target network trick (the target Q is computed by a frozen copy of the network that is only updated every C steps); it also has a more complete list of hyperparameters.
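A PyTorch sketch of that update (the toy network shapes are stand-ins; the real DQN is a conv net over the stacked frames, with C = 10,000 and RMSProp lr = 2.5e-4 in the Nature paper):

```python
import copy
import torch
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))  # toy stand-in
target_net = copy.deepcopy(q_net)                                     # frozen copy
optimizer = torch.optim.RMSprop(q_net.parameters(), lr=2.5e-4)
gamma, C = 0.99, 10_000

def train_step(phi, a, r, phi_next, done, step):
    """One SGD step on a minibatch sampled from the replay buffer."""
    with torch.no_grad():  # target Q comes from the frozen network
        y = r + gamma * target_net(phi_next).max(dim=1).values * (1 - done)
    q = q_net(phi).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = nn.functional.smooth_l1_loss(q, y)  # Huber, matching the error clipping
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % C == 0:  # refresh the frozen copy every C steps
        target_net.load_state_dict(q_net.state_dict())
```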