Problem it solves: "internal covariate shift": the distribution of each layer's activations changes during training because the parameters of the earlier layers keep being updated.
For the activation layer you want to normalize (this is before the nonlinear activation function), normalize the layer H into H' = (H - mu) / sigma, where mu and sigma are the mean and standard deviation of this layer computed over the minibatch. A learnable scale and shift gamma * H' + beta is then applied, so the layer can still recover the identity transform if that is optimal.
The resulting network is still differentiable (BN essentially inserts a differentiable affine transformation layer).
During inference, use population estimates of the mean and standard deviation (e.g. running averages collected during training), so there is no stochasticity at inference time.
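A minimal NumPy sketch of the forward pass (my own illustration, not the paper's code), assuming 2-D activations of shape (batch, features); momentum-style running averages are one common way to estimate the population statistics:

```python
import numpy as np

def batch_norm(H, gamma, beta, running_mu, running_var,
               training=True, momentum=0.1, eps=1e-5):
    """Normalize a (batch, features) activation matrix H."""
    if training:
        mu = H.mean(axis=0)                 # per-feature minibatch mean
        var = H.var(axis=0)                 # per-feature minibatch variance
        # keep running estimates of the population statistics for inference
        running_mu[:] = (1 - momentum) * running_mu + momentum * mu
        running_var[:] = (1 - momentum) * running_var + momentum * var
    else:
        mu, var = running_mu, running_var   # fixed stats: no stochasticity
    H_hat = (H - mu) / np.sqrt(var + eps)
    return gamma * H_hat + beta             # learnable scale and shift
```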
Main benefit is accelerated training: you can now use a higher learning rate (no gradient explosion) and saturating activation functions like sigmoid. BN also acts as a regularizer: each example is forced to be seen together with other randomly selected examples in its minibatch. It can replace dropout to some extent.
BN mainly enables larger learning rates, which implicitly regularize (a larger learning rate increases the noise in each SGD step) and hence improve generalization.
An unnormalized network can have exploding activations and hence large (heavy-tailed) gradients in deeper layers, so it is forced to use a small learning rate.
Also shows that while the exploding activations are input-dependent, they are largely a natural consequence of random initialization, analyzed with ideas from random matrix theory.
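A quick way to see the initialization effect (a toy sketch of my own, not the paper's experiment): with a Gaussian init whose gain is slightly above the ReLU-stable value, activation norms grow geometrically with depth, before any data-dependent effect enters:

```python
import numpy as np

rng = np.random.default_rng(0)
depth, width = 50, 256
h = rng.standard_normal(width)

for layer in range(1, depth + 1):
    # gain 1.5 is slightly above the ReLU-stable sqrt(2) (He init),
    # so the expected squared norm grows by a constant factor per layer
    W = rng.standard_normal((width, width)) * (1.5 / np.sqrt(width))
    h = np.maximum(W @ h, 0.0)  # ReLU
    if layer % 10 == 0:
        print(f"layer {layer:2d}: ||h|| = {np.linalg.norm(h):.3e}")
# a BN layer after each matmul would rescale h every step,
# removing this geometric growth regardless of the init scale
```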
Shows that careful experiment design and good visualization are important, although I personally find the heatmaps confusing...
DQN: the first deep RL algorithm I'm attempting to reproduce (update: my code for DQN can be found here, though without the target network)
Preprocessing: input (210x160 RGB) -> grayscale -> downsample to 110x84 -> crop to 84x84 -> stack last 4 frames
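A sketch of that pipeline (assuming OpenCV for the image ops; the exact crop offset for the playing area is my guess, the paper only says it crops an 84x84 region):

```python
import numpy as np
import cv2

def preprocess(rgb_frame):
    """210x160x3 RGB Atari frame -> 84x84 grayscale image."""
    gray = cv2.cvtColor(rgb_frame, cv2.COLOR_RGB2GRAY)  # 210x160
    small = cv2.resize(gray, (84, 110))                 # cv2 dsize is (width, height)
    return small[18:102, :]                             # 84x84 playing area (offset is a guess)

def phi(last4):
    """Stack the last 4 preprocessed frames into the network input."""
    return np.stack(last4, axis=0)                      # shape (4, 84, 84)
```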
Experience replay: experience e_t = (phi_t, a_t, r_t, phi_{t+1}), where phi_t is the preprocessed s_t; the replay buffer D = {e_1, e_2, ..., e_N} keeps the last N experiences.
SGD over minibatches sampled uniformly from the replay buffer; this decorrelates the data.
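A minimal sketch of such a buffer (class and method names are mine):

```python
import random
from collections import deque

class ReplayBuffer:
    """Keeps the last N experiences; old ones drop off automatically."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def push(self, phi_t, a_t, r_t, phi_next, done):
        self.buffer.append((phi_t, a_t, r_t, phi_next, done))

    def sample(self, batch_size):
        # uniform sampling breaks the temporal correlation
        # between consecutive frames of the same episode
        return random.sample(self.buffer, batch_size)
```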
Assumes an episodic setting with discounting (see the Nature paper for parameter values)
Behaviour policy = epsilon-greedy with decaying epsilon (see the screenshot below)
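A sketch of the annealing (the 1.0 -> 0.1 schedule over the first million frames is from the paper; the function name is mine):

```python
import random
import numpy as np

def epsilon_greedy(q_values, step,
                   eps_start=1.0, eps_end=0.1, anneal_steps=1_000_000):
    # linearly anneal epsilon from 1.0 to 0.1 over the first million frames,
    # then keep it fixed at 0.1 (values from the paper)
    eps = max(eps_end, eps_start - (eps_start - eps_end) * step / anneal_steps)
    if random.random() < eps:
        return random.randrange(len(q_values))   # explore: random action
    return int(np.argmax(q_values))              # exploit: greedy action
```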
Frame skipping: the agent selects an action only every k = 4 frames, repeating it on the skipped frames
Reward clipping: all positive rewards clipped to 1 and all negative rewards to -1 (0 left unchanged)
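Both tricks in one place (a sketch assuming the classic gym-style step API; summing the skipped-frame rewards before clipping is the common wrapper behavior, a simplification of the paper's per-step clipping):

```python
import numpy as np

def skip_and_clip_step(env, action, k=4):
    """Repeat the action for k frames, sum the rewards, clip to {-1, 0, +1}."""
    total_reward, done, frame = 0.0, False, None
    for _ in range(k):
        frame, reward, done, _ = env.step(action)  # classic gym API assumed
        total_reward += reward
        if done:
            break
    return frame, float(np.sign(total_reward)), done
```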
Hyperparameters kept constant across games
Algorithm:
Network architecture and training details:
A more stable metric during training: the average predicted action-value Q on a fixed held-out set of states (per-episode reward is much noisier):
The Nature paper explains the target network trick (the target Q is computed by a frozen copy of the network that is only updated every C steps); it also has a more complete list of hyperparameters.
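A PyTorch sketch of that update (the toy network shapes are stand-ins; the real DQN is a conv net over the stacked frames, with C = 10,000 and RMSProp lr = 2.5e-4 in the Nature paper):

```python
import copy
import torch
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))  # toy stand-in
target_net = copy.deepcopy(q_net)                                     # frozen copy
optimizer = torch.optim.RMSprop(q_net.parameters(), lr=2.5e-4)
gamma, C = 0.99, 10_000

def train_step(phi, a, r, phi_next, done, step):
    """One SGD step on a minibatch sampled from the replay buffer."""
    with torch.no_grad():  # target Q comes from the frozen network
        y = r + gamma * target_net(phi_next).max(dim=1).values * (1 - done)
    q = q_net(phi).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = nn.functional.smooth_l1_loss(q, y)  # Huber, matching the error clipping
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % C == 0:  # refresh the frozen copy every C steps
        target_net.load_state_dict(q_net.state_dict())
```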