woutervanheeswijk / example_discrete_control


Epsilon greedy bandit #2

Open daniel-xion opened 1 year ago

daniel-xion commented 1 year ago

I wonder where I should amend the code to correctly include an epsilon-greedy agent for the multi-armed bandit. Below is the code I created, but I am not sure whether it works correctly. For some bandit distributions, the agent sometimes keeps choosing the same wrong arm for more than 5000 episodes (with fewer than 100 arms in total).

In the code below, load_bandit is the saved network trained in previous trials; I have changed its input to have the same size as the number of bandits. pred_p is a prior probability estimate of the bandit distribution, and bandit_ground is the ground truth of the bandit reward distribution.

For example, bandit_ground = [0, 1, 0, 0] whereas pred_p = [0.1, 0.5, 0.2, 0.2].

```python
e = 0.001

for i in range(episodes):
    with tf.GradientTape() as tape:
        action_probabilities = load_bandit(tf.constant([pred_p], dtype=tf.float32))

        # Select random action based on probabilities and exploration
        if np.random.rand(1) < e:
            action = np.random.choice(len(bandits), p=np.squeeze(action_probabilities))
        else:
            action = np.random.choice(len(bandits), p=pred_p)

        # Obtain reward
        reward = get_reward(bandit_ground[action])

        # Store probability of selected action
        probability_action = action_probabilities[0, action]

        # Compute loss
        loss_value = cross_entropy_loss(probability_action, reward)

        # Compute the gradients
        grads = tape.gradient(loss_value[0], load_bandit.trainable_variables)

        # Update network weights using the above gradient
        opt.apply_gradients(zip(grads, load_bandit.trainable_variables))
```

The new neural network is defined below:

```python
def construct_actor_network(bandits):
    model = Sequential()
    model.add(Dense(128, input_dim=250, activation='relu'))
    model.add(Dense(64, activation='relu'))
    model.add(Dense(32, activation='relu'))
    model.add(Dense(16, activation='relu'))
    model.add(Dense(250, activation='softmax'))

    model.compile(optimizer=opt, loss='binary_crossentropy')
    return model
```

woutervanheeswijk commented 1 year ago

Hi Daniel,

Not sure whether this is really an issue with the code or the inherent behavior of the policy gradient method. Based on your description, I'm inclined to believe the latter: with a large number of arms, the algorithm may indeed stall at suboptimal solutions for longer periods of time. Personally, I could not reproduce the issue, though. For me, bandits with 100 or 200 arms (with a single positive payout) converge just fine with the original code.

Although I understand the reasoning behind the proposed code update, merging an epsilon-greedy mechanism with the exploration mechanism already ingrained in policy gradients is not ideal for a textbook example. Your solution may well lead to faster convergence, but I fear it would obscure the educational purpose of the code.
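For reference, a plain epsilon-greedy selection step is usually structured as in the sketch below. This is illustrative only, not code from this repository, and the function name is just for the example: with probability epsilon the agent picks a uniformly random arm, otherwise it samples from the policy's output distribution.

```python
import numpy as np

def epsilon_greedy_action(action_probabilities, epsilon):
    """With probability epsilon pick a uniformly random arm (exploration);
    otherwise sample an arm from the policy's output distribution."""
    num_arms = len(action_probabilities)
    if np.random.rand() < epsilon:
        return np.random.randint(num_arms)  # uniform exploration
    return np.random.choice(num_arms, p=action_probabilities)  # follow the policy

# Example with a four-armed bandit whose policy already favors arm 1
action = epsilon_greedy_action([0.1, 0.5, 0.2, 0.2], epsilon=0.05)
```

In the original example, sampling actions from the softmax output already provides the exploration, which is why I would keep the two mechanisms separate in the textbook code.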

Kind regards,

Wouter