Open daniel-xion opened 1 year ago
Hi Daniel,
Not sure this is really an issue with the code or just inherent behavior of the policy gradient method; based on your description, I'm inclined to believe the latter. With a large number of arms, the algorithm may indeed stall at suboptimal solutions for longer periods of time. Personally, though, I could not reproduce the issue: for me, bandits with 100 or 200 arms (with a single positive payout) converge just fine with the original code.
Although I understand the reasoning behind the proposed code update, merging an epsilon-greedy mechanism with the exploration mechanism ingrained in policy gradients is not ideal for a textbook example. Your solution may well lead to faster convergence, but I fear it would obscure the educational purpose of the code.
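The "ingrained" exploration referred to here can be illustrated with a minimal softmax policy-gradient (REINFORCE) loop. This is a sketch under assumed hyperparameters (100 arms, learning rate 0.1, arm index 37 as the single rewarded arm), not the book's actual code:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(theta):
    z = np.exp(theta - theta.max())  # subtract max for numerical stability
    return z / z.sum()

# Hypothetical bandit with a single positive payout, as described above.
payout = np.zeros(100)
payout[37] = 1.0

theta = np.zeros(100)  # softmax-policy logits, one per arm
lr = 0.1
for _ in range(20000):
    p = softmax(theta)
    a = rng.choice(len(p), p=p)  # stochastic sampling IS the exploration
    r = payout[a]
    grad = -p                    # d/dtheta log p(a) = onehot(a) - p
    grad[a] += 1.0
    theta += lr * r * grad       # REINFORCE update
```

Because actions are sampled from the softmax rather than chosen greedily, every arm keeps a nonzero selection probability until the policy has genuinely concentrated on the rewarded arm, so no separate epsilon schedule is needed.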
Kind regards,
Wouter
I wonder where I should amend the code to correctly include an epsilon-greedy agent for the multi-armed bandit. This is the code I created, but I am not sure whether it works correctly. For some bandit distributions, the agent sometimes keeps choosing the same wrong arm for more than 5000 episodes (with fewer than 100 arms in total).
In the code below, `load_bandit` is the saved network trained in previous trials; I have changed its input to have the same size as the number of bandits. `pred_p` is a prior probability estimate of the bandit distribution, and `bandit_ground` is the ground truth of the bandit reward distribution.
For example, bandit_ground = [0, 1, 0, 0] whereas pred_p = [0.1, 0.5, 0.2, 0.2].
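Epsilon-greedy selection over such a probability estimate might be sketched as follows. This is a minimal illustration using the `pred_p` and `bandit_ground` values from the example above, not the issue author's actual code:

```python
import random

def epsilon_greedy_action(pred_p, epsilon):
    """With probability epsilon pick a uniformly random arm;
    otherwise pick the arm with the highest predicted probability."""
    if random.random() < epsilon:
        return random.randrange(len(pred_p))
    return max(range(len(pred_p)), key=lambda i: pred_p[i])

pred_p = [0.1, 0.5, 0.2, 0.2]     # prior probability estimate
bandit_ground = [0, 1, 0, 0]      # ground-truth payouts

action = epsilon_greedy_action(pred_p, 0.1)
reward = bandit_ground[action]
```

With `epsilon = 0` this always exploits (arm 1 here, the argmax of `pred_p`); with `epsilon = 1` it is pure random exploration.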
```
e = 0.001
```
The new neural network is defined as below:
```
def construct_actor_network(bandits):
```
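The body of `construct_actor_network` is not shown in the issue. As a rough sketch of what a softmax actor over `bandits` arms could look like, here is a hypothetical stand-in written in plain NumPy rather than the deep-learning framework the issue author presumably used:

```python
import numpy as np

def construct_actor_network(bandits):
    """Hypothetical stand-in: return (logits, policy_fn) where the
    policy maps one logit per arm to action probabilities via softmax."""
    theta = np.zeros(bandits)  # one logit per arm, uniform at init

    def policy(theta):
        z = np.exp(theta - theta.max())  # stabilized softmax
        return z / z.sum()

    return theta, policy

theta, policy = construct_actor_network(4)
probs = policy(theta)  # uniform at initialization
```

Whatever the real network looks like, the key property is the softmax output layer: it guarantees a valid probability distribution over arms to sample from, which is where the policy gradient method's exploration comes from.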