Closed TimD3 closed 2 years ago
Hi!
Thanks for using our code! This is an interesting finding but does not mean there is a bug (per se) for two reasons:
First, while a larger beam size gives better performance on average, there is no theoretical guarantee. With a larger beam size, you may include a partial solution which seems better at first, such that its descendants push other partial solutions off the beam that would otherwise have been extended to completion. This may result in better overall solutions not being found with a larger beam size. You can debug/verify this yourself: find an example where a larger beam size yields worse performance and track the better solution from the smaller beam in the bigger beam: when does it fall off the beam? Verify that it indeed has a worse score at that point.
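To make this concrete, here is a tiny hand-made search tree (invented for illustration, not taken from the repo) where a beam of width 2 finds a better final solution than a beam of width 3. The wider beam admits prefix `C`, whose high-scoring children then push the modest prefix `A1` (which has an excellent completion) off the beam:

```python
# Toy example showing beam search is not monotone in beam width:
# a wider beam can keep high-scoring prefixes whose completions are
# poor, crowding out a modest prefix with an excellent completion.
# The tree and scores below are made up for illustration.

TREE = {
    "root": {"A": 3, "B": 2, "C": 1},
    "A": {"A1": 1},
    "B": {"B1": 1},
    "C": {"C1": 10, "C2": 9, "C3": 8},
    "A1": {"A1x": 20},
    "B1": {"B1x": 0},
    "C1": {"C1x": 1},
    "C2": {"C2x": 1},
    "C3": {"C3x": 1},
}

def beam_search(width, depth=3):
    # Each hypothesis is (cumulative score, node); higher score is better.
    beam = [(0, "root")]
    for _ in range(depth):
        candidates = [
            (score + s, child)
            for score, node in beam
            for child, s in TREE.get(node, {}).items()
        ]
        # Keep only the top-`width` hypotheses by cumulative score.
        beam = sorted(candidates, reverse=True)[:width]
    return max(score for score, _ in beam)

print(beam_search(width=2))  # 24: keeps prefix A, whose completion adds +20
print(beam_search(width=3))  # 12: C's children crowd out A1 at depth 2
```

Note that every step of the width-3 search is locally optimal with respect to the score; the worse result comes purely from good prefixes falling off the beam.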
Second, the beam is selected based on the model score, whereas performance is evaluated using route length. There is a mismatch, which should be small if the model is trained well. However, if the model is tested in a setting it was not trained for (e.g. a model trained on 50 points is tested on 1000 points), the mismatch is more likely to be large, and the model may select 'bad' partial solutions thinking they are good/promising. This may be more prominent with a larger beam size, where there are more opportunities for the model to be 'confused' (i.e. pick an outlier which is bad but considered good by the model). You can verify this too by checking the correlation between the model score and the cost of the final prediction.
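As a sketch of that last check (variable names and numbers are placeholders, not from the repo's `eval.py`), you could correlate the model's log-likelihood of each final solution with its actual route length; a well-calibrated model should show a strong negative correlation (higher score, shorter route):

```python
# Hypothetical sanity check: Pearson correlation between the model's
# log-likelihood of each final solution and its route length.
# The data arrays are placeholders; fill them from your own eval run.

def pearson(xs, ys):
    # Plain-Python Pearson correlation coefficient.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

log_likelihoods = [-10.2, -11.5, -9.8, -12.0, -10.9]  # placeholder values
route_lengths   = [15.1, 16.4, 14.9, 17.2, 15.8]      # placeholder values

# For a well-trained model this should be strongly negative.
print(pearson(log_likelihoods, route_lengths))
```

If the correlation weakens or flips in the out-of-distribution setting, that supports the mismatch explanation above.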
I am no expert, but I think similar results are seen when using beam search in NLP: bigger beam sizes give worse results, and the 'search error' (from a small beam size) is actually used as a feature. See for example *If beam search is the answer, what was the question?*.
Ah yes, you are right. My mistake was related to your first point. I thought a beam size of 16, for instance, must include all results from a beam size of 8 and only add more on top. I didn't see how they could be pushed out again later, because I was only thinking one expansion ahead, so to speak.
Thanks for the super fast response and also thanks for the code base!
There might be a bug in the beam search. If my understanding is correct, increasing the beam size should never worsen the result. When testing on very large problems (CVRP1000) with beam sizes up to 4096, I'm experiencing some inconsistent results. Up to about a beam size of 16 results get better, and then they start to fluctuate a bit with a tendency to get worse again. A beam size of 4096 then yields the same result as a beam size of two.
I'm running something like this:

```
srun python eval.py \
    data/vrp/vrp_uniform_1000.pkl \
    -f \
    --no_progress_bar \
    --decode_strategy bs \
    --eval_batch_size 1 \
    --width 1 2 4 8 16 32 64 128 256 512 1024 2048 4096 \
    --model pretrained/cvrp_50/epoch-99.pt \
    --max_calc_batch_size 10000 \
    --softmax_temperature 1
```
Is this known? Anything I should try?