wouterkool / attention-learn-to-route

Attention based model for learning to solve different routing problems
MIT License

Regd Decoder #30

Closed rlanka4 closed 3 years ago

rlanka4 commented 3 years ago

Hi WouterKool,

I would like to get some insight into the decoder implementation, as it appears to differ from the description in the paper.

Referring to this part in the code: https://github.com/wouterkool/attention-learn-to-route/blob/master/nets/attention_model.py#L451

From your paper, I understand that the key and context query are used to get the probabilities for each node. In the code, I see that you are using the head (that is attention based convex combination of values of each node) and logit_K.

How does this differ from the description in the paper? What is the function of logit_K?

Thanks, -Ravi

wouterkool commented 3 years ago

Hi!

I'm sorry this was unclear, but I'll try to explain!

logit_K refers to k_j in eq. 7 in the paper. It should read as 'the value of k for computing the logit', where the logit is u_{(c)j} in eq. 7. This is the per-node key you are referring to.

The glimpse variable is the updated context embedding, referred to in the paper as h^{(N+1)}_{(c)} (in the paragraph below eq. 6). In eq. 7, the query used is the context query q_{(c)}, which (I agree this is somewhat implicit) should be computed using eq. 5 as q = W^Q h, thus projecting the glimpse (= h). This projection is, however, omitted in the code: the glimpse is already the output of a linear projection (see L469, project_out), and stacking two linear transformations without an activation in between has no effect (it would matter if the glimpse were also used for something else, but that is not the case).
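To illustrate the point about stacking linear transformations: composing two linear maps without an activation is itself a single linear map, so the extra W^Q projection would add no expressive power. A minimal PyTorch sketch (dimensions and variable names here are illustrative, not the repo's):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d = 8  # illustrative embedding size, not the paper's 128

# Two stacked linear maps with no activation in between, as in the decoder:
# the glimpse is the output of project_out, and eq. 5 would apply W^Q on top.
project_out = nn.Linear(d, d, bias=False)
W_q = nn.Linear(d, d, bias=False)

# A single linear map with the composed weight matrix W^Q @ W_out.
combined = nn.Linear(d, d, bias=False)
with torch.no_grad():
    combined.weight.copy_(W_q.weight @ project_out.weight)

x = torch.randn(3, d)
# The stacked maps and the single composed map give identical outputs.
assert torch.allclose(W_q(project_out(x)), combined(x), atol=1e-5)
```

Since the composed map can represent any product of the two weight matrices, omitting the explicit W^Q loses nothing.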

I agree this is somewhat vague but I hope this is helpful!

rlanka4 commented 3 years ago

That certainly clarifies my question. Sorry, my understanding of the glimpse was a bit shallow earlier. Thanks for clarifying.

rlanka4 commented 3 years ago

WouterKool, let me just ask you this, and hopefully you'll have the time to answer. What is the intuition behind the glimpse, and why does it work so well compared to the standard pointer mechanism? The original paper is a bit too vague, so I'm trying to see if you have a better intuition regarding this. In fact, I see that you are using two different keys computed from the node embeddings (glimpse_K and logit_K). Thanks!

wouterkool commented 3 years ago

Hi!

I think the glimpse was introduced by Bello et al. (Neural Combinatorial Optimization with RL). In my opinion, it can be seen as an extra hidden layer in the decoder. By 'glimpsing' at the remaining node embeddings (since we use the mask to mask out nodes already visited), the decoder can check which options are available. It can then use this information to build the final query, which is used in an attention mechanism over the remaining nodes to determine the final probabilities. Of course, I am not sure whether this is what actually happens, but it may give some intuition.

As the node embeddings are not updated in the 'glimpse layer', we can precompute the keys for the attention mechanisms, both for the glimpse and for the final logits in every step.
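Since the node embeddings are fixed after encoding, the per-node key projections can be applied once before decoding starts; only the context query changes per step. A small sketch (projection names are illustrative, not the repo's):

```python
import torch
import torch.nn as nn

d, n = 8, 5  # illustrative embedding size and instance size

# Separate key projections for the glimpse and for the final logits
# (illustrative names; the repo precomputes these in a similar spirit).
W_glimpse_k = nn.Linear(d, d, bias=False)
W_logit_k = nn.Linear(d, d, bias=False)

node_emb = torch.randn(n, d)  # encoder output, fixed during decoding

# Keys depend only on the fixed node embeddings, so compute them once...
glimpse_K = W_glimpse_k(node_emb)
logit_K = W_logit_k(node_emb)
# ...and reuse them at every decoding step instead of reprojecting n nodes
# each step; per-step work then involves only the changing context query.
```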

I hope this is helpful!

rlanka4 commented 3 years ago

Great. Thanks for the insight. I would definitely buy that :)