zhiyuanyou / UniAD

[NeurIPS 2022 Spotlight] A Unified Model for Multi-class Anomaly Detection
Apache License 2.0

Intuition or Precise basis of the LQD design #17

Closed tae-mo closed 1 year ago

tae-mo commented 1 year ago

Hi, thanks for sharing your awesome work. The code is really readable and I learned a lot from your modularization.

I have a few questions about your intuition in designing LQD.

1. Multiple query embeddings

The paper says that each decoder layer has its own learnable query embedding to "intensify the use of query embedding". I visualized each layer's learned embeddings, but I cannot understand how using distinct query embeddings in each layer helps intensify the use of query embeddings. The figure below is what I visualized:

[Screenshot: visualization of each decoder layer's learned query embeddings]

What does each query embedding actually learn, and what intuition led you to design it that way?

2. Connection of LQD

In the vanilla transformer, the query embedding first passes through a self-attention layer and then a cross-attention layer. In LQD, each query embedding first goes through cross-attention with the encoder output and then through another cross-attention with the previous decoder output (a minimal sketch of how I read this follows below). What makes this design so good at anomaly detection? At first I thought the design helps the query embedding imitate the encoder output by repeatedly reminding it of the encoder output and the previous output, but I think I was wrong after visualizing the encoded tokens:

[Screenshot: visualization of the encoded tokens]
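
To make sure I am reading the architecture correctly, here is a minimal sketch of how I understand a single LQD layer. The module and tensor names are my own guesses, not the ones used in this repository, and the handling of the first layer's "previous output" is my assumption:

```python
import torch
import torch.nn as nn


class LQDLayerSketch(nn.Module):
    """My reading of one LQD layer: the layer's own learnable query embedding
    first cross-attends to the encoder output, then the result cross-attends
    to the previous decoder layer's output (instead of self-attn -> cross-attn)."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn_enc = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.attn_prev = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, dim * 4), nn.ReLU(), nn.Linear(dim * 4, dim))
        self.norm1, self.norm2, self.norm3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, query_embed, enc_out, prev_out):
        # 1st cross-attention: this layer's query embedding attends to the encoder output
        q = self.norm1(query_embed + self.attn_enc(query_embed, enc_out, enc_out)[0])
        # 2nd cross-attention: the result attends to the previous decoder output
        q = self.norm2(q + self.attn_prev(q, prev_out, prev_out)[0])
        return self.norm3(q + self.ffn(q))


# toy shapes: batch 2, 196 tokens (14x14 feature map), dim 256
enc_out = torch.randn(2, 196, 256)
prev_out = enc_out  # my assumption: the first decoder layer uses the encoder output as "previous output"
query_embed = nn.Parameter(torch.randn(1, 196, 256))  # one such learnable embedding per decoder layer
layer = LQDLayerSketch(256)
print(layer(query_embed.expand(2, -1, -1), enc_out, prev_out).shape)  # torch.Size([2, 196, 256])
```

Please correct me if this ordering is not what the code actually does.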

I would really appreciate any help in understanding your work!

Thanks.

zhiyuanyou commented 1 year ago

Hi. For visualization, please see Sec. 3.2 of the README. Since the query embeddings are high-dimensional features, they need to be decoded before they can be visualized. For intuition, please see Sec. 3.1 and Sec. C.2 of our paper. If you still have other questions, feel free to comment.

tae-mo commented 1 year ago

Hi, thanks for your reply.

For the visualization of the learned query embeddings, I now understand how to visualize them from Sec. 3.2 of the README. Thanks!

But even after reading Sec. 3.1 and Sec. C.2 of the paper, I still cannot understand the intuition behind the use of the previous decoder output.

What is the exact role of the previous outputs in the second attention? Are they there to gradually refine the first decoder layer's output, so that the last decoder layer can emit a fully refined reconstruction?

I appreciate your attention!

zhiyuanyou commented 1 year ago

Well, our intuition is to design a multi-layer decoder, so there must be previous outputs.

When we designed the model, we always kept num_encoder == num_decoder, following existing transformer networks. We have not tried architectures like a 4-layer encoder + 1-layer decoder, a 4-layer encoder + 2-layer decoder, or a 4-layer encoder + 3-layer decoder.

So I cannot answer why the previous outputs are strictly necessary.

You could experiment with a 4-layer encoder + 1-layer decoder to see whether a 1-layer decoder is powerful enough; a generic sketch of such a configurable-depth decoder stack is below.
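
For reference, here is a rough sketch (not our actual implementation; the names and shapes are made up) of a decoder stack with a configurable depth. Each layer owns its query embedding, and the output of layer i becomes the "previous output" for layer i+1, so setting the depth to 1 gives the 4-layer encoder + 1-layer decoder setting:

```python
import torch
import torch.nn as nn


class LQDLayerSketch(nn.Module):
    """Same two-cross-attention layer as in the sketch above, repeated so this snippet runs alone."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn_enc = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.attn_prev = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, dim * 4), nn.ReLU(), nn.Linear(dim * 4, dim))
        self.norms = nn.ModuleList([nn.LayerNorm(dim) for _ in range(3)])

    def forward(self, q, enc_out, prev_out):
        q = self.norms[0](q + self.attn_enc(q, enc_out, enc_out)[0])
        q = self.norms[1](q + self.attn_prev(q, prev_out, prev_out)[0])
        return self.norms[2](q + self.ffn(q))


class LQDStackSketch(nn.Module):
    """num_layers decoder layers, each with its own learnable query embedding.
    num_layers=1 corresponds to the 4-layer encoder + 1-layer decoder ablation."""
    def __init__(self, dim, num_tokens, num_layers):
        super().__init__()
        self.layers = nn.ModuleList([LQDLayerSketch(dim) for _ in range(num_layers)])
        self.query_embeds = nn.ParameterList(
            [nn.Parameter(torch.randn(1, num_tokens, dim)) for _ in range(num_layers)]
        )

    def forward(self, enc_out):
        prev_out = enc_out  # assumption: the first layer takes the encoder output as its "previous output"
        for layer, query in zip(self.layers, self.query_embeds):
            prev_out = layer(query.expand(enc_out.size(0), -1, -1), enc_out, prev_out)
        return prev_out  # reconstruction tokens from the last decoder layer


enc_out = torch.randn(2, 196, 256)      # toy encoder output: 14x14 tokens, dim 256
for depth in (1, 2, 3, 4):              # compare decoder depths against a fixed encoder
    print(depth, LQDStackSketch(256, 196, depth)(enc_out).shape)
```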

tae-mo commented 1 year ago

I see. I'll try that way. Thanks for the reply. Happy new year 👍🏼