xz-liu / ClusterEA

Source code for the KDD 2022 paper "ClusterEA: Scalable Entity Alignment with Stochastic Training and Normalized Mini-batch Similarities"

Why do test pairs show up in the batch_sim matrix generation? #3

Closed. AdFiFi closed this issue 2 weeks ago

AdFiFi commented 2 weeks ago

I noticed this because I found that batch_sim is a square matrix. But batch_sim should not be square, since you can't exclude nodes that don't have counterparts without some other knowledge. Yet in lines 101-119 (the get_eval_ids() method), you directly select the matching set of nodes from the test set. Isn't the knowledge in the test set supposed to be unknown?

xz-liu commented 2 weeks ago

I believe it is common practice to restrict the range of matchable entities to the test set before computing similarity, which, in my view, relies on the 1-to-1 mapping assumption. This evaluation setting is widely used in other repositories as well. When running the evaluation code, the embeddings are typically sorted and filtered based on the test pairs before proceeding with the evaluation.
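For concreteness, here is a minimal sketch of that filtered-evaluation setting. The names (`src_emb`, `trg_emb`, `test_pairs`, `filtered_similarity`) are illustrative, not the actual identifiers used in this repo:

```python
import torch
import torch.nn.functional as F

# Minimal sketch of filtered evaluation (illustrative names, not the repo's API).
# src_emb / trg_emb: full embedding matrices for the two KGs.
# test_pairs: LongTensor of shape (num_test, 2) holding ground-truth
# (source_id, target_id) test alignments.
def filtered_similarity(src_emb, trg_emb, test_pairs):
    src_ids, trg_ids = test_pairs[:, 0], test_pairs[:, 1]
    # Restrict both sides to test entities before computing similarity;
    # this restriction is what makes the resulting matrix square.
    src = F.normalize(src_emb[src_ids], dim=1)
    trg = F.normalize(trg_emb[trg_ids], dim=1)
    sim = src @ trg.t()  # (num_test, num_test); row i's gold match is column i
    return sim
```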

AdFiFi commented 2 weeks ago

Using ground-truth counterparts as candidates?

xz-liu commented 2 weeks ago

Yes. In almost all papers, only the test pairs are considered when evaluating the embeddings. Our paper introduces small blocks to allow for scalability, and this filtering is implemented within those small blocks, which is equivalent to filtering globally during the evaluation.

You can find similar implementations in OpenEA and DualAMN. I believe this approach adheres to the assumption of 1-to-1 mapping.
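A rough sketch of why per-block filtering matches global filtering, under the assumption that every test source entity lands in exactly one block together with its ground-truth target (so block-level Hits@1 counts add up to the global count); `blocks` and `gold` are illustrative names:

```python
import torch
import torch.nn.functional as F

# Sketch: aggregate Hits@1 over blocks. Assumes each test source entity
# appears in exactly one block alongside its ground-truth counterpart.
def blockwise_hits_at_1(blocks):
    correct, total = 0, 0
    for src, trg, gold in blocks:  # gold[i] = index of src[i]'s match within trg
        sim = F.normalize(src, dim=1) @ F.normalize(trg, dim=1).t()
        correct += (sim.argmax(dim=1) == gold).sum().item()
        total += gold.numel()
    return correct / total
```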

If you are interested in exploring beyond the 1-to-1 mapping assumption, you may want to look into the paper on knowledge graph alignment with dangling cases.

Thank you so much for your interest in our work. We are open to questions at any time.

AdFiFi commented 2 weeks ago

But why aren't global_matrix and global_matrix_t in main.py square matrices? Their sizes are the numbers of nodes in the source graph and the target graph, right?

xz-liu commented 2 weeks ago

Yes. As I recall, I used a sparse matrix so that the similarity matrix would not include the filtered-out entries. This allows for filtered evaluation even though the matrix keeps its full nominal size. It was the most convenient way to implement it, since we need sparse matrices to store similarities between a large number of items anyway; the matrix size is just metadata and does not reflect the actual amount of data stored.
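To illustrate the point that the size is just metadata, here is a minimal sketch using a PyTorch sparse COO tensor; the sizes and entries below are made up, not taken from the repo:

```python
import torch

# Illustrative sizes and entries: the nominal shape covers both full
# graphs, but only the filtered (test) entries are actually stored.
n_src, n_trg = 100_000, 100_000
rows = torch.tensor([3, 3, 42])           # filtered source entity ids
cols = torch.tensor([7, 9, 42])           # candidate target entity ids
vals = torch.tensor([0.91, 0.12, 0.88])   # their similarity scores

global_matrix = torch.sparse_coo_tensor(
    torch.stack([rows, cols]), vals, size=(n_src, n_trg)).coalesce()

print(global_matrix.shape)             # torch.Size([100000, 100000])
print(global_matrix.values().numel())  # 3 stored similarities
```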

You could help me check whether this implementation is correct; if it is not, fixing it would probably get you a better score than mine.

AdFiFi commented 2 weeks ago

Thank you for your answers and for sharing; they help me understand this work better.