This part I didn't fully understand, so I'll leave it here for the future
It has recently become a standard policy to combine CNN global image descriptors with simple average query expansion (AQE) [10], [11], [12], [27]. An initial query is issued by Euclidean search and AQE acts on the top-ranked nQE images by average pooling of their descriptors. Herein, we argue that tuning nQE to work well across different datasets is not easy. AQE corresponds to a weighted average where nQE descriptors have unit weight and all the rest zero. We generalize this scheme and we propose performing weighted averaging, where the weight of the i-th ranked image is given by (f(q)^T f(i))^α. The similarity of each retrieved image matters. We show in our experiments that AQE is difficult to tune for datasets of different statistics, while this is not the case with the proposed approach. We refer to this approach as α-weighted query expansion (αQE). The proposed αQE reduces to AQE for α = 0.
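A minimal numpy sketch of this αQE weighting, assuming all descriptors are already L2-normalized (function and array names are mine, not from the paper):

```python
import numpy as np

def alpha_qe(query, db_descriptors, n_qe=10, alpha=3.0):
    """alpha-weighted query expansion: average the query with its n_qe nearest
    neighbours, each weighted by its cosine similarity to the query raised to alpha.
    With alpha = 0 all weights become 1 and this reduces to plain AQE."""
    sims = db_descriptors @ query                        # cosine similarity (L2-normalized inputs)
    top = np.argsort(-sims)[:n_qe]                       # indices of the n_qe top-ranked images
    weights = np.clip(sims[top], 0.0, None) ** alpha     # (f(q)^T f(i))^alpha
    expanded = query + (weights[:, None] * db_descriptors[top]).sum(axis=0)
    return expanded / np.linalg.norm(expanded)           # re-normalize and re-query with this
```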
using a square root (Hellinger) kernel instead of the standard Euclidean distance to measure the similarity between SIFT descriptors leads to a dramatic performance boost
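The usual way to get this (RootSIFT) is a two-line transform; a sketch assuming non-negative SIFT descriptors stored row-wise:

```python
import numpy as np

def root_sift(descriptors, eps=1e-12):
    """RootSIFT: L1-normalize each SIFT descriptor, then take the element-wise
    square root, so that Euclidean distance on the result corresponds to the
    Hellinger kernel on the original descriptors."""
    descriptors = descriptors / (descriptors.sum(axis=1, keepdims=True) + eps)
    return np.sqrt(descriptors)
```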
Cosine softmax losses impose an L2-constraint on the features, which restricts them to lie on a hypersphere of a fixed radius
Didn't understand:
WHAT IS THE BEST PRACTICE FOR CNNS APPLIED TO VISUAL INSTANCE RETRIEVAL? Paper from 2016 containing distilled wisdom from ancient times.
In our multi-scale approach, the regional vectors from each scale are simply added together and l2-normalized to form the scale-level feature vectors. This works better than concatenating them to form a long vector
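In code this is just a sum followed by L2-normalization (sketch, assuming the per-scale vectors already have the same dimensionality):

```python
import numpy as np

def combine_scales(scale_vectors):
    """Sum the per-scale feature vectors and L2-normalize the result,
    instead of concatenating them into one long vector."""
    v = np.stack(scale_vectors, axis=0).sum(axis=0)
    return v / np.linalg.norm(v)
```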
CosFace: Large Margin Cosine Loss for Deep Face Recognition (2018). LMCL loss.
ArcFace: Additive Angular Margin Loss for Deep Face Recognition (2019). Very similar to CosFace, and the experiments are not very convincing, so it makes sense to always try both.
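The only real difference is where the margin is applied; a PyTorch sketch of both (the scale s and margin m below are just example values, not the papers' tuned settings):

```python
import torch
import torch.nn.functional as F

def margin_logits(embeddings, weight, labels, s=30.0, m=0.35, kind="cosface"):
    """Margin applied to the target-class logit only.
    CosFace (LMCL): s * (cos(theta) - m)   -- additive cosine margin
    ArcFace:        s * cos(theta + m)     -- additive angular margin
    `embeddings` (B, D) and `weight` (C, D) are L2-normalized before the product."""
    cos = F.linear(F.normalize(embeddings), F.normalize(weight))
    cos = cos.clamp(-1 + 1e-7, 1 - 1e-7)                  # keep acos numerically safe
    target = F.one_hot(labels, num_classes=weight.size(0)).bool()
    if kind == "cosface":
        cos_with_margin = cos - m
    else:                                                 # "arcface"
        cos_with_margin = torch.cos(torch.acos(cos) + m)
    return s * torch.where(target, cos_with_margin, cos)  # feed into cross-entropy
```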
After the last convolutional layer, we explore the BN [14]-Dropout [31]-FC-BN structure to get the final 512-D embedding feature.
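Roughly this in PyTorch (a sketch; the input width 2048 is an assumption, e.g. a ResNet-style backbone after global pooling):

```python
import torch.nn as nn

class EmbeddingHead(nn.Module):
    """BN-Dropout-FC-BN head on top of globally pooled backbone features,
    producing the final 512-D embedding."""

    def __init__(self, in_features=2048, embedding_size=512, p_drop=0.2):
        super().__init__()
        self.head = nn.Sequential(
            nn.BatchNorm1d(in_features),
            nn.Dropout(p_drop),
            nn.Linear(in_features, embedding_size),
            nn.BatchNorm1d(embedding_size),
        )

    def forward(self, pooled):            # pooled: (B, in_features)
        return self.head(pooled)
```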
Here is a good illustration of how all this works (taken from the Sub-center ArcFace page)
How to initialise weights in ArcFace / CosFace loss?
We found that initializing the softmax classifier weight with a normal distribution (std=0.001) generally leads to better performance. It is also important to use a larger learning rate for the classifier if the underlying CNN is already pretrained.
In the code from last year's winners, `xavier_uniform_` was used.
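Both options side by side (sketch; the class and embedding sizes are just examples):

```python
import torch
import torch.nn as nn

num_classes, embedding_size = 81313, 512          # example sizes
weight = nn.Parameter(torch.empty(num_classes, embedding_size))

# Option 1: small normal init, as noted above.
nn.init.normal_(weight, std=0.001)

# Option 2: xavier_uniform_, as in last year's winning code.
nn.init.xavier_uniform_(weight)
```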
Metrics: Acc@1, mAP@10
Cumulative Matching Characteristics (CMC) are not defined in the multi-gallery-shot case (multiple correct answers for one query). I'll do the same as in the Market-1501 dataset: find the position of the first correct image and ignore everything else.
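A sketch of that simplification (Acc@1 is then just the fraction of queries whose first correct image sits at rank 0):

```python
import numpy as np

def first_correct_rank(ranked_gallery_labels, query_label):
    """Market-1501-style CMC simplification: the position of the first correct
    gallery image; everything ranked after it is ignored."""
    hits = np.asarray(ranked_gallery_labels) == query_label
    return int(np.argmax(hits)) if hits.any() else None   # 0-based rank, None if no match
```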
Our network consists of the convolutional layers of ResNet101 pre-trained on ImageNet, followed by generalized-mean pooling (GeM), l2 normalization, a fully-connected (FC) layer, and a final l2 normalization.
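GeM pooling itself is tiny; a PyTorch sketch (p is usually initialized to 3 and can be learned or frozen):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeM(nn.Module):
    """Generalized-mean pooling over a (B, C, H, W) feature map.
    p = 1 is average pooling, p -> infinity approaches max pooling."""

    def __init__(self, p=3.0, eps=1e-6):
        super().__init__()
        self.p = nn.Parameter(torch.tensor(p))    # learnable; freeze it to keep p fixed
        self.eps = eps

    def forward(self, x):
        x = x.clamp(min=self.eps).pow(self.p)
        x = F.adaptive_avg_pool2d(x, 1).pow(1.0 / self.p)
        return x.flatten(1)                       # (B, C); l2-norm + FC follow in the paper
```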
I liked how strongly diffusion boosted their scores on both public / private datasets. Definitely a thing to try in the future.
We use diffusion, a graph-based query expansion technique, to perform retrieval. We initially construct an affinity matrix with reciprocal k-nearest neighbors (k=50), and then initiate diffusion by the 5 Euclidean nearest neighbors of the query descriptor.
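A much-simplified dense sketch of the idea (the real pipelines use regional descriptors and solve the linear system in closed form; function names and the iterative update below are my assumptions):

```python
import numpy as np

def reciprocal_knn_affinity(descriptors, k=50):
    """Affinity matrix that keeps a similarity only if i and j are in each
    other's k nearest neighbours (descriptors assumed L2-normalized)."""
    sims = descriptors @ descriptors.T
    np.fill_diagonal(sims, -np.inf)                       # no self-edges
    knn = np.argsort(-sims, axis=1)[:, :k]
    A = np.zeros_like(sims)
    rows = np.repeat(np.arange(len(sims)), k)
    A[rows, knn.ravel()] = sims[rows, knn.ravel()]
    return np.where((A > 0) & (A.T > 0), A, 0.0)          # keep reciprocal edges only

def diffuse(A, query_sims, n_query_nn=5, alpha=0.99, iters=30):
    """Random-walk diffusion f <- alpha * S f + (1 - alpha) * y, seeded by
    the query's Euclidean nearest neighbours; larger f = closer to the query."""
    d = A.sum(axis=1) + 1e-12
    S = A / np.sqrt(d[:, None] * d[None, :])              # symmetric normalization
    y = np.zeros(len(A))
    y[np.argsort(-query_sims)[:n_query_nn]] = 1.0         # seed with the 5 nearest neighbours
    f = y.copy()
    for _ in range(iters):
        f = alpha * S @ f + (1 - alpha) * y
    return f
```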
In the MAC architecture, the last convolutional layer of ResNeXt is followed by a max-pooling layer, L2-normalization layer and PCA+Whitening layer.
https://github.com/facebookresearch/faiss Library for efficient search in large vector spaces. Maybe not relevant now, but it would definitely help if the number of vectors were much bigger.
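A minimal faiss usage sketch, assuming L2-normalized global descriptors so inner product equals cosine similarity (sizes are made up):

```python
import faiss
import numpy as np

d = 512
db = np.random.rand(100_000, d).astype("float32")           # stand-in for index descriptors
faiss.normalize_L2(db)

index = faiss.IndexFlatIP(d)         # exact search; swap for an IVF/HNSW index at larger scale
index.add(db)

queries = np.random.rand(10, d).astype("float32")
faiss.normalize_L2(queries)
similarities, ids = index.search(queries, 100)               # top-100 neighbours per query
```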
The pipeline of our image retrieval system is illustrated in Fig. 1. It consists of five key steps: (1) Deep local feature (DELF) search. (2) Spatial verification (RANSAC) on top-100 results of (1). (3) Deep image retrieval (DIR) search with database-side feature augmentation (DBA). (4) Query expansion with top-5 results of (3) and the results of (2) with inliers > 40. (5) Re-ranking by regional diffusion. In the next sections, we go over the details of the different components of our pipeline, and also explain how they tie together.
https://arxiv.org/pdf/1610.07940.pdf Description of DBA and QE
One more iteration of paper readings.
The trunk model is an ImageNet pretrained BN-Inception network, with output embedding size of 128. BatchNorm parameters are frozen during training, to reduce overfitting.
Optimal parameters for different datasets. The margin is 20-40x bigger than what is used in my experiments.
A batch normalization layer is vital just before the L2-normalized features are multiplied by the weights. It is indeed mentioned in the original article, but I didn't pay attention at first, and ArcFace started working for me only after I added it.
1st place solution Google Landmark Recognition 2020
One more important thing to note here is that when calculating cosine similarity between different sets, the similarity metric benefits from similarly scaled vectors. So what we do is that we fit a QuantileTransformer (other scalers work similarly well) on the test set
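A sketch with scikit-learn's QuantileTransformer (array names are mine; the point is that the scaler is fit on the test set and then applied to every descriptor set before computing cosine similarities):

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

# Stand-ins for embeddings of two different sets.
test_embeddings = np.random.rand(2000, 512)
index_embeddings = np.random.rand(5000, 512)

qt = QuantileTransformer(output_distribution="normal")
test_scaled = qt.fit_transform(test_embeddings)              # fit on the test set, as quoted
index_scaled = qt.transform(index_embeddings)                # apply the same scaling elsewhere
```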
SuperPoint + SuperGlue for local re-ranking
3rd place: Generalized mean pooling with p frozen at 3 was used; a bottleneck structure (GEMPool(2048) -> Linear(512) -> BatchNorm1d -> CosFace(81313)) to reduce computation.
PCA learned on index set to reduce dimensionality of ensemble (512 * 8 -> 1024)
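A sketch of that reduction with scikit-learn (shapes follow the note above; the data is a random stand-in):

```python
import numpy as np
from sklearn.decomposition import PCA

index_descriptors = np.random.rand(10_000, 512 * 8).astype("float32")   # concatenated ensemble
query_descriptors = np.random.rand(1_000, 512 * 8).astype("float32")

pca = PCA(n_components=1024)
index_reduced = pca.fit_transform(index_descriptors)         # PCA learned on the index set only
query_reduced = pca.transform(query_descriptors)             # same projection for the queries
```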
According to the above problems, we select SURF [2] and Hessian-Affine [11, 12] RootSIFT [1] as our local feature method. Our local feature image retrieval system is based on nearest neighbor search. To speed up the nearest neighbor search, we construct an inverted index implemented by k-means clustering with 512 centers; for each point of the query image we select the top 20 clustering centers and search the top-1 clustering center over all point descriptors in the database set.
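Roughly the same inverted-index idea can be reproduced with faiss (a sketch, not their code; 512 coarse centroids, probing the top 20 at query time):

```python
import faiss
import numpy as np

d = 128                                                      # e.g. RootSIFT dimensionality
db = np.random.rand(200_000, d).astype("float32")            # local descriptors of the database

quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFFlat(quantizer, d, 512)                # k-means inverted index, 512 centers
index.train(db)
index.add(db)
index.nprobe = 20                                            # search the 20 closest clusters

query_descriptors = np.random.rand(500, d).astype("float32")
distances, ids = index.search(query_descriptors, 1)          # top-1 match per query descriptor
```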
Label smoothing [19] is used when training the model, and the soft-label parameter is set to 0.1 or 0.2.
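In PyTorch this is just the built-in option on the cross-entropy loss (sketch; the class count is an arbitrary example):

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss(label_smoothing=0.1)         # they also tried 0.2
logits = torch.randn(8, 1000)                                # e.g. scaled CosFace/ArcFace logits
labels = torch.randint(0, 1000, (8,))
loss = criterion(logits, labels)
```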
Some wild life hack that gives a big jump in landmark retrieval thanks to improved filtering of junk images.
The competition is over; here is a long list of papers I didn't have time to read / implement and repos I didn't have time to look through / copy:
Note: in Google Landmarks the task was only to generate good features. No post-processing was allowed, so it's not covered here.
3rd Place Solution to “Google Landmark Retrieval 2020”
Triplet loss. Often mentioned in earlier papers. Takes a triplet (an anchor input, a positive input, and a negative input) and tries to pull embeddings of similar objects close to each other while pushing dissimilar ones further apart. The distance is Euclidean.
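PyTorch ships it as nn.TripletMarginLoss; a sketch with random embeddings (the margin value is only an example):

```python
import torch
import torch.nn as nn

triplet = nn.TripletMarginLoss(margin=0.3, p=2)              # Euclidean distance (p=2)

anchor = torch.randn(32, 512)
positive = torch.randn(32, 512)                              # same class as the anchor
negative = torch.randn(32, 512)                              # different class
loss = triplet(anchor, positive, negative)
```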