This part I didn't fully understand, so I'll leave it here for the future
It has recently become a standard policy to combine CNN global image descriptors with simple average query expansion (AQE) [10], [11], [12], [27]. An initial query is issued by Euclidean search and AQE acts on the top-ranked nQE images by average pooling of their descriptors. Herein, we argue that tuning nQE to work well across different datasets is not easy. AQE corresponds to a weighted average where nQE descriptors have unit weight and all the rest zero. We generalize this scheme and we propose performing weighted averaging, where the weight of the i-th ranked image is given by (f(q)^T f(i))^α. The similarity of each retrieved image matters. We show in our experiments that AQE is difficult to tune for datasets of different statistics, while this is not the case with the proposed approach. We refer to this approach as α-weighted query expansion (αQE). The proposed αQE reduces to AQE for α = 0.
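A minimal numpy sketch of this αQE weighting, assuming all descriptors are already L2-normalized (function and array names are mine, not from the paper):

```python
import numpy as np

def alpha_qe(query, db_descriptors, n_qe=10, alpha=3.0):
    """alpha-weighted query expansion: average the query with its n_qe nearest
    neighbours, each weighted by its cosine similarity to the query raised to alpha.
    With alpha = 0 all weights become 1 and this reduces to plain AQE."""
    sims = db_descriptors @ query                        # cosine similarity (L2-normalized inputs)
    top = np.argsort(-sims)[:n_qe]                       # indices of the n_qe top-ranked images
    weights = np.clip(sims[top], 0.0, None) ** alpha     # (f(q)^T f(i))^alpha
    expanded = query + (weights[:, None] * db_descriptors[top]).sum(axis=0)
    return expanded / np.linalg.norm(expanded)           # re-normalize and re-query with this
```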
using a square root (Hellinger) kernel instead of the standard Euclidean distance to measure the similarity between SIFT descriptors leads to a dramatic performance boost
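The usual way to get this (RootSIFT) is a two-line transform; a sketch assuming non-negative SIFT descriptors stored row-wise:

```python
import numpy as np

def root_sift(descriptors, eps=1e-12):
    """RootSIFT: L1-normalize each SIFT descriptor, then take the element-wise
    square root, so that Euclidean distance on the result corresponds to the
    Hellinger kernel on the original descriptors."""
    descriptors = descriptors / (descriptors.sum(axis=1, keepdims=True) + eps)
    return np.sqrt(descriptors)
```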
Cosine softmax losses impose an L2-constraint on the features, which restricts them to lie on a hypersphere of a fixed radius
Didn't understand:
WHAT IS THE BEST PRACTICE FOR CNNS APPLIED TO VISUAL INSTANCE RETRIEVAL? Paper from 2016 containing distilled wisdom from ancient times.
In our multi-scale approach, the regional vectors from each scale are simply added together and l2-normalized to form the scale-level feature vectors. This works better than concatenating them to form a long vector
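In code this is just a sum followed by L2-normalization (sketch, assuming the per-scale vectors already have the same dimensionality):

```python
import numpy as np

def combine_scales(scale_vectors):
    """Sum the per-scale feature vectors and L2-normalize the result,
    instead of concatenating them into one long vector."""
    v = np.stack(scale_vectors, axis=0).sum(axis=0)
    return v / np.linalg.norm(v)
```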
CosFace: Large Margin Cosine Loss for Deep Face Recognition (2018). LMCL loss.
ArcFace: Additive Angular Margin Loss for Deep Face Recognition (2019). Very similar to CosFace, and the experiments are not very convincing, so it makes sense to always try both.
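The only real difference is where the margin is applied; a PyTorch sketch of both (the scale s and margin m below are just example values, not the papers' tuned settings):

```python
import torch
import torch.nn.functional as F

def margin_logits(embeddings, weight, labels, s=30.0, m=0.35, kind="cosface"):
    """Margin applied to the target-class logit only.
    CosFace (LMCL): s * (cos(theta) - m)   -- additive cosine margin
    ArcFace:        s * cos(theta + m)     -- additive angular margin
    `embeddings` (B, D) and `weight` (C, D) are L2-normalized before the product."""
    cos = F.linear(F.normalize(embeddings), F.normalize(weight))
    cos = cos.clamp(-1 + 1e-7, 1 - 1e-7)                  # keep acos numerically safe
    target = F.one_hot(labels, num_classes=weight.size(0)).bool()
    if kind == "cosface":
        cos_with_margin = cos - m
    else:                                                 # "arcface"
        cos_with_margin = torch.cos(torch.acos(cos) + m)
    return s * torch.where(target, cos_with_margin, cos)  # feed into cross-entropy
```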
After the last convolutional layer, we explore the BN [14]-Dropout [31]-FC-BN structure to get the final 512-D embedding feature.
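Roughly this in PyTorch (a sketch; the input width 2048 is an assumption, e.g. a ResNet-style backbone after global pooling):

```python
import torch.nn as nn

class EmbeddingHead(nn.Module):
    """BN-Dropout-FC-BN head on top of globally pooled backbone features,
    producing the final 512-D embedding."""

    def __init__(self, in_features=2048, embedding_size=512, p_drop=0.2):
        super().__init__()
        self.head = nn.Sequential(
            nn.BatchNorm1d(in_features),
            nn.Dropout(p_drop),
            nn.Linear(in_features, embedding_size),
            nn.BatchNorm1d(embedding_size),
        )

    def forward(self, pooled):            # pooled: (B, in_features)
        return self.head(pooled)
```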
Here is a good illustration of how all this works (taken from the Sub-center ArcFace page)
How to initialise weights in ArcFace / CosFace loss?
We found that initializing the softmax classifier weight with a normal distribution (std=0.001) generally leads to better performance. It is also important to use a larger learning rate for the classifier if the underlying CNN is already pretrained.
In the code from last year's winners, `xavier_uniform_` was used.
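Both options side by side (sketch; the class and embedding sizes are just examples):

```python
import torch
import torch.nn as nn

num_classes, embedding_size = 81313, 512          # example sizes
weight = nn.Parameter(torch.empty(num_classes, embedding_size))

# Option 1: small normal init, as noted above.
nn.init.normal_(weight, std=0.001)

# Option 2: xavier_uniform_, as in last year's winning code.
nn.init.xavier_uniform_(weight)
```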
Metrics: Acc@1, mAP@10
Cumulative Matching Characteristics (CMC) are not defined in the multi-gallery-shot case (multiple correct answers for one query). I'll do the same as in the Market-1501 dataset: find the position of the first correct image and ignore everything else.
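A sketch of that simplification (Acc@1 is then just the fraction of queries whose first correct image sits at rank 0):

```python
import numpy as np

def first_correct_rank(ranked_gallery_labels, query_label):
    """Market-1501-style CMC simplification: the position of the first correct
    gallery image; everything ranked after it is ignored."""
    hits = np.asarray(ranked_gallery_labels) == query_label
    return int(np.argmax(hits)) if hits.any() else None   # 0-based rank, None if no match
```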
Our network consists of the convolutional layers of ResNet101 pre-trained on ImageNet, followed by generalized-mean pooling (GeM), l2 normalization, a fully-connected (FC) layer, and a final l2 normalization.
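GeM pooling itself is tiny; a PyTorch sketch (p is usually initialized to 3 and can be learned or frozen):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeM(nn.Module):
    """Generalized-mean pooling over a (B, C, H, W) feature map.
    p = 1 is average pooling, p -> infinity approaches max pooling."""

    def __init__(self, p=3.0, eps=1e-6):
        super().__init__()
        self.p = nn.Parameter(torch.tensor(p))    # learnable; freeze it to keep p fixed
        self.eps = eps

    def forward(self, x):
        x = x.clamp(min=self.eps).pow(self.p)
        x = F.adaptive_avg_pool2d(x, 1).pow(1.0 / self.p)
        return x.flatten(1)                       # (B, C); l2-norm + FC follow in the paper
```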
I liked how strongly diffusion boosted their scores on both public / private datasets. Definitely a thing to try in the future.
We use diffusion, a graph-based query expansion technique, to perform retrieval. We initially construct an affinity matrix with reciprocal k-nearest neighbors (k=50), and then initiate diffusion by the 5 Euclidean nearest neighbors of the query descriptor.
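A much-simplified dense sketch of the idea (the real pipelines use regional descriptors and solve the linear system in closed form; function names and the iterative update below are my assumptions):

```python
import numpy as np

def reciprocal_knn_affinity(descriptors, k=50):
    """Affinity matrix that keeps a similarity only if i and j are in each
    other's k nearest neighbours (descriptors assumed L2-normalized)."""
    sims = descriptors @ descriptors.T
    np.fill_diagonal(sims, -np.inf)                       # no self-edges
    knn = np.argsort(-sims, axis=1)[:, :k]
    A = np.zeros_like(sims)
    rows = np.repeat(np.arange(len(sims)), k)
    A[rows, knn.ravel()] = sims[rows, knn.ravel()]
    return np.where((A > 0) & (A.T > 0), A, 0.0)          # keep reciprocal edges only

def diffuse(A, query_sims, n_query_nn=5, alpha=0.99, iters=30):
    """Random-walk diffusion f <- alpha * S f + (1 - alpha) * y, seeded by
    the query's Euclidean nearest neighbours; larger f = closer to the query."""
    d = A.sum(axis=1) + 1e-12
    S = A / np.sqrt(d[:, None] * d[None, :])              # symmetric normalization
    y = np.zeros(len(A))
    y[np.argsort(-query_sims)[:n_query_nn]] = 1.0         # seed with the 5 nearest neighbours
    f = y.copy()
    for _ in range(iters):
        f = alpha * S @ f + (1 - alpha) * y
    return f
```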
In the MAC architecture, the last convolutional layer of ResNeXt is followed by a max-pooling layer, L2-normalization layer and PCA+Whitening layer.
https://github.com/facebookresearch/faiss Library for efficient search in large vector spaces. Maybe not relevant now, but it would definitely help if the number of vectors were much bigger.
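A minimal faiss usage sketch, assuming L2-normalized global descriptors so inner product equals cosine similarity (sizes are made up):

```python
import faiss
import numpy as np

d = 512
db = np.random.rand(100_000, d).astype("float32")           # stand-in for index descriptors
faiss.normalize_L2(db)

index = faiss.IndexFlatIP(d)         # exact search; swap for an IVF/HNSW index at larger scale
index.add(db)

queries = np.random.rand(10, d).astype("float32")
faiss.normalize_L2(queries)
similarities, ids = index.search(queries, 100)               # top-100 neighbours per query
```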
The pipeline of our image retrieval system is illustrated in Fig. 1. It consists of five key steps: (1) Deep local feature (DELF) search. (2) Spatial verification (RANSAC) on top-100 results of (1). (3) Deep image retrieval (DIR) search with database-side feature augmentation (DBA). (4) Query expansion with top-5 results of (3) and the results of (2) with inliers > 40. (5) Re-ranking by regional diffusion. In the next sections, we go over the details of the different components of our pipeline, and also explain how they tie together.
https://arxiv.org/pdf/1610.07940.pdf Description of DBA and QE
One more iteration of paper readings.
The trunk model is an ImageNet pretrained BN-Inception network, with output embedding size of 128. BatchNorm parameters are frozen during training, to reduce overfitting.
Optimal parameters for different datasets. The margin is 20-40x bigger than what is used in my experiments.
A batch normalization layer is vital just before the L2-normalized features are multiplied by the weights. It is indeed mentioned in the original article, but I didn't pay attention at first, and ArcFace started working for me only after I added it.
1st place solution Google Landmark Recognition 2020
One more important thing to note here is that when calculating cosine similarity between different sets, the similarity metric benefits from similarly scaled vectors. So what we do is that we fit a QuantileTransformer (other scalers work similarly well) on the test set
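A sketch with scikit-learn's QuantileTransformer (array names are mine; the point is that the scaler is fit on the test set and then applied to every descriptor set before computing cosine similarities):

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

# Stand-ins for embeddings of two different sets.
test_embeddings = np.random.rand(2000, 512)
index_embeddings = np.random.rand(5000, 512)

qt = QuantileTransformer(output_distribution="normal")
test_scaled = qt.fit_transform(test_embeddings)              # fit on the test set, as quoted
index_scaled = qt.transform(index_embeddings)                # apply the same scaling elsewhere
```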
SuperPoint + SuperGlue for local re-ranking
3rd place: Generalized mean pooling with p frozen at 3 was used; a bottleneck structure (GEMPool(2048) -> Linear(512) -> BatchNorm1d -> CosFace(81313)) to reduce computation.
PCA learned on index set to reduce dimensionality of ensemble (512 * 8 -> 1024)
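A sketch of that reduction with scikit-learn (shapes follow the note above; the data is a random stand-in):

```python
import numpy as np
from sklearn.decomposition import PCA

index_descriptors = np.random.rand(10_000, 512 * 8).astype("float32")   # concatenated ensemble
query_descriptors = np.random.rand(1_000, 512 * 8).astype("float32")

pca = PCA(n_components=1024)
index_reduced = pca.fit_transform(index_descriptors)         # PCA learned on the index set only
query_reduced = pca.transform(query_descriptors)             # same projection for the queries
```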
According to the above problems, we select SURF [2] and Hessian-Affine [11, 12] RootSIFT [1] as our local feature method. Our local feature image retrieval system is based on nearest neighbor search. To speed up the nearest neighbor search, we construct an inverted index implemented by k-means clustering with 512 centers; for each point of the query image we select the top 20 clustering centers and search the top-1 clustering center over all point descriptors in the database set.
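Roughly the same inverted-index idea can be reproduced with faiss (a sketch, not their code; 512 coarse centroids, probing the top 20 at query time):

```python
import faiss
import numpy as np

d = 128                                                      # e.g. RootSIFT dimensionality
db = np.random.rand(200_000, d).astype("float32")            # local descriptors of the database

quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFFlat(quantizer, d, 512)                # k-means inverted index, 512 centers
index.train(db)
index.add(db)
index.nprobe = 20                                            # search the 20 closest clusters

query_descriptors = np.random.rand(500, d).astype("float32")
distances, ids = index.search(query_descriptors, 1)          # top-1 match per query descriptor
```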
Label smoothing [19] is used when training the model, and the soft-label parameter is set to 0.1 or 0.2.
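In PyTorch this is just the built-in option on the cross-entropy loss (sketch; the class count is an arbitrary example):

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss(label_smoothing=0.1)         # they also tried 0.2
logits = torch.randn(8, 1000)                                # e.g. scaled CosFace/ArcFace logits
labels = torch.randint(0, 1000, (8,))
loss = criterion(logits, labels)
```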
Some wild life hack that gives a big jump in landmark retrieval thanks to improved filtering of junk images.
The competition is over; here is a long list of papers I didn't have time to read / implement and repos I didn't have time to look through / copy:
Note: in Google Landmarks the task was only to generate good features. No post-processing was allowed, so it's not covered here.
3rd Place Solution to “Google Landmark Retrieval 2020”
Triplet loss. Often mentioned in earlier papers. Takes a triplet (an anchor input, a positive input, and a negative input) and tries to pull embeddings of similar objects close to each other while pushing dissimilar ones further apart. The distance is Euclidean.
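PyTorch ships it as nn.TripletMarginLoss; a sketch with random embeddings (the margin value is only an example):

```python
import torch
import torch.nn as nn

triplet = nn.TripletMarginLoss(margin=0.3, p=2)              # Euclidean distance (p=2)

anchor = torch.randn(32, 512)
positive = torch.randn(32, 512)                              # same class as the anchor
negative = torch.randn(32, 512)                              # different class
loss = triplet(anchor, positive, negative)
```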