zakajd / huawei2020

Solution of Huawei Digix Global AI Challenge

Paper review #1

zakajd opened this issue 4 years ago

zakajd commented 4 years ago
  1. 1st Place Solution to Google Landmark Retrieval 2020
    • Used global average pooling (GAP) after the feature extractor
    • Reduced feature dimension to 512
    • Applied a cosine-softmax classifier over all classes; the scale value was determined automatically by fixed AdaCos, and the margin was set to 0 (a minimal head sketch is given below, after the note)
    • Weighted CE to deal with imbalanced classes
    • Progressive increase in image sizes reliably boosted scores

Note: in the Google Landmarks competition the task was only to generate good features. No post-processing was allowed, so it is not covered here.
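
A minimal PyTorch sketch of such a head, as I understand it: GAP, a 512-d embedding, and a cosine softmax with the fixed AdaCos scale s = sqrt(2) * log(C - 1) and margin 0. The class name and defaults are mine, not the authors' code.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class CosineSoftmaxHead(nn.Module):
    """GAP -> 512-d embedding -> cosine softmax with a fixed AdaCos scale, margin = 0."""
    def __init__(self, in_channels: int, num_classes: int, embedding_size: int = 512):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)          # GAP after the feature extractor
        self.fc = nn.Linear(in_channels, embedding_size)
        self.weight = nn.Parameter(torch.empty(num_classes, embedding_size))
        nn.init.xavier_uniform_(self.weight)
        # Fixed AdaCos scale: s = sqrt(2) * log(C - 1)
        self.scale = math.sqrt(2) * math.log(num_classes - 1)

    def forward(self, feature_map: torch.Tensor) -> torch.Tensor:
        x = self.pool(feature_map).flatten(1)
        x = self.fc(x)                               # 512-d descriptor used for retrieval
        # cosine similarity between L2-normalized embeddings and class weights
        logits = F.linear(F.normalize(x), F.normalize(self.weight))
        return self.scale * logits                   # feed into weighted cross-entropy
```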

  1. 2nd Place Solution to Google Landmark Retrieval Competition 2020
  1. 3rd Place Solution to “Google Landmark Retrieval 2020”

    • Post-processing methods: DBA [1], QE [12] and re-ranking.
    • Used Corner-Cutmix. Reasoning:
      1. Important features are often in the centre, so we don't want to cover it.
      2. The network learns to look at the image at different scales, which is useful for real-life scenarios.
  2. Triplet loss. Often mentioned in earlier papers. Takes a triplet (anchor, positive and negative input) and tries to pull embeddings of similar objects close together while pushing dissimilar ones further apart. The distance is Euclidean (a minimal sketch is given below).

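
A minimal sketch with PyTorch's built-in nn.TripletMarginLoss; the margin value 0.3 and tensor shapes are arbitrary examples, not taken from any of the papers above.

```python
import torch
import torch.nn as nn

# Toy anchor / positive / negative embeddings, e.g. produced by the same backbone
anchor = torch.randn(8, 128, requires_grad=True)
positive = torch.randn(8, 128, requires_grad=True)
negative = torch.randn(8, 128, requires_grad=True)

# Euclidean (p=2) triplet loss: pulls anchor-positive together and
# pushes anchor-negative apart by at least `margin`
criterion = nn.TripletMarginLoss(margin=0.3, p=2)
loss = criterion(anchor, positive, negative)
loss.backward()
```
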
zakajd commented 4 years ago
  1. Fine-tuning CNN Image Retrieval with No Human Annotation (2018). Introduced Generalized Mean pooling (GeM)
    • Fine-tuning models trained on ImageNet on query data boosts performance
    • Learning the whitening from the training data works better than computing it on the short representations, but it is slower
    • Hard-positive mining could boost performance and is an under-researched area; hard-negative mining is a standard process [6], [16]
    • Typical architectures for metric learning: two-branch siamese [39], [40], [41] and triplet networks [42], [43], [44]. They employ matching and non-matching pairs to perform the training.
    • The last layer in the CNN should be a ReLU, so that all features are non-negative
    • GeM is the same as LPPool with L2 normalization afterwards. The GeM power parameter is learned, but in practice it ends up close to 3 (a sketch follows below)
    • Contrastive loss generalizes better and converges at higher performance than the triplet loss
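
A sketch of GeM pooling in PyTorch, roughly along the lines of the paper's public implementation (the class itself is my own sketch, exact details may differ):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeM(nn.Module):
    """Generalized Mean pooling: (mean(x^p))^(1/p) over the spatial dims.
    p = 1 gives average pooling, p -> inf approaches max pooling;
    the learned p usually ends up near 3."""
    def __init__(self, p: float = 3.0, eps: float = 1e-6):
        super().__init__()
        self.p = nn.Parameter(torch.ones(1) * p)   # learnable power
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x.clamp(min=self.eps).pow(self.p)      # ReLU features are >= 0, clamp for stability
        x = F.adaptive_avg_pool2d(x, 1).pow(1.0 / self.p)
        return x.flatten(1)                        # (B, C) descriptor, L2-normalize afterwards
```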

This part I didn't fully understand, so I will leave it here for the future:

It has recently become a standard policy to combine CNN global image descriptors with simple average query expansion (AQE) [10], [11], [12], [27]. An initial query is issued by Euclidean search and AQE acts on the top-ranked nQE images by average pooling of their descriptors. Herein, we argue that tuning nQE to work well across different datasets is not easy. AQE corresponds to a weighted average where nQE descriptors have unit weight and all the rest zero. We generalize this scheme and we propose performing weighted averaging, where the weight of the i-th ranked image is given by (f(q)^T f (i))^\alpha . The similarity of each retrieved image matters. We show in our experiments that AQE is difficult to tune for datasets of different statistics, while this is not the case with the proposed approach. We refer to this approach as α-weighted query expansion (αQE). The proposed αQE reduces to AQE for α = 0.
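
My reading of αQE as a NumPy sketch; the function name and defaults are mine, and alpha = 0 recovers plain AQE over the top nQE images:

```python
import numpy as np

def alpha_query_expansion(query, db, n_qe=50, alpha=3.0):
    """Alpha-weighted query expansion (alphaQE) for one L2-normalized query.
    query: (d,), db: (n, d) L2-normalized descriptors."""
    sims = db @ query                                  # cosine similarities
    top = np.argsort(-sims)[:n_qe]                     # nQE top-ranked images
    weights = np.clip(sims[top], 0, None) ** alpha     # weight_i = (f(q)^T f(i))^alpha
    expanded = query + (weights[:, None] * db[top]).sum(axis=0)
    return expanded / np.linalg.norm(expanded)         # re-normalize and search again
```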

zakajd commented 4 years ago
  1. Three things everyone should know to improve object retrieval

Using a square root (Hellinger) kernel instead of the standard Euclidean distance to measure the similarity between SIFT descriptors leads to a dramatic performance boost.
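
This trick is usually implemented as the RootSIFT mapping; a small sketch (function name mine):

```python
import numpy as np

def root_sift(descriptors, eps=1e-12):
    """Map SIFT descriptors to RootSIFT: L1-normalize, then take the element-wise
    square root. Euclidean distance between RootSIFT vectors then corresponds to
    the Hellinger kernel on the original SIFT descriptors."""
    descriptors = descriptors / (np.abs(descriptors).sum(axis=1, keepdims=True) + eps)
    return np.sqrt(descriptors)
```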

zakajd commented 4 years ago
  1. [Two-stage Discriminative Re-ranking for Large-scale Landmark Retrieval](https://arxiv.org/pdf/2003.11211.pdf). Another paper with code: https://github.com/lyakaap/Landmark2019-1st-and-3rd-Place-Solution. This is the winning solution of the Google Landmark Retrieval Challenge 2019.
    • GeM for pooling
    • Feature size: 512
    • Training with "soft" augs first, then with "hard"
    • Split images by aspect ratio to avoid strong resize distortions
    • Bigger images towards the end of training
    • Concatenate the descriptors of different models into a single vector (512 * 6 = 3072)
    • TTA over scale factors; descriptors are then averaged
      • Scales: [2^(-1/2), 1.0, 2^(1/2)]
    • Similarity search: brute-force Euclidean search with L2-normalized descriptors (a sketch of the multi-scale extraction and search is given below)
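
A rough sketch of the multi-scale TTA extraction and the brute-force search under these settings. `model` is assumed to map an image batch to global descriptors; all names here are mine, not the authors' code.

```python
import torch
import torch.nn.functional as F

SCALES = [2 ** -0.5, 1.0, 2 ** 0.5]

@torch.no_grad()
def extract_descriptor(model, image):
    """image: (1, 3, H, W). Returns one L2-normalized descriptor averaged over TTA scales."""
    descs = []
    for s in SCALES:
        resized = F.interpolate(image, scale_factor=s, mode="bilinear", align_corners=False)
        descs.append(F.normalize(model(resized), dim=1))
    desc = torch.stack(descs).mean(dim=0)       # average over TTA scales
    return F.normalize(desc, dim=1)             # final L2-normalized descriptor

def search(query_descs, index_descs, k=100):
    """Brute-force search; with L2-normalized vectors, ranking by Euclidean
    distance is equivalent to ranking by cosine similarity."""
    sims = query_descs @ index_descs.T
    return sims.topk(k, dim=1).indices
```
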

Didn't understand:

WHAT IS THE BEST PRACTICE FOR CNNS APPLIED TO VISUAL INSTANCE RETRIEVAL? Paper from 2016 containing distilled wisdom from ancient times.

In our multi-scale approach, the regional vectors from each scale are simply added together and L2-normalized to form the scale-level feature vectors. This works better than concatenating them into a long vector.

zakajd commented 4 years ago
  1. CosFace: Large Margin Cosine Loss for Deep Face Recognition (2018). Introduces the LMCL loss.

  2. ArcFace: Additive Angular Margin Loss for Deep Face Recognition (2019). Very similar to CosFace, and the experiments are not very convincing, so it makes sense to always try both.


After the last convolutional layer, we explore the BN [14]-Dropout [31]-FC-BN structure to get the final 512-D embedding feature.

There is a good illustration of how all this works on the Sub-center ArcFace page.
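
A hedged sketch of how the pieces fit together: the BN-Dropout-FC-BN neck from the quote above, followed by a large-margin cosine classifier. For brevity I apply the margin CosFace-style (subtracted from the cosine); ArcFace instead adds the margin to the angle. The class name, scale and margin defaults are mine.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MarginHead(nn.Module):
    """BN -> Dropout -> FC -> BN embedding neck plus a large-margin cosine classifier."""
    def __init__(self, in_features, num_classes, embedding_size=512, s=30.0, m=0.35, p_drop=0.2):
        super().__init__()
        self.neck = nn.Sequential(
            nn.BatchNorm1d(in_features),
            nn.Dropout(p_drop),
            nn.Linear(in_features, embedding_size),
            nn.BatchNorm1d(embedding_size),
        )
        self.weight = nn.Parameter(torch.empty(num_classes, embedding_size))
        nn.init.xavier_uniform_(self.weight)
        self.s, self.m = s, m

    def forward(self, x, labels=None):
        emb = self.neck(x)
        cosine = F.linear(F.normalize(emb), F.normalize(self.weight))
        if labels is None:                             # inference: return embedding for retrieval
            return emb
        one_hot = F.one_hot(labels, cosine.size(1)).to(cosine.dtype)
        return self.s * (cosine - self.m * one_hot)    # CosFace-style logits for cross-entropy
```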

zakajd commented 4 years ago

How to initialise weights in ArcFace / CosFace loss?

We found that initializing the softmax classifier weight with a normal distribution, std=0.001, generally leads to better performance. It is also important to use a larger learning rate for the classifier if the underlying CNN is already pretrained.

In the code from last year's winners, xavier_uniform_ was used.
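
For illustration, the two initialisation options side by side, plus a larger learning rate for the classifier; `backbone` and `classifier_weight` are placeholders, not code from any winning solution.

```python
import torch
import torch.nn as nn
from torch.optim import SGD

# Stand-ins: a pretrained backbone and a cosine-margin classifier weight of shape
# (num_classes, embedding_size); both are placeholders for the real model.
backbone = nn.Sequential(nn.Flatten(), nn.Linear(2048, 512))
classifier_weight = nn.Parameter(torch.empty(81313, 512))

# Option 1: normal init with a small std, as in the quote above
nn.init.normal_(classifier_weight, std=0.001)
# Option 2: xavier_uniform_, as used in last year's winning code
# nn.init.xavier_uniform_(classifier_weight)

# A larger learning rate for the classifier than for the pretrained backbone
optimizer = SGD(
    [{"params": backbone.parameters(), "lr": 1e-3},
     {"params": [classifier_weight], "lr": 1e-2}],
    momentum=0.9,
)
```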

Metrics: Acc@1, mAP@10

Cumulative Matching Characteristics (CMC) is not uniquely defined in the multi-gallery-shot case (multiple correct answers for one image). I'll do the same as in the Market-1501 dataset: find the position of the first correct image and ignore everything else (see the sketch below).
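
A small NumPy sketch of that convention; the function names are mine, and it assumes every query has at least one correct gallery image.

```python
import numpy as np

def first_correct_rank(sims, query_labels, gallery_labels):
    """For each query, the 0-based rank of the first gallery image with the same label.
    sims: (n_query, n_gallery) similarity matrix."""
    order = np.argsort(-sims, axis=1)                       # gallery sorted by similarity
    matches = gallery_labels[order] == query_labels[:, None]
    return matches.argmax(axis=1)                           # position of the first True

def acc_at_k(sims, query_labels, gallery_labels, k=1):
    ranks = first_correct_rank(sims, query_labels, gallery_labels)
    return float((ranks < k).mean())
```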

  1. Image Retrieval Based on Learning to Rank and Multiple Loss
    • Query expansion can drastically boost performance. AQE and alphaQE are analysed, and the latter shows better and more stable results (alpha=5, nQE=50). Descriptors of all those images are summed, re-normalized and used in a new search.
    • Model architecture: remove the last pooling and fully-connected layers; add GeM pooling, Lw whitening (didn't fully understand; a PCA-whitening sketch is given below) and L2 normalisation.
    • Multi-scale evaluation at test time: resize images, collect features and sum them (or concatenate?)
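
I'm not reproducing Lw itself here; as a stand-in, a minimal PCA-whitening sketch fitted on training descriptors (the learned Lw whitening additionally uses matching pairs, so this is only the unsupervised cousin).

```python
import numpy as np

def fit_pca_whitening(X, out_dim=512, eps=1e-6):
    """Fit PCA whitening on training descriptors X of shape (n, d)."""
    mean = X.mean(axis=0)
    cov = np.cov(X - mean, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1][:out_dim]             # keep the top components
    P = eigvecs[:, order] / np.sqrt(eigvals[order] + eps)   # whitening projection (d, out_dim)
    return mean, P

def apply_whitening(X, mean, P):
    Xw = (X - mean) @ P
    return Xw / np.linalg.norm(Xw, axis=1, keepdims=True)   # L2-normalize afterwards
```
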
zakajd commented 4 years ago
  1. 6th Place Solution Google Landmark Retrieval Challenge 2018

Our network consists of the convolutional layers of ResNet101 pre-trained on ImageNet, followed by generalized-mean pooling (GeM), l2 normalization, a fully-connected (FC) layer, and a final l2 normalization.

I liked how strongly diffusion boosted their scores on both public / private datasets. Definitely a thing to try in the future.

We use diffusion, a graph-based query expansion technique, to perform retrieval. We initially construct an affinity matrix with reciprocal k-nearest neighbors (k=50), and then initiate diffusion by the 5 Euclidean nearest neighbors of the query descriptor.
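
A rough sketch of the idea, not the exact method from the paper: build a reciprocal-kNN affinity matrix and run a simple random-walk diffusion from the query's nearest neighbours. All names and defaults are mine; the paper's regional diffusion is more elaborate.

```python
import numpy as np

def reciprocal_knn_affinity(X, k=50):
    """Affinity matrix keeping an edge (i, j) only if i and j are in each other's k-NN.
    X: (n, d) L2-normalized descriptors."""
    sims = X @ X.T
    np.fill_diagonal(sims, -np.inf)                      # exclude self-matches
    nn = np.argsort(-sims, axis=1)[:, :k]
    A = np.zeros_like(sims)
    for i in range(X.shape[0]):
        for j in nn[i]:
            if i in nn[j]:                               # keep only reciprocal neighbours
                A[i, j] = max(sims[i, j], 0.0)
    return A

def diffuse(A, seed_ids, alpha=0.85, iters=20):
    """Simple random-walk diffusion started from the query's Euclidean nearest neighbours."""
    S = A / (A.sum(axis=1, keepdims=True) + 1e-12)       # row-normalized transition matrix
    y = np.zeros(A.shape[0])
    y[seed_ids] = 1.0 / len(seed_ids)
    f = y.copy()
    for _ in range(iters):
        f = alpha * S.T @ f + (1 - alpha) * y
    return f                                             # ranking scores for the database
```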

  1. 1st Place Solution Google Landmark Retrieval Challenge 2018

In the MAC architecture, the last convolutional layer of ResNeXt is followed by a max-pooling layer, L2-normalization layer and PCA+Whitening layer.

https://github.com/facebookresearch/faiss: a library for efficient similarity search in large collections of vectors. Maybe not relevant now, but it would definitely help if the number of vectors were much bigger.
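
A minimal faiss usage example with exact inner-product search over L2-normalized descriptors; the data here is a random placeholder.

```python
import faiss
import numpy as np

d = 512
index_descs = np.random.randn(100_000, d).astype("float32")
query_descs = np.random.randn(10, d).astype("float32")
faiss.normalize_L2(index_descs)      # with L2-normalized vectors, inner product = cosine
faiss.normalize_L2(query_descs)

index = faiss.IndexFlatIP(d)         # exact (brute-force) inner-product search
index.add(index_descs)
scores, ids = index.search(query_descs, 100)   # top-100 neighbours per query
```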

  1. Landmark Retrieval via Local Guidance and Global Expansion

The pipeline of our image retrieval system is illustrated in Fig. 1. It consists of five key steps: (1) Deep local feature (DELF) search. (2) Spatial verification (RANSAC) on top-100 results of (1). (3) Deep image retrieval (DIR) search with database-side feature augmentation (DBA). (4) Query expansion with top-5 results of (3) and the results of (2) with inlier > 40. (5) Re-ranking by regional diffusion. In the next sections, we go over the details of the different components of our pipeline, and also explain how they tie together.

zakajd commented 4 years ago

Description of DBA and QE: https://arxiv.org/pdf/1610.07940.pdf (a DBA sketch is given below).
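
My understanding of DBA in a few lines (function name and k are mine): each database descriptor is replaced by the average of itself and its nearest neighbours, then re-normalized.

```python
import numpy as np

def database_augmentation(db, k=10):
    """Database-side feature augmentation (DBA).
    db: (n, d) L2-normalized descriptors."""
    sims = db @ db.T
    nn = np.argsort(-sims, axis=1)[:, :k + 1]            # each row includes the image itself
    augmented = db[nn].mean(axis=1)                      # average over itself + k neighbours
    return augmented / np.linalg.norm(augmented, axis=1, keepdims=True)
```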

zakajd commented 4 years ago

One more iteration of paper readings.

  1. A Metric Learning Reality Check

The trunk model is an ImageNet pretrained BN-Inception network, with output embedding size of 128. BatchNorm parameters are frozen during training, to reduce overfitting.

Optimal parameters for different datasets. The margin is 20-40x bigger than what is used in my experiments.


A batch-normalization layer is vital right before the embedding is L2-normalized and multiplied by the classifier weights. It is indeed mentioned in the original article, but I didn't pay attention at first; ArcFace started working for me only after I added it.

zakajd commented 4 years ago
  1. A Benchmark on Tricks for Large-scale Image Retrieval
zakajd commented 4 years ago
  1. Team JL Solution to Google Landmark Recognition 2019

  2. 2nd Place and 2nd Place Solution to Kaggle Landmark Recognition and Retrieval Competition 2019

18. Large-scale Landmark Retrieval/Recognition under a Noisy and Diverse Dataset

  1. Image Matching Across Wide Baselines: From Paper to Practice
zakajd commented 4 years ago

1st place solution, Google Landmark Recognition 2020

One more important thing to note here is that when calculating cosine similarity between different sets, the similarity metric benefits from similarly scaled vectors. So what we do is that we fit a QuantileTransformer (other scalers work similarly well) on the test set
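
One way this could look with scikit-learn, assuming we rescale per-set similarity scores to a common distribution before combining them; this is my interpretation of the quote, not the authors' code.

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

def rescale(scores, n_quantiles=256):
    """Map a 1-D array of similarity scores onto a common (uniform) distribution."""
    qt = QuantileTransformer(n_quantiles=n_quantiles, output_distribution="uniform")
    return qt.fit_transform(scores.reshape(-1, 1)).ravel()

# Similarities computed against two different sets live on different scales;
# rescale each before they are compared or combined.
sims_set_a = rescale(np.random.rand(10_000))   # placeholder scores
sims_set_b = rescale(np.random.rand(10_000))
```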

SuperPoint + SuperGlue for local re-ranking

3rd place: Generalized mean pooling with frozen p set to 3 was used, together with a bottleneck structure (GEMPool(2048) -> Linear(512) -> BatchNorm1d -> CosFace(81313)) to reduce computation.

PCA learned on the index set to reduce the dimensionality of the ensembled descriptor (512 * 8 -> 1024)


Random paper


According to the above problems, we select SURF [2] and Hessian-Affine [11, 12] RootSIFT [1] as our local feature method. Our local-feature image retrieval system is based on nearest neighbor search. To speed up the nearest neighbor search, we construct an inverted index implemented by k-means clustering with 512 centers; for each point of the query image we select the top-20 clustering centers to search the top-1 clustering center of all point descriptors in the database set.

Label smoothing [19] is used in the training model, and the soft-label parameter is set to 0.1 or 0.2.
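
In PyTorch (>= 1.10) label smoothing is available directly in the cross-entropy loss:

```python
import torch.nn as nn

# 0.1 and 0.2 are the soft-label values quoted above
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
```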


Some kind of wild life-hack that gives a big boost in landmark retrieval thanks to better filtering of junk images.

zakajd commented 4 years ago

Competition is over, here is a long list of papers I didn't have time to read / implement and repos I didn't have time to look through / copy: