yahoo / lopq

Training of Locally Optimized Product Quantization (LOPQ) models for approximate nearest neighbor search of high-dimensional data in Python and Spark.
Apache License 2.0

Help with PCA/Spark docs #6

Closed. rcompton closed this issue 8 years ago.

rcompton commented 8 years ago

In the LOPQ Training section at https://github.com/yahoo/lopq/blob/master/spark/README.md I see:

Here is an example of training a full model from scratch and saving the model parameters as both a pickle file and a protobuf file:

spark-submit train_model.py \
    --data /hdfs/path/to/data \
    --V 16 \
    --M 8 \
    --model_pkl /hdfs/output/path/model.pkl \
    --model_proto /hdfs/output/path/model.lopq

But above that, in the PCA Training section, I see:

A necessary preprocessing step for training is to PCA and variance balance the raw data vectors to produce the LOPQ data vectors, i.e. the vectors that LOPQ will quantize.

It's not clear to me how the outputs of the train_pca.py script are supposed to feed into the train_model.py script. Am I supposed to use the results of train_pca.py to do the variance balancing myself and then feed the balanced vectors into train_model.py, or does "training a full model from scratch" take care of that step for me?

skamalas commented 8 years ago

Hi Ryan, I think the answer to your question is exactly the https://github.com/yahoo/lopq/blob/master/spark/pca_preparation.py script. This script illustrates how to prepare the PCA parameters before using them in the LOPQ pipeline.

The eigenvalue_allocation function does the balancing across as many subspaces as you need; in the script it is set to 2 for the multi-index.
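
For concreteness, here is a minimal sketch of that preparation step, loosely modeled on pca_preparation.py. It assumes eigenvalue_allocation(num_buckets, eigenvalues) returns a permutation of dimension indices (as used elsewhere in this thread) and that it can be imported from lopq.model; the helper name prepare_pca and its parameters are illustrative, not part of the library.

import numpy as np
from lopq.model import eigenvalue_allocation

def prepare_pca(data, D=128, num_buckets=2):
    # Sketch only: estimate PCA parameters from the raw vectors and permute the
    # eigenvector columns so variance is balanced across `num_buckets` subspaces.
    mu = data.mean(axis=0)                                  # mean of the raw vectors
    cov = np.cov(data - mu, rowvar=False)                   # covariance of centered data
    E, P = np.linalg.eigh(cov)                              # eigenvalues, eigenvectors (columns)
    order = np.argsort(E)[::-1][:D]                         # keep the top-D components
    E, P = E[order], P[:, order]
    permuted_inds = eigenvalue_allocation(num_buckets, E)   # balance variance across subspaces
    P = P[:, permuted_inds]                                 # permute eigenvector columns
    return mu, P                                            # inputs to apply_PCA below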

pumpikano commented 8 years ago

I agree that the documentation could be clearer about this.

rcompton commented 8 years ago

To be clear, prior to training and prior to search it's recommended that I apply:

import numpy as np

def apply_PCA(x, mu, P):
    """
    Example of applying PCA: center the vector on the data mean mu and
    project it onto the (permuted) eigenvector matrix P.
    """
    return np.dot(x - mu, P)

to each vector, where the columns of P have already been permuted according to permuted_inds = eigenvalue_allocation(2, E)?
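
As a concrete illustration of that flow (a sketch only; prepare_pca refers to the illustrative helper above, and raw_vectors is a placeholder for the raw data):

# Sketch: apply the permuted PCA to every raw vector before training and search.
mu, P = prepare_pca(raw_vectors)                                  # illustrative helper from above
lopq_vectors = np.array([apply_PCA(x, mu, P) for x in raw_vectors])
# lopq_vectors are the vectors that train_model.py (and query-time search) should see.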

Is this global PCA step described in the paper?

skamalas commented 8 years ago

Yes, you are right!

This is not explained (nor used) in the CVPR paper. Ge et al. mention it in the Optimized Product Quantization PAMI paper. The effect is small for SIFT features, since in these datasets they are generally well balanced in terms of variance to begin with.

If you want to use LOPQ with dimensionality-reduced CNN features, however (which was shown to be good practice, e.g., in the "Neural Codes for Image Retrieval" paper by Babenko et al.), a permutation of dimensions with eigenvalue_allocation gives a big boost in performance.

rcompton commented 8 years ago

Great, thanks!