yahoo / lopq

Training of Locally Optimized Product Quantization (LOPQ) models for approximate nearest neighbor search of high dimensional data in Python and Spark.
Apache License 2.0

python search ignores fine codes #11

Closed pedro-r-marques closed 7 years ago

pedro-r-marques commented 7 years ago

From reading the code and executing it under pdb, it appears that the Python search algorithm (both the dict and LMDB backends) ignores the fine codes and returns results based only on the coarse cells. Can you please confirm whether this is the case?

pumpikano commented 7 years ago

Conceptually, there are two phases of the search algorithm. The first phase retrieves candidate items from the index by looking them up by their coarse codes; this is implemented by LOPQSearcherBase.get_result_quota. The second phase uses the fine codes associated with the candidate index points to compute a distance for each point; this is implemented by LOPQSearcherBase.compute_distances (the algorithmic work is actually done in LOPQModel.get_subquantizer_distances, which is called as a subroutine).
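To see how those two phases are exercised end to end, here is a minimal usage sketch along the lines of the repo's README; the parameter values and random data are purely illustrative, and both phases happen inside searcher.search:

```python
import numpy as np
from lopq import LOPQModel, LOPQSearcher

# Illustrative data: 10k random 128-d vectors and a random query
data = np.random.rand(10000, 128)
query = np.random.rand(128)

# Fit a model (V coarse clusters per split, M subquantizers); values here are illustrative
model = LOPQModel(V=8, M=4)
model.fit(data)

# Index the data: each point is stored under its coarse code together with its fine code
searcher = LOPQSearcher(model)
searcher.add_data(data)

# Phase 1 (get_result_quota): gather at least `quota` candidates from the nearest coarse cells
# Phase 2 (compute_distances): rank those candidates using their fine codes
results = searcher.search(query, quota=100)
```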

Your understanding of the paper is correct: the paper analyzes the case where residual encoders (called "subquantizers" in this repo) are fit to residuals within each cluster. In the paper's formulation, there are 2K clusters total (K clusters for each of the 2 splits of the top-level product quantizer), which yield a total of K**2 cells, and subquantizers are fit to the residuals in each of the 2K clusters independently, resulting in 2KM subquantizers. This repo fits subquantizers to residuals across all cells together, resulting in M subquantizers. Doing this saves a factor of 2K in subquantizer memory at the expense of a slight reduction in recall performance (usually on the order of 1 or 2%). For large models, this space saving can be over 1 GB and is usually worth the performance hit. Table 1 of the paper refers to this variant as "LOR+PQ" for "locally optimized rotations + product quantization" (though that version does not use the multi-index).
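To make the memory trade-off concrete, here is a back-of-the-envelope sketch; the dimensions and cluster counts below are assumptions chosen for illustration, not values taken from the repo:

```python
# Rough comparison of the two subquantizer layouts (all numbers illustrative):
# D-dimensional data, K coarse clusters per split (2K clusters total), M subquantizers
# with 256 centroids each, stored as 4-byte floats.
D, K, M, centroids, bytes_per_float = 128, 256, 8, 256, 4

one_set = M * centroids * (D // M) * bytes_per_float  # one set of M subquantizer codebooks

local_sets = 2 * K * one_set   # paper: a separate set fit to each of the 2K clusters
global_set = one_set           # this repo: a single set shared across all cells

print(f"paper (local):  {local_sets / 2**20:.1f} MiB")
print(f"repo  (global): {global_set / 2**20:.3f} MiB")
# The ratio is exactly 2K; with larger D, K, or codebooks the absolute saving
# grows into the GB range mentioned above.
```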

Note that a rotation matrix is fit to the data in each cluster in both cases, resulting in 2K rotations. The only change needed in the paper's figure to reflect the global residual subquantizers would be that the grid imposed on each cluster would be identical up to translation and rotation.
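As a rough sketch of what that means at encoding time (the function and variable names here are hypothetical, not this repo's internals): each cluster applies its own learned rotation to the residual, but the rotated residual is then coded against the single shared set of subquantizers, so every cluster sees the same grid in its own rotated, translated frame.

```python
import numpy as np

def encode_residual(x, centroid_i, rotation_i, subquantizers):
    """Fine-code a point x assigned to coarse cluster i (illustrative, not repo internals)."""
    # Locally optimized rotation: move the residual into the cluster's local frame
    residual = rotation_i @ (x - centroid_i)
    # Shared subquantizers: the same codebooks encode rotated residuals from every cluster
    M = len(subquantizers)
    return [int(np.argmin(np.sum((sq - sv) ** 2, axis=1)))
            for sq, sv in zip(subquantizers, np.split(residual, M))]
```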

pedro-r-marques commented 7 years ago

Thank you for your explanation. I read the original PQ paper, and it became clearer that the fine codes are used to avoid the cost of computing the exact Euclidean distance at rerank time (and of storing the full vectors in memory).
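For anyone else landing here, that rerank step can be sketched with NumPy as an asymmetric distance computation; the names and shapes below are assumptions for illustration, not this repo's internals. The query residual is split into M subvectors, a small distance table against each subquantizer's centroids is precomputed, and a candidate's approximate distance is then just M table lookups indexed by its fine code, so neither the full vector nor an exact Euclidean distance is needed.

```python
import numpy as np

def adc_distances(residual, subquantizers, fine_codes):
    """Approximate distances from a query residual to candidates known only by fine codes.

    residual:      (D,) query residual within the coarse cell
    subquantizers: list of M arrays, each (C, D // M) of centroids (hypothetical layout)
    fine_codes:    (N, M) int array, one centroid index per subquantizer per candidate
    """
    M = len(subquantizers)
    subvectors = np.split(residual, M)

    # Precompute per-subquantizer tables: tables[m][c] = ||residual_m - centroid_c||^2
    tables = [np.sum((sq - sv) ** 2, axis=1) for sq, sv in zip(subquantizers, subvectors)]

    # A candidate's approximate squared distance is the sum of M table lookups via its code
    return sum(tables[m][fine_codes[:, m]] for m in range(M))

# Toy example with assumed sizes: D=32, M=4 subquantizers of 256 centroids, 1000 candidates
rng = np.random.default_rng(0)
D, M, C, N = 32, 4, 256, 1000
subquantizers = [rng.standard_normal((C, D // M)) for _ in range(M)]
fine_codes = rng.integers(0, C, size=(N, M))
residual = rng.standard_normal(D)

ranked = np.argsort(adc_distances(residual, subquantizers, fine_codes))  # rerank candidates
```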