yahoo / lopq

Training of Locally Optimized Product Quantization (LOPQ) models for approximate nearest neighbor search of high dimensional data in Python and Spark.
Apache License 2.0

Got wrong result by using simple python code in sift1m dataset #18

Open FartFang opened 6 years ago

FartFang commented 6 years ago

I rewrote example.py, replacing the input dataset with 'sift1m', and I got a result that seems like a wrong evaluation:

Recall (V=16, M=8, subquants=256): [0.2018 0.4247 0.5168 0.5218]
Recall (V=16, M=16, subquants=256): [0.3124 0.5057 0.5218 0.5218]
Recall (V=16, M=8, subquants=512): [0.2219 0.4477 0.5198 0.5218]

And I also got an error when I tried to use the GIST1M dataset:

Traceback (most recent call last):
  File "/usr/lib/python2.7/multiprocessing/queues.py", line 266, in _feed
    send(obj)
SystemError: NULL result without error in PyObject_Call

Is the code in the python folder not intended for large datasets like SIFT? Or is there some mistake in my data import process?

Looking forward to your reply. Thanks a lot!
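(For reference, recall curves like the ones above are typically computed as recall@R: the fraction of queries whose true nearest neighbor appears among the first R returned ids. A minimal, self-contained sketch in plain Python — not the repo's own evaluation code:)

```python
def recall_at_r(ranked_ids, true_nn_ids, rs=(1, 10, 100, 1000)):
    """recall@R: fraction of queries whose true nearest neighbor
    appears among the first R returned ids, for each cutoff R."""
    recalls = []
    for r in rs:
        hits = sum(1 for ranked, true in zip(ranked_ids, true_nn_ids)
                   if true in ranked[:r])
        recalls.append(hits / float(len(true_nn_ids)))
    return recalls

# Toy example: 2 queries; true neighbors rank 1st and 2nd respectively.
print(recall_at_r([[5, 2, 9], [7, 3, 0]], [5, 3], rs=(1, 2, 3)))
# -> [0.5, 1.0, 1.0]
```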

pumpikano commented 6 years ago

Regarding the sift1m dataset, I can't tell if anything is wrong from those numbers, but I think it is probably simply that the quantization is too coarse for a dataset with the size/complexity of sift1m. From my notes, I found that my recall was around [~0.40 ~0.85 ~0.98 ~0.98] with V=1024, M=8, subquants=256. Fitting the quantizers in this case might take a while (an hour or two) on a single machine.
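(To make the coarseness concrete: LOPQ, like the inverted multi-index it builds on, splits each vector in half and quantizes each half with V centroids, so the dataset is spread over V * V product cells. A quick back-of-the-envelope check, assuming that two-way split:)

```python
# Average points per coarse cell for a 1M-vector dataset under a
# two-way product coarse quantizer with V centroids per half.
N = 1000000
for V in (16, 1024):
    print("V=%d: %.2f points/cell" % (V, N / float(V * V)))
# V=16 leaves ~3906 points per cell (very coarse),
# while V=1024 leaves roughly one point per cell.
```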

I have not tried gist1m, but it could be a memory issue (cf. https://bugs.python.org/issue17560). I can't tell from the info you shared where in the program this is happening, though. If it is during index building, an easy thing to try is to parallelize index building further by increasing num_procs: https://github.com/yahoo/lopq/blob/master/python/lopq/search.py#L85

In any case, the code in python/ assumes that the full dataset fits in memory. This assumption would need to be changed if that is not the case for you (or try the Spark code instead).
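(If the dataset does not fit in memory, one workaround is to stream it in chunks via a memory map. A rough sketch for the .fvecs format that sift1m/gist1m ship in, where each record is an int32 dimension followed by that many float32 values; `iterate_fvecs_chunks` is a hypothetical helper, not part of lopq:)

```python
import os
import tempfile

import numpy as np

def iterate_fvecs_chunks(path, chunk_vectors=100000):
    """Yield successive (chunk_vectors, dim) blocks of an .fvecs file
    without loading the whole file into memory."""
    raw = np.memmap(path, dtype=np.float32, mode='r')
    dim = int(raw[:1].view(np.int32)[0])   # per-record int32 header
    rec = dim + 1                          # header + dim float32 values
    n = raw.shape[0] // rec
    for start in range(0, n, chunk_vectors):
        block = raw[start * rec:(start + chunk_vectors) * rec]
        yield np.asarray(block).reshape(-1, rec)[:, 1:]

# Tiny round-trip demo: write 5 three-dimensional vectors, stream them back.
vecs = np.arange(15, dtype=np.float32).reshape(5, 3)
header = np.full((5, 1), 3, dtype=np.int32).view(np.float32)
path = os.path.join(tempfile.mkdtemp(), 'demo.fvecs')
np.hstack([header, vecs]).tofile(path)
chunks = list(iterate_fvecs_chunks(path, chunk_vectors=2))
print([c.shape for c in chunks])  # -> [(2, 3), (2, 3), (1, 3)]
```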

FartFang commented 6 years ago

Thanks a lot! I changed V=16 to V=1024, and the result seems more correct than last time. BTW, where is the API for changing the value of w according to the paper? And what is the default value of w in your implementation? @pumpikano