yahoojapan / NGT

Nearest Neighbor Search with Neighborhood Graph and Tree for High-dimensional Data
Apache License 2.0
1.24k stars 114 forks source link

Sparse Jaccard via python interface #109

Open lmcinnes opened 2 years ago

lmcinnes commented 2 years ago

I have been endeavoring to get more ANN libraries working with sparse Jaccard data in ann-benchmarks. There are surprisingly few libraries that support this. I was pleased to see that NGT does have support for sparse Jaccard, however (as far as I can tell) this is only accessible via the C++ API, and not via python (which ann-benchmarks would require).

Is it possible to add sparse Jaccard support to the python interface? If it is not too challenging I could attempt to provide a PR myself, but that would likely take some time, as I have little familiarity with the code.

masajiro commented 2 years ago

Sorry for the late response. Since NGT python bindings (ngtpy) support Jaccard, I think that you only have to update the ann-benchmarks ngt file. If the data format ann-benchmarks required is different from that of ngtpy, please let me know. I will help you.

lmcinnes commented 2 years ago

Thanks for the reply. My concern was with the input data format for the Jaccard benchmark in ann-benchmarks. It is a (74962 x 27983) sparse matrix. This can be formatted as a boolean numpy array of shape (74962, 27983) which I believe can be fed to NGT's ngtpy jaccard. Given the very high dimensionality this will be slower than it could be. I will try this none the less and get back to you with preliminary results.

It seemed that the SparseJaccard comparator can act directly on the lists of indices of non-zero elements per row -- essentially directly on the sparse data format -- which will be much faster (other algorithms that support the Jaccard benchmark dataset in ann-benchmarks essentially do this). If possible I would prefer to be able to use this approach, as it would give a much fairer comparison in ann-benchmarks.

masajiro commented 2 years ago

Although I surely implemented Jaccard into NGT, I had almost forgotten that. There are two types of Jaccard for dense and sparse on NGT. However, as you mentioned, it seems that sparse Jaccard is not available on ngtpy. I will try to implement sparse jaccard into ngtpy if it is easy.

lmcinnes commented 2 years ago

Thanks. My initial efforts with the dense matrix are mostly just timing out on the runs, so I fear it is too expensive. Hopefully the sparse jaccard version is not too challenging to make available in ngtpy. It would certainly be useful for many people, but it is definitely a less used feature, so if it is too difficult to implement, please don't invest time in it. Thanks again for all your work.