yahoojapan / NGT

Nearest Neighbor Search with Neighborhood Graph and Tree for High-dimensional Data
Apache License 2.0

Question on dataset characteristics #20

Closed kclarke1222 closed 5 years ago

kclarke1222 commented 5 years ago

I was curious whether these algorithms would work with binary data. I didn't see anything about that in the papers on the algorithms.

masajiro commented 5 years ago

NGT also works for binary data with the Hamming distance, although I have not evaluated it with that distance. To use the Hamming distance for binary data, specify 1-byte unsigned integer as the object type and Hamming distance as the distance function in the NGT create command. You should also probably set the search range coefficient larger than its default when using the Hamming distance.
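For illustration, here is a minimal sketch of what "binary data stored as 1-byte unsigned integers" can look like, and of the Hamming distance over such data. It is plain Python, not NGT's code: the helper names `pack_bits` and `hamming_distance` are hypothetical, and the bit order within each byte (most significant bit first) is an assumption chosen for the example.

```python
def pack_bits(bits):
    """Pack a sequence of 0/1 values into bytes, 8 bits per byte.

    Assumption for this sketch: the first bit of each group of 8
    becomes the most significant bit of the byte.
    """
    if len(bits) % 8 != 0:
        raise ValueError("bit length must be a multiple of 8")
    out = bytearray()
    for i in range(0, len(bits), 8):
        byte = 0
        for bit in bits[i:i + 8]:
            byte = (byte << 1) | (1 if bit else 0)
        out.append(byte)
    return bytes(out)


def hamming_distance(a, b):
    """Count the bit positions at which two equal-length byte strings differ."""
    if len(a) != len(b):
        raise ValueError("objects must have the same number of bytes")
    # XOR leaves a 1 exactly where the two bytes disagree.
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


# Two 16-bit fingerprints, each stored as 2 bytes.
a = pack_bits([1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
b = pack_bits([1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
print(hamming_distance(a, b))  # 2
```

With this layout, a 16-bit fingerprint occupies 2 bytes, so the number of stored values per object is the bit length divided by 8.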

kclarke1222 commented 5 years ago

How about the Jaccard distance? We are operating on binary data, and it is the more typical distance measure in chemistry applications.

masajiro commented 5 years ago

Although I have not used the Jaccard distance, I think NGT can work with it as well. However, if the dimensionality of your data is large, NGT cannot handle it effectively, because NGT cannot handle sparse data. In any case, the Jaccard distance is not implemented at this point, so you would have to implement it yourself if you need it in a hurry.
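As a rough sketch of what such an implementation computes, the Jaccard distance between two binary vectors is 1 minus the number of bits set in both divided by the number of bits set in either. The Python below is only an illustration over byte-packed data and is not NGT's implementation. The function name `jaccard_distance` is hypothetical, and returning 0.0 when neither object has any bit set is a convention assumed here.

```python
def popcount(byte):
    """Number of 1 bits in a single byte value (0..255)."""
    return bin(byte).count("1")


def jaccard_distance(a, b):
    """Jaccard distance between two equal-length byte-packed bit vectors."""
    if len(a) != len(b):
        raise ValueError("objects must have the same number of bytes")
    # AND keeps bits set in both objects, OR keeps bits set in either.
    intersection = sum(popcount(x & y) for x, y in zip(a, b))
    union = sum(popcount(x | y) for x, y in zip(a, b))
    if union == 0:
        # Assumed convention: two all-zero objects are treated as identical.
        return 0.0
    return 1.0 - intersection / union


# Two 16-bit fingerprints stored as 2 bytes each.
a = bytes([0b10110000, 0b00001111])
b = bytes([0b11100000, 0b00001111])
print(jaccard_distance(a, b))  # 0.25 (6 shared bits out of 8 set in either)
```

Unlike the Hamming distance, this measure ignores positions where both bits are 0, which is why it is often preferred for fingerprints in which most bits are unset.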

masajiro commented 5 years ago

I added the Jaccard distance.