yahoojapan / NGT

Nearest Neighbor Search with Neighborhood Graph and Tree for High-dimensional Data
Apache License 2.0
1.24k stars 114 forks source link

Jaccard distance generalization #24

Closed chunjiangzhu closed 5 years ago

chunjiangzhu commented 5 years ago

Hi, thanks again for the prompt responses to the previous question #23 .

We are trying to generalize your code to Jaccard distance and test it under the ann benchmarking. We assume that the input is the same as the input of hamming distance, but the distance function is changed to Jaccard. E.g., for two bit vectors "A=10111" and "B=10011", their hamming distance is 1 but jaccard distance is 1-popcount(A&B)|/popcount(A|B)=1-3/4=0.25.

What we did are that for every code including hamming, generating corresponding code for jaccard. For example in our repository, https://github.com/chunjiangzhu/ngt/blob/22a99c1eeb13590bae707afbe006247e80b32f5e/lib/NGT/PrimitiveComparator.h#L287-L303

https://github.com/chunjiangzhu/ngt/blob/22a99c1eeb13590bae707afbe006247e80b32f5e/lib/NGT/ObjectSpaceRepository.h#L97-L114

https://github.com/chunjiangzhu/ngt/blob/22a99c1eeb13590bae707afbe006247e80b32f5e/lib/NGT/Index.h#L113

https://github.com/chunjiangzhu/ngt/blob/22a99c1eeb13590bae707afbe006247e80b32f5e/lib/NGT/Command.cpp#L155-L157

https://github.com/chunjiangzhu/ngt/blob/22a99c1eeb13590bae707afbe006247e80b32f5e/lib/NGT/ObjectSpace.h#L162

https://github.com/chunjiangzhu/ngt/blob/22a99c1eeb13590bae707afbe006247e80b32f5e/python/src/ngtpy.cpp#L67-L68

https://github.com/chunjiangzhu/ngt/blob/22a99c1eeb13590bae707afbe006247e80b32f5e/python/ngt/base.py#L131

https://github.com/chunjiangzhu/ngt/blob/22a99c1eeb13590bae707afbe006247e80b32f5e/python/ngt/base.py#L254-L256

Our input data is an N*1024 numpy array of data type int (or bool). We have tested the code using parameters epsilon=[0.0,0.1], edge_size=[100, 200, 300, 500, 1000], outdegree=[10, 30, 50, 70, 100], indegree=[10, 30, 50, 70, 120], query epsilon=[0.1, 0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4, 1.6, 1.8, 2.0] and object_type=Byte. But the resulting recall are consistently lower than 20%. We believe that there are something wrong. We noticed the 16 boundary you mentioned in #21 . Since our data dimension is always a power of 16, e.g. 1024, the error should not be here.

Could you please give suggestions on the generalization? We hope that a successful generalization may be helpful to support more distance metrics, before you make a custom distance function. If you need more information, e.g. a dataset, please let me know. Thank you so much!

masajiro commented 5 years ago

Could you replace the search() function below with linearSearch() which does not use the index to search. https://github.com/chunjiangzhu/ngt/blob/22a99c1eeb13590bae707afbe006247e80b32f5e/python/src/ngtpy.cpp#L149 Then, if the ann benchmarks' recalls always show 1.0, the implementation of the distance must be almost correct. However, I am not sure that the distance satisfies conditions of metric space especially triangle inequality. If not, you cannot get high recalls with the distance, because NGT and many of the other methods are based on metric space.

chunjiangzhu commented 5 years ago

Thanks so much for the reply! I have applied the change from search() to linearSearch() as in https://github.com/chunjiangzhu/ngt/blob/44fdeaaa794b9d2dc62ad69e60af898ae2d0b4ef/python/src/ngtpy.cpp#L149

But got the following error:

File "run_algorithm.py", line 3, in run_from_cmdline() File "/home/app/ann_benchmarks/runner.py", line 190, in run_from_cmdline run(definition, args.dataset, args.count, args.runs, args.batch) File "/home/app/ann_benchmarks/runner.py", line 134, in run distance, count, run_count, batch) File "/home/app/ann_benchmarks/runner.py", line 71, in run_individual_query results = [single_query(x) for x in X_test] File "/home/app/ann_benchmarks/runner.py", line 71, in results = [single_query(x) for x in X_test] File "/home/app/ann_benchmarks/runner.py", line 37, in single_query candidates = algo.query(v, count) File "/home/app/ann_benchmarks/algorithms/onng_ngt.py", line 90, in query results = self.index.search(v, n, self._epsilon, self._edge_size_for_search, with_distance=False) RuntimeError: /usr/local/include/NGT/Common.h:1815: Inner error: results is not set

I understand your concern, but Jaccard distance is a metric. Its triangle inequality property was proved in https://www.sciencedirect.com/science/article/pii/S0167865518309188 :)

masajiro commented 5 years ago

Thank you for your information. I completely forgot that the specification of linearSearch() is different from that of search(). Could you update ngtpy.cpp as below. I have tested this source code.

--- a/python/src/ngtpy.cpp
+++ b/python/src/ngtpy.cpp
@@ -143,30 +143,21 @@ public:
     sc.setSize(size);                          // the number of resultant objects.
     sc.setEpsilon(epsilon);                    // set exploration coefficient.
     sc.setEdgeSize(edgeSize);                  // if maxEdge is minus, the specified value in advance is used.
+    NGT::ObjectDistances objects;
+    sc.setResults(&objects);

-    NGT::Index::search(sc);
+    NGT::Index::linearSearch(sc);

     numOfDistanceComputations += sc.distanceComputationCount;

     NGT::Index::deleteObject(ngtquery);
     if (!withDistance) {
-      NGT::ResultPriorityQueue &r = sc.getWorkingResult();
-      py::array_t<int> ids(r.size());
+      py::array_t<int> ids(objects.size());
       py::buffer_info idsinfo = ids.request();
-      int *endptr = reinterpret_cast<int*>(idsinfo.ptr); 
-      int *ptr = endptr + (r.size() - 1);
-      if (zeroNumbering) {
-        while (ptr >= endptr) {
-         *ptr-- = r.top().id - 1;
-         r.pop();
-        }
-      } else {
-        while (ptr >= endptr) {
-         *ptr-- = r.top().id;
-         r.pop();
-        }
+      int *ptr = reinterpret_cast<int*>(idsinfo.ptr); 
+      for (size_t oidx = 0; oidx < objects.size(); ++oidx) {
+        ptr[oidx] = objects[oidx].id - 1;
       }
-
       return ids;
     }
     py::list results;
chunjiangzhu commented 5 years ago

Yes, the distances using linearScan() are all correct. Thank you for the efforts!

486: ONNG-NGT(500, 30, 10, -2, 1.000) 1.000 334.322 487: ONNG-NGT(500, 30, 10, -2, 0.800) 1.000 343.043 488: ONNG-NGT(500, 30, 10, -2, 0.100) 1.000 331.162 489: ONNG-NGT(500, 30, 10, -2, 0.600) 1.000 339.045 490: ONNG-NGT(500, 30, 10, -2, 1.800) 1.000 344.334 491: ONNG-NGT(500, 30, 10, -2, 2.000) 1.000 337.646 492: ONNG-NGT(500, 30, 10, -2, 0.400) 1.000 335.073 493: ONNG-NGT(500, 30, 10, -2, 0.200) 1.000 336.272 494: ONNG-NGT(500, 30, 10, -2, 1.200) 1.000 348.053 495: ONNG-NGT(500, 30, 10, -2, 1.400) 1.000 333.685 496: ONNG-NGT(500, 30, 10, -2, 1.600) 1.000 333.825

But when I turned it back to scan(), the recall dropped down again... :(

486: ONNG-NGT(300, 30, 30, -2, 1.000) 0.027 7200.361 487: ONNG-NGT(300, 30, 30, -2, 1.200) 0.028 6220.843 488: ONNG-NGT(300, 30, 30, -2, 0.100) 0.026 43036.158

masajiro commented 5 years ago

I assume that the dimensionality of your dataset is 1024 * 8. It means that your data space is so sparse that NGT might not work well for the space. Anyway, you have to increase the epsilon for NGT construction. For example, when you use 1.8 as the epsilon for search, you might want to use 0.8(=1.8-1.0) for construction instead of [0.0, 0.1]. The epsilon only for search is added 1.0, because the ann benchmarks cannot accept minus value. Inverted index-based methods might be better for such sparse dataset.

chunjiangzhu commented 5 years ago

Thank you for the reply and suggestion! Well, in my opinion, the problem should not lie in the data sparsity. The first reason is that the SIFT data has dimension only 128. Either 128*32 or 128*64 are not greater than 1024*8 and ONNG performs not bad for it. The second reason is that for our own datasets, when I changed the distance metric to Euclidean, the recall performance looks "normal" ranging from 20%-95%. I believe that there are some bugs in my generalizations. Will let you know once I find any problems. Have a nice weekend.

chunjiangzhu commented 5 years ago

I have found bugs in Graph.h and Graph.cpp in below, where I should put corresponding codes for Jaccard. It seems that the comparator has not been invoked before. Now it works good. It was a very nice talk with you. Thanks so much for the helpful suggestions!

https://github.com/chunjiangzhu/ngt/blob/846e55f72e7ce26472bbad11b7d04e92b8645a09/lib/NGT/Graph.h#L279

https://github.com/chunjiangzhu/ngt/blob/846e55f72e7ce26472bbad11b7d04e92b8645a09/lib/NGT/Graph.h#L293

masajiro commented 5 years ago

The number of dimensions of your data with your jaccard distance is 1024 8, because the jaccard distance is based on 1 bit for each dimension. Although the sift's data length is 128 32, the number of dimensions of the sift with euclidean distance is just 128, because the euclidean distance is based on 1 single float variable (32 bits) for each dimension. Therefore your data with your jaccard distance is supposed to be sparser than the sift with the euclidean distance.

masajiro commented 5 years ago

@chunjiangzhu I added jaccard distance. Your code is quite helpful to implement it!