yahoojapan / NGT

Nearest Neighbor Search with Neighborhood Graph and Tree for High-dimensional Data
Apache License 2.0
1.22k stars 112 forks source link

"Byte" object type #23

Closed chunjiangzhu closed 5 years ago

chunjiangzhu commented 5 years ago

Hi, is the object type "Byte" well developed and fully tested in the current code? By the way, I assume the algorithms work for custom distance function, am I right? Thank you!

chunjiangzhu commented 5 years ago

By the way, I often got the following error in my run: "Error! startEpsilon should be reduced for the specified range.". Can you see any obvious implications from it? Thank you so much!

"got a train set of size (941 * 1024) got 100 queries ngt::reconstructGraph: Extract the graph data. Reconstruction time=0.00072567:0.00114803:0.00380066 original edge size=10 reverse edge size=60 ngt::Graph reconstruction time=0.00719446 (sec) GraphReconstructor::adjustPaths: graph preparing time=0.000302463 (sec) GraphReconstructor::adjustPaths extracting removed edge candidates time=0.152378 (sec) ngt::Path adjustment time=0.182657 (sec) adjustBaseSearchEdgeSize::Extract queries for GT... adjustBaseSearchEdgeSize::create GT... adjustBaseSearchEdgeSize::explore for the mergin 0.2... adjustBaseSearchEdgeSize::Base edge size=10, distance computation=1.48209 Warning: Cannot adjust the base edge size for mergin 0.2. /home/app/ngt/lib/NGT/Optimizer.h:427: exploreEpsilonForAccuracy: Error! startEpsilon should be reduced for the specified range. Try again for the next mergin. adjustBaseSearchEdgeSize::explore for the mergin 0.25... adjustBaseSearchEdgeSize::Base edge size=10, distance computation=1.48209 Warning: Cannot adjust the base edge size for mergin 0.25. /home/app/ngt/lib/NGT/Optimizer.h:427: exploreEpsilonForAccuracy: Error! startEpsilon should be reduced for the specified range. Try again for the next mergin. adjustBaseSearchEdgeSize::explore for the mergin 0.3... adjustBaseSearchEdgeSize::Base edge size=10, distance computation=1.48209 Warning: Cannot adjust the base edge size for mergin 0.3. /home/app/ngt/lib/NGT/Optimizer.h:427: exploreEpsilonForAccuracy: Error! startEpsilon should be reduced for the specified range. Try again for the next mergin."

masajiro commented 5 years ago

Since the “Byte” type is used for our applications, it basically works well. Note that the “Byte” is implemented as an unsigned variable. If you properly implement your distance function, the “Byte” works. According to your logs, you use NGT on ann benchmarks, don’t you? If so, the size of dataset (941 objects) you used is too small, because NGT on ann benchmarks try to optimize search parameters for the specified dataset. However, even though you find errors during the optimization, default search parameters are used for the benchmarks.

chunjiangzhu commented 5 years ago

Thank you so much for the response! You are right, I am running the ann benchmarking. I noticed that you updated the python driver "onng_ngt.py" under that project. [https://github.com/erikbern/ann-benchmarks/blob/master/ann_benchmarks/algorithms/onng_ngt.py] For the fit function, could you please give some descriptions to help me understand the logic? Thank you!

masajiro commented 5 years ago

For the fit function, could you please give some descriptions to help me understand the logic? Thank you!

The process is the same of the following. https://github.com/yahoojapan/NGT/blob/master/bin/ngt/README.md#onng First, create ANNG. Second, convert the ANNG to ONNG. The conversion consists of

Does that answer your question?

chunjiangzhu commented 5 years ago

Yes, that makes sense!

What confused me was the log information, e.g. print('ONNG: create ANNG'), print('ONNG: ANNG construction time(sec)=' + str(time.time() - t)). In some runs, the log of the subprocess seems to dominate the stdout, and the log I mentioned above does not appear. Maybe this is because of the subprocess calling. But I believe the code were executed successfully because the self.index was successfully set.

I think we are approaching to close this issue. Thanks!

masajiro commented 5 years ago

There are messages of the onng_ngt.py in our log below, although they are a little out of order. The difference might be due to the docker or os environment.

Trying to instantiate ann_benchmarks.algorithms.onng_ngt.ONNG(['euclidean', 'Float', 0.1, {'edge': 100, 'outdegree': 10, 'indegree': 120}])
ONNG: edge_size=100
ONNG: outdegree=10
ONNG: indegree=120
ONNG: edge_size_for_search=-2
ONNG: epsilon=0.1
ONNG: metric=euclidean
ONNG: object_type=Float
got a train set of size (1000000 * 128)
got 10000 queries
Processed 100000 objects. time= 199.102 (sec)
Processed 200000 objects. time= 449.442 (sec)
Processed 300000 objects. time= 538.9 (sec)
Processed 400000 objects. time= 613.441 (sec)
Processed 500000 objects. time= 627.649 (sec)
Processed 600000 objects. time= 708.047 (sec)
Processed 700000 objects. time= 696.557 (sec)
Processed 800000 objects. time= 779.332 (sec)
Processed 900000 objects. time= 748.134 (sec)
Processed 1000000 objects. time= 837.672 (sec)
ngt::reconstructGraph: Extract the graph data.
Processed 1000000 objects.
Processed 100000 nodes
Processed 200000 nodes
Processed 300000 nodes
Processed 400000 nodes
Processed 500000 nodes
Processed 600000 nodes
Processed 700000 nodes
Processed 800000 nodes
Processed 900000 nodes
Processed 1000000 nodes
Reconstruction time=0.739276:18.1838:7.54052
original edge size=10
reverse edge size=120
ngt::Graph reconstruction time=27.9895 (sec)
GraphReconstructor::adjustPaths: graph preparing time=0.992566 (sec)
GraphReconstructor::adjustPaths extracting removed edge candidates time=719.449 (sec)
ngt::Path adjustment time=905.001 (sec)
adjustBaseSearchEdgeSize::Extract queries for GT...
adjustBaseSearchEdgeSize::create GT...
adjustBaseSearchEdgeSize::explore for the mergin 0.2...
adjustBaseSearchEdgeSize::Base edge size=10, distance computation=2.25153
adjustBaseSearchEdgeSize::Base edge size=20, distance computation=2.2905
Reconstruct Graph: adjust the base search edge size. 10
ngtpy::Index read only!
ONNG: start indexing...
ONNG: # of data=1000000
ONNG: dimensionality=128
ONNG: index=indexes/ONNG-100-10-120
ONNG: create ANNG
ONNG: ANNG construction time(sec)=6211.593377828598
ONNG: degree adjustment
ONNG: degree adjustment time(sec)=951.6256878376007
ONNG: index already exists! indexes/ONNG-100-10-120
ONNG: open time(sec)=5.4101881980896
ONNG: end of fit
Built index in 7168.630221128464
Index size:  2625880.0
Running query argument group 1 of 8...
ONNG: epsilon=0.6
Run 1/1...