yahoojapan / NGT

Nearest Neighbor Search with Neighborhood Graph and Tree for High-dimensional Data
Apache License 2.0
1.22k stars · 112 forks

Optimizer: "Cannot optimize the number of edges" #113

Closed · fonspa closed this 2 years ago

fonspa commented 2 years ago

Hi, I'm trying to use the Optimizer class in the Python bindings with a version of NGT I built from source (v1.14.1), but I'm hitting what looks like an off-by-one error.

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
/tmp/ipykernel_28461/550346330.py in <module>
      1 optimizer = ngtpy.Optimizer(log_disabled = True)
----> 2 optimizer.optimize_number_of_edges_for_anng(os.path.join(export_path, "ngt-opt"))

RuntimeError: /usr/local/include/NGT/GraphOptimizer.h:505: Optimizer::optimizeNumberOfEdgesForANNG: Cannot optimize the number of edges. 0:0.937933 # of objects=5676310

The dataset I use has 5676309 data points, and the GraphOptimizer seems to want to work on point 5676310.

Minimal code to reproduce (export_path is a working directory and xb is the dataset, 5,676,309 vectors of dimension 128):

import os
import ngtpy

# Create a Float16 index and insert the whole dataset.
ngtpy.create(os.path.join(export_path, "opt"), 128, distance_type='L2', object_type='Float16')
index = ngtpy.Index(path=os.path.join(export_path, "opt"), read_only=False, zero_based_numbering=True, log_disabled=False)

for vec in xb:
    index.insert(vec)
index.save()
index.close()

# Optimizing the edge count raises the RuntimeError above.
optimizer = ngtpy.Optimizer(log_disabled=True)
optimizer.optimize_number_of_edges_for_anng(os.path.join(export_path, "opt"))

Am I doing something wrong here?
Thank you!

masajiro commented 2 years ago

The message mistakenly prints the number of inserted objects + 1, so I don't think this is an off-by-one error. Do you get the same error when you specify 'Float' as the object type?

fonspa commented 2 years ago

Good call: with 'Float' there is no error, and the optimization process appears to run fine.

I also hit a segmentation fault during optimize_search_parameters on the same index; maybe it's related? I'm rebuilding the index right now and will let you know whether the search parameter optimization works in fp32.
Update: the search parameter optimization succeeds in fp32, is much faster, and does not segfault.

So there may be a problem with fp16 and the Optimizer?
Regarding fp16: in your experience, is there a tangible performance benefit to working in fp16 instead of fp32? I have not yet come across a dataset where fp16 was faster than fp32.

Thank you very much for your help and your time!

masajiro commented 2 years ago

Thanks to your help, I found bugs in the optimizer related to fp16. I am going to release a new version with fixes soon.

The current fp16 support mainly reduces the memory footprint. If you build NGT on a machine with AVX, the search time is also shortened slightly. However, if you use the ngt Python package from PyPI, the search time might actually increase, because AVX is disabled for the Python packages.
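The memory-footprint point can be illustrated outside of NGT with a short NumPy sketch (NumPy and the dataset shape here are assumptions for illustration, not part of the repro; NGT's internal storage layout differs, but the raw per-vector size scales the same way, two bytes per dimension instead of four):

```python
import numpy as np

# Hypothetical dataset: 100,000 vectors of dimension 128,
# matching the dimensionality used in the repro above.
n_vectors, dim = 100_000, 128

xb32 = np.random.rand(n_vectors, dim).astype(np.float32)
xb16 = xb32.astype(np.float16)  # the precision NGT's 'Float16' object type stores

print(xb32.nbytes)                  # fp32: n_vectors * dim * 4 bytes
print(xb16.nbytes)                  # fp16: n_vectors * dim * 2 bytes
print(xb32.nbytes // xb16.nbytes)   # → 2
```

Halving the object storage is the main win; whether distance computations get faster depends on whether the build can use AVX, as noted above.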

fonspa commented 2 years ago

Hi, that's great, thank you for working on a fix! I will test again once it's released.

I built the NGT libraries from source (on a CPU with AVX2) and built the Python bindings from the python/ directory following the wiki, so I believe I get the benefit of the AVX2 extensions even when building an index from Python code.

masajiro commented 2 years ago

I have just made a new release, v1.14.2, with bug fixes for the optimizer related to FP16.

masajiro commented 2 years ago

I am closing this issue for now. Feel free to reopen it whenever you find any further issue.