yahoojapan / NGT

Nearest Neighbor Search with Neighborhood Graph and Tree for High-dimensional Data
Apache License 2.0
1.26k stars 115 forks source link

Remove vector causes error, only for "Normalised" distance types: Index.h:remove:1544: Not found the specified id #152

Closed cjrh closed 1 year ago

cjrh commented 1 year ago

Summary

When calling index.remove(<id>), an error occurs, but only with the following distance types:

I first reported this bug on the ngt-rs project, but following comments by @lerouxrgd I decided to make a reproducer here. Note that since the same errors occur on that project, it is impossible that the error is in the Python wrapper.

Reproducer

Here is a small file to reproduce the error. To setup the venv and install deps:

$ python3.10 -m venv venv
$ source venv/bin/activate
(venv) $ pip install -U pip wheel ngt

These are the versions I have installed in my venv:

$ pip list
Package    Version
---------- -------
ngt        2.1.3
numpy      1.26.1
pip        23.3.1
pybind11   2.11.1
setuptools 59.6.0
wheel      0.41.3

To run the file below, simply:

(venv) $ python main.py

Here is the file below. The important part is the function run_test().

import ngtpy
import traceback

def run_test(distance_type: str):
    # Create the index
    ngtpy.create(b"tmp", 3, distance_type=distance_type)
    index = ngtpy.Index(b"tmp", zero_based_numbering=False)

    # Insert vectors
    id1 = index.insert([1.0, 2.0, 3.0])
    id2 = index.insert([40.0, 5.0, 6.0])
    index.build_index(2)
    index.save()

    # Search once
    query = [1.01, 1.02, 1.03]
    results = index.search(query, 1)
    assert results[0][0] == id1

    # Remove a vector - PROBLEM
    index.remove(id1)

    # Search again
    results = index.search(query, 1)
    assert results[0][0] == id2

distance_types = [
    "L1",
    "L2",
    "Normalized L2",
    "Angle",
    "Normalized Angle",
    "Cosine",
    "Normalized Cosine",
    "Hamming",
    "Jaccard",
]

for distance_type in distance_types:
    try:
        print()
        run_test(distance_type)
    except Exception as e:
        print(f"{distance_type=:<20}Error: {e}")
        traceback.print_exc()
    else:
        print(f"{distance_type=:<20}OK")
    finally:
        # Clean up the temporary directory
        import pathlib, shutil
        p = pathlib.Path("./tmp")
        if p.exists():
            shutil.rmtree(p)

This produces the following output:

distance_type=L1                  OK

distance_type=L2                  OK

distance_type=Normalized L2       Error: /NGT/release/NGT/lib/NGT/Index.h:remove:1544: Not found the specified id
Traceback (most recent call last):
  File "/home/caleb/tmp/ngtbugpy/main.py", line 43, in <module>
    run_test(distance_type)
  File "/home/caleb/tmp/ngtbugpy/main.py", line 22, in run_test
    index.remove(id1)
RuntimeError: /NGT/release/NGT/lib/NGT/Index.h:remove:1544: Not found the specified id

distance_type=Angle               OK

distance_type=Normalized Angle    Error: /NGT/release/NGT/lib/NGT/Index.h:remove:1544: Not found the specified id
Traceback (most recent call last):
  File "/home/caleb/tmp/ngtbugpy/main.py", line 43, in <module>
    run_test(distance_type)
  File "/home/caleb/tmp/ngtbugpy/main.py", line 22, in run_test
    index.remove(id1)
RuntimeError: /NGT/release/NGT/lib/NGT/Index.h:remove:1544: Not found the specified id

distance_type=Cosine              OK

distance_type=Normalized Cosine   Error: /NGT/release/NGT/lib/NGT/Index.h:remove:1544: Not found the specified id
Traceback (most recent call last):
  File "/home/caleb/tmp/ngtbugpy/main.py", line 43, in <module>
    run_test(distance_type)
  File "/home/caleb/tmp/ngtbugpy/main.py", line 22, in run_test
    index.remove(id1)
RuntimeError: /NGT/release/NGT/lib/NGT/Index.h:remove:1544: Not found the specified id

distance_type=Hamming             OK

distance_type=Jaccard             OK

Comments

cjrh commented 1 year ago

Note the results are the same if zero_based_numbering is set to True.

masajiro commented 1 year ago

Thank you for your feedback. I was able to reproduce the issue you mentioned. However, it will take a little while to fix it, because of its complicated situation.

masajiro commented 1 year ago

I have released v2.1.5 to resolve this issue.

lerouxrgd commented 1 year ago

I have integrated this change to ngt-rs (on master branch), however it looks like there is still an issue with Lorenz and SparseJaccard as I get the following error:

cargo test

test ngt::index::tests::test_dist_sparse_jaccard ... FAILED
test ngt::index::tests::test_dist_lorentz ... FAILED

failures:

---- ngt::index::tests::test_dist_sparse_jaccard stdout ----
Error: Error("Capi : ngt_insert_index_as_float() : Error: /home/rgd/dev/projects/ngt-rs/ngt-sys/NGT/lib/NGT/ObjectRepository.h:allocatePersistentObject:345: ObjectSpace::allocatePersistentObject: Fatal error! The dimensionality is invalid. The specified dimensionality=3. The specified object=2.")

---- ngt::index::tests::test_dist_lorentz stdout ----
thread 'ngt::index::tests::test_dist_lorentz' panicked at src/ngt/index.rs:728:23:
index out of bounds: the len is 0 but the index is 0
masajiro commented 1 year ago

These two distances are different from other distances. When using distances, the given vector data must follow the specific manners for the two distances. You may want you to remove the two distances from your test.

lerouxrgd commented 1 year ago

Thanks for the precision, I have removed them from the test and released the fix.

cjrh commented 1 year ago

Thanks everyone 👏🏼