yahoojapan / NGT

Nearest Neighbor Search with Neighborhood Graph and Tree for High-dimensional Data
Apache License 2.0

index.remove(idx) - Graph::removeEdgeReliably: Lost conectivity! #64

Closed ejdibs closed 3 years ago

ejdibs commented 4 years ago

Hello,

I would like to ask a question about removing nodes from an existing index.

When I remove nodes from my index, I receive errors like this:

Graph::removeEdgeReliably: Lost conectivity! Isn't this ANNG? ID=2 anyway continue...

and also:

/NGT-1.9.1/lib/NGT/Index.h:1622: remove:: cannot remove from tree. id=5
/NGT-1.9.1/lib/NGT/Tree.h:191: VpTree::remove: Inner error. Cannot remove object. leafNode=1693:
/NGT-1.9.1/lib/NGT/Node.cpp:260: VpTree::Leaf::remove: Cannot find the specified object. ID=5,0 idx=21 If the same objects were inserted into the index, ignore this message

After removing a subset of nodes from an existing index, will running index.build_index() restore connectivity?

If I receive failures when attempting to remove a node, do I need to build a new index using a dataset that excludes the node(s)?

An example of my current usage could be:

# initial creation
# 65 thousand items added to index, index built and saved
# at a later time, we update the index

# load the existing index within a python3 application
index = ngtpy.Index(index_path)

# process that updates the index
index.remove(0)
# Graph::removeEdgeReliably: Lost conectivity! Isn't this ANNG? ID=2 anyway continue...
index.remove(1)
index.remove(2)
# /NGT-1.9.1/lib/NGT/Index.h:1622: remove:: cannot remove from tree. id=5
# /NGT-1.9.1/lib/NGT/Tree.h:191: VpTree::remove: Inner error. Cannot remove object. leafNode=1693:
# /NGT-1.9.1/lib/NGT/Node.cpp:260: VpTree::Leaf::remove: Cannot find the specified object. ID=5,0 idx=21 If the same objects were inserted into the index, ignore this message

# in my test I removed the first 10 nodes I previously added when building the index

# at end of update process
index.build_index()
index.save()

# Is connectivity repaired for our index? Were all nodes removed successfully?

My current test index is 65 thousand nodes. My use case is to create an index of 3-5 million nodes, then periodically remove a small subset of nodes and add new ones. I am hoping to avoid re-inserting millions of nodes and rebuilding a fresh index every time.
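The update cycle described above could be sketched as follows. Note that `update_index` is a hypothetical helper of my own, not part of ngtpy; it only assumes the ngtpy-style methods already used in the snippet (remove, batch_insert, build_index, save):

```python
def update_index(index, stale_ids, new_vectors):
    """Remove a small set of stale nodes, add new ones, and persist.

    `index` is any index-like object exposing the ngtpy-style methods
    used above: remove(), batch_insert(), build_index(), save().
    """
    for object_id in stale_ids:
        index.remove(object_id)          # drop stale nodes one by one
    if new_vectors:
        index.batch_insert(new_vectors)  # add the newly arrived vectors
        index.build_index()              # build graph/tree entries for the inserts
    index.save()                         # persist the updated index to disk
```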

Related issues that mention support for adding and removing nodes:

https://github.com/yahoojapan/NGT/issues/38
https://github.com/yahoojapan/NGT/issues/19

Thank you for your time and consideration. Please let me know if there is any additional information that would help address my question.

Regards, Erik

masajiro commented 4 years ago

Hello,

First, I am wondering why the inconsistency occurred. Do you have any ideas about that? If not, could you answer the following questions?

- What is your OS?
- Did you install ngtpy from PyPI, or install it with setup.py?
- Did you build and install NGT (not ngtpy) with the shared memory option?
- Did you reconstruct ONNG from a default index with the optimizer?
- Did you insert and remove from multiple threads without locking?
- Could you run the command below to check your index and send the result?

ngt info -m a [index-folder]

We use NGT in our company's services to build an index of more than 1M objects and continuously update it. However, inconsistency has not occurred except at the beginning of the service. At the moment, there are no functions to fix index inconsistency. Since I understand the demand, I will consider implementing such functions.

ejdibs commented 4 years ago

Hello,

Thank you for your response. I believe that I was able to find the cause of the inconsistency.

I was creating a tar.gz of the index and backing it up after executing these methods:

index.build_index()
index.save()

Archiving the index immediately after running that sequence of methods, without closing it, was a reliable way to produce an inconsistent index.

I reread the ngtpy API and reviewed some of the source code and noticed the close() method.

I updated my logic to close the index before attempting to archive the index directory.

index.build_index()
index.save()
index.close()

I have found that closing the index before creating an archive prevents the inconsistency.

I am now able to load and modify archived indexes without errors.

Thank you for taking the time to communicate with me. If you have any other questions, please let me know.

Regards, Erik

masajiro commented 4 years ago

Hello, if you use NGT built with the shared memory option, calling close is required. However, if you use NGT without the shared memory option, I think that skipping close just causes a memory leak. Which one do you use?

ejdibs commented 4 years ago

I build NGT like this:

cd NGT-1.9.1 \
    && mkdir build \
    && cd build \
    && cmake .. \
    && make \
    && make install \
    && ldconfig /usr/local/lib

I believe that this excludes the shared memory option.

For now closing the index before archiving data is fine for my use case. But if this is useful information and you have additional questions, please let me know.

masajiro commented 4 years ago

Thank you for your information. It seems you do not use the shared-memory version of NGT.

Although I tried various things, I was not able to reproduce the bug. If you have time, could you reproduce an index (as an archive) from which you cannot remove objects and send it to me?

BTW, after only removing objects, you do not need to call build_index; remove() updates the index as well.
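So a removal-only update can be reduced to the sketch below. As before, `remove_only_update` is a hypothetical name of mine, assuming only the ngtpy-style remove() and save() methods:

```python
def remove_only_update(index, stale_ids):
    """Remove objects and persist the index.

    No build_index() call is needed here, since remove() maintains
    the graph and tree incrementally.
    """
    for object_id in stale_ids:
        index.remove(object_id)
    index.save()
```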