yahoojapan / NGT

Nearest Neighbor Search with Neighborhood Graph and Tree for High-dimensional Data
Apache License 2.0

Bad performance with empty data #93

Open iharshulhan opened 3 years ago

iharshulhan commented 3 years ago

Building the index with a huge number of empty vectors is very slow and may result in poor search performance. I would suggest either handling this case separately or showing a warning to the user.

masajiro commented 3 years ago

Could you tell me more about your situation? What do you mean by an empty vector? Is the vector {} or {0.0, ..., 0.0}? What is the number of dimensions of the empty vector? Which distance function do you use for the empty vectors?

iharshulhan commented 3 years ago

I meant vectors of all zeros, {0.0, ..., 0.0}. The vectors had 500 dimensions, there were ~3.5 million of them in total, and I used the cosine similarity function.

I believe this is also the case for vectors with a single nonzero element, like {0, 1, ..., 0}. The index got stuck at query time for such vectors.
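
For illustration, here is a minimal sketch of the "handle separately or warn" idea applied before indexing. It assumes the vectors are plain Python lists of floats; `split_zero_vectors` and its `eps` parameter are hypothetical names and are not part of NGT's API, and this is not a description of how NGT treats such input. The point is only that an all-zero vector has an L2 norm of zero, so cosine similarity is undefined for it and it can be set aside before the index is built.

```python
import math
import warnings


def split_zero_vectors(vectors, eps=1e-12):
    """Separate vectors with (near-)zero L2 norm from the rest.

    Hypothetical helper, not part of NGT.
    Returns (kept, zero_ids):
      kept     - list of (original_index, vector) pairs with norm > eps
      zero_ids - original indices of vectors with norm <= eps
    """
    kept, zero_ids = [], []
    for i, v in enumerate(vectors):
        norm = math.sqrt(sum(x * x for x in v))
        if norm <= eps:
            zero_ids.append(i)
        else:
            kept.append((i, v))
    if zero_ids:
        # Warn once, with a count, instead of failing or silently indexing.
        warnings.warn(
            f"{len(zero_ids)} of {len(vectors)} vectors have zero norm; "
            "cosine similarity is undefined for them"
        )
    return kept, zero_ids
```

The kept vectors could then be passed to the index, with the original indices retained so that search results can be mapped back to the full data set.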