yahoojapan / NGT

Nearest Neighbor Search with Neighborhood Graph and Tree for High-dimensional Data
Apache License 2.0
1.22k stars 112 forks source link

Custom distances #21

Open MilesCranmer opened 5 years ago

MilesCranmer commented 5 years ago

Hi, I was wondering if you were planning to add support for custom distance_type's in the python API? I have a complicated function in python that I wish to use as a custom distance metric and am interested in the scalability of this package.

For example, I would like to set distance_type=func. As with sklearn's NearestNeighbors, the function typically has arguments for each point input (can be input as 2D numpy vectors). Would this work for NGT?

def func(x1, x2): #L1
    return np.sum(np.abs(x1-x2), axis=1)
masajiro commented 5 years ago

Currently I have no plan to implement a custom distance. But, since I know that a custom distance is important for some applications, I would like to implement a custom distance into C++ API in future. But, I do not think that a custom distance defined by slow python is an effective way, because the distance function largely affects the search time.

MilesCranmer commented 5 years ago

Thanks. Just to check, if I want to add a custom distance myself in the C++ without adding all the architecture for it, could I potentially just replace the following lines: https://github.com/yahoojapan/NGT/blob/79ba11c2ddec2554976d47fdfb465848dd5bf2b8/lib/NGT/PrimitiveComparator.h#L48-L65 and refer to my custom distance as "l2" ? (My distance is similar to L2) Or would there be a lot more I would need to replace?

Cheers, Miles

masajiro commented 5 years ago

You are right. You do not need to update other part of the source code. But you have to care about implementing your function. Objects (vector data) are automatically stored as a 16-byte boundary object with zero padding in the index. https://github.com/yahoojapan/NGT/blob/79ba11c2ddec2554976d47fdfb465848dd5bf2b8/lib/NGT/PrimitiveComparator.h#L48 The size above is the byte length of the 16-byte boundary object. You cannot get the original size of the object in the function. For example, when you specify 3-dimensional floating point object to create an empty index, the size of the function’s parameter is 16 and the added forth floating point of the object is always zero.

masajiro commented 5 years ago

If you use the default cmake definitions, you have to replace https://github.com/yahoojapan/NGT/blob/79ba11c2ddec2554976d47fdfb465848dd5bf2b8/lib/NGT/PrimitiveComparator.h#L78 or https://github.com/yahoojapan/NGT/blob/79ba11c2ddec2554976d47fdfb465848dd5bf2b8/lib/NGT/PrimitiveComparator.h#L118 , because AVX must be available.

raulcarlomagno commented 3 years ago

is there any plan to add support for braycurtis and canberra distances? thanks

masajiro commented 3 years ago

We have no plan to implement braycurtis and canberra distances. However, it might not be so difficult to implement the distances by yourself referring to this PR #91.