yzhao062 / pyod

A Python Library for Outlier and Anomaly Detection, Integrating Classical and Deep Learning Techniques
http://pyod.readthedocs.io
BSD 2-Clause "Simplified" License

knn metric cosine ValueError: Unrecognized metric 'cosine' #221

Open dreamflasher opened 4 years ago

dreamflasher commented 4 years ago
from pyod.models.knn import KNN

clf = KNN(metric="cosine")
clf.fit(train_data)

results in

sklearn/neighbors/_binary_tree.pxi in sklearn.neighbors._ball_tree.BinaryTree.__init__()

sklearn/neighbors/_dist_metrics.pyx in sklearn.neighbors._dist_metrics.DistanceMetric.get_metric()

ValueError: Unrecognized metric 'cosine'

The brute and kd_tree algorithms don't support cosine either, which means cosine is not supported by any of the available algorithms.

yzhao062 commented 4 years ago

Thanks for the note. PR is welcomed :)

jackie930 commented 4 years ago

I was trying to contribute on this and did a bit of research; my findings are below. Given this, I'm wondering whether this PR is still necessary.

The cosine similarity is generally defined as x^T y / (||x|| ||y||); it outputs 1 if the two vectors are identical and approaches -1 if they are completely opposite. This definition is not technically a metric, so you can't use accelerating structures like ball trees and k-d trees with it. If you force scikit-learn to use the brute-force approach, you should be able to use it as a distance by passing your own custom distance metric object. There are also methods of transforming the cosine similarity into a valid distance metric if you would like to use ball trees (you can find one in the JSAT library).
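A minimal sketch of the brute-force workaround described above, using scikit-learn's NearestNeighbors directly with a hypothetical callable cosine-distance function (this is an illustration, not pyod's API):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def cosine_distance(x, y):
    # Cosine *distance* = 1 - cosine similarity; not a true metric,
    # so it only works with the brute-force algorithm.
    return 1.0 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

rng = np.random.RandomState(42)
X = rng.rand(20, 5)

# algorithm="brute" is required: ball_tree/kd_tree reject non-metric callables
nn = NearestNeighbors(n_neighbors=3, algorithm="brute", metric=cosine_distance)
nn.fit(X)
dist, ind = nn.kneighbors(X[:1])
```

Each query point's nearest neighbor is itself, at cosine distance zero.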

Notice, though, that x^T y / (||x|| ||y||) = (x/||x||)^T (y/||y||). The Euclidean distance can be equivalently written as sqrt(x^T x + y^T y − 2 x^T y). If we normalize every data point before giving it to the KNeighborsClassifier, then x^T x = 1 for all x, so the Euclidean distance reduces to sqrt(2 − 2 x^T y). For identical inputs we get sqrt(2 − 2·1) = 0, and for complete opposites sqrt(2 − 2·(−1)) = 2. Since sqrt(2 − 2t) is a monotone function of t, you get the same neighbor ordering as the cosine distance by normalizing your data and then using the Euclidean distance. So long as you use the uniform weights option, the results will be identical to having used a correct cosine distance.
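The equivalence above can be checked numerically: after row-normalizing the data, Euclidean nearest neighbors come back in the same order as a direct cosine-distance ranking. A small sketch using scikit-learn and SciPy (illustrative names only):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from scipy.spatial.distance import cdist

rng = np.random.RandomState(0)
X = rng.randn(50, 8)

# Normalize each row to unit length; Euclidean distance then becomes
# sqrt(2 - 2 * cos_sim), a monotone transform of the cosine distance.
Xn = X / np.linalg.norm(X, axis=1, keepdims=True)

nn = NearestNeighbors(n_neighbors=5, metric="euclidean").fit(Xn)
_, euclid_idx = nn.kneighbors(Xn[:1])

# Direct cosine-distance ranking on the *unnormalized* data
# (cosine distance is scale-invariant, so normalization doesn't matter here)
cos_idx = np.argsort(cdist(X[:1], X, metric="cosine"), axis=1)[:, :5]
```

With uniform weights, any KNN score built on these neighbors is therefore unchanged by the substitution.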

metesynnada commented 2 years ago

If this PR is not necessary, please update the documentation and remove the "cosine" option, since it contradicts the class documentation as it stands.