wmayner / pyemd

Fast EMD for Python: a wrapper for Pele and Werman's C++ implementation of the Earth Mover's Distance metric
MIT License

API method for EMD on two arrays? #25

Closed. scottgigante closed this issue 6 years ago

scottgigante commented 6 years ago

Hello,

I believe users would find it useful to have a built-in method for calculating the EMD statistic on two arrays without having to build the histograms and distance matrix themselves.

I've written a simple function to do this myself - I'm happy to write it up properly and submit a pull request if you're happy to incorporate it into the API.

from pyemd import emd
import numpy as np
from sklearn.metrics.pairwise import pairwise_distances

def array_emd(x, y, bins=30):
    # Use one shared range so both histograms have identical bin edges.
    xy_min = min(np.min(x), np.min(y))
    xy_max = max(np.max(x), np.max(y))
    x_hist, bin_edges = np.histogram(x, range=(xy_min, xy_max), bins=bins)
    y_hist, _ = np.histogram(y, range=(xy_min, xy_max), bins=bins)

    # Ground distance: pairwise distance between the bin midpoints.
    bin_middles = np.mean([bin_edges[:-1], bin_edges[1:]], axis=0)
    # pyemd expects float64 arrays, while np.histogram returns integer counts.
    x_hist = x_hist.astype(dtype="float64")
    y_hist = y_hist.astype(dtype="float64")
    dist_mat = pairwise_distances(bin_middles.reshape(-1, 1))
    return emd(x_hist, y_hist, dist_mat)

wmayner commented 6 years ago

Thanks for the suggestion! Sure, I'd be happy to look at a PR for creating histograms automatically.

However, I'm reluctant to add dependencies like sklearn or scipy, as these are quite heavy. If you feel that creating the distance matrix automatically is crucial, it may make sense to create a separate package that implements this instead, with pyemd as a dependency.

Also, in your function, you would probably want to add a metric='euclidean' keyword argument that gets passed to pairwise_distances, in case the user wants a different metric.

scottgigante commented 6 years ago

Both good points. I would happily drop those dependencies as a pairwise distance matrix is pretty easy to implement - however, without scipy I would probably only include euclidean distance. Perhaps an option to pass in a custom metric function (e.g. partial(scipy.spatial.distance.pdist, metric='cosine')) with default being an internal implementation of euclidean distance would be best?
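A minimal sketch of the interface proposed here, assuming only numpy is available. The names `histograms_and_distances` and `_abs_difference_matrix` are hypothetical and are not part of pyemd. The `metric` argument is any callable that takes the bin midpoints as an `(n, 1)` column and returns a full `(n, n)` matrix; the default relies on the fact that for one-dimensional midpoints the Euclidean distance is simply the absolute difference.

```python
import numpy as np


def _abs_difference_matrix(centers):
    """Default metric: Euclidean distance between 1-D bin midpoints.

    `centers` is an (n, 1) column, so broadcasting against its transpose
    yields the full (n, n) matrix of absolute differences.
    """
    return np.abs(centers - centers.T)


def histograms_and_distances(x, y, bins=30, metric=None):
    """Build two histograms on shared bins plus a ground-distance matrix.

    Hypothetical helper: `metric` is a callable mapping the (n, 1) array of
    bin midpoints to an (n, n) distance matrix.
    """
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    lo = min(x.min(), y.min())
    hi = max(x.max(), y.max())
    # A shared range gives both histograms identical bin edges.
    x_hist, edges = np.histogram(x, bins=bins, range=(lo, hi))
    y_hist, _ = np.histogram(y, bins=bins, range=(lo, hi))
    centers = ((edges[:-1] + edges[1:]) / 2).reshape(-1, 1)
    if metric is None:
        metric = _abs_difference_matrix
    # Cast everything to C-contiguous float64 before handing it on.
    dist = np.ascontiguousarray(metric(centers), dtype=np.float64)
    return x_hist.astype(np.float64), y_hist.astype(np.float64), dist
```

The three returned arrays could then be passed to `emd(x_hist, y_hist, dist)`. One caveat for custom metrics: `scipy.spatial.distance.pdist` returns a condensed vector rather than a square matrix, so it would have to be wrapped with `scipy.spatial.distance.squareform` before being used this way.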

wmayner commented 6 years ago

That sounds great. For the Euclidean distance default, you can just use numpy.linalg.norm(x - y).
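As a sketch of how `numpy.linalg.norm` could produce the whole matrix at once: the function takes a single array (its second positional argument is `ord`, not another point), so the distance between two points is the norm of their difference. Broadcasting builds every pairwise difference in one step. The helper name `pairwise_euclidean` is hypothetical.

```python
import numpy as np


def pairwise_euclidean(points):
    """Return the (n, n) matrix of Euclidean distances between n points.

    Accepts either a 1-D array of n scalars or an (n, d) array of vectors.
    """
    points = np.asarray(points, dtype=np.float64)
    if points.ndim == 1:
        # Treat scalars as one-dimensional vectors.
        points = points.reshape(-1, 1)
    # diff[i, j] is the vector from point j to point i, shape (n, n, d).
    diff = points[:, np.newaxis, :] - points[np.newaxis, :, :]
    # Take the vector norm along the last axis to get scalar distances.
    return np.linalg.norm(diff, axis=-1)
```

This materialises an `(n, n, d)` intermediate array, which is fine for a few dozen histogram bins but would be wasteful for very large `n`.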

zhouzhouha commented 3 months ago

> Also, in your function, you would probably want to add a metric='euclidean' keyword argument that gets passed to pairwise_distances, in case the user wants a different metric.

I have a question about the distance matrix: does it have to be the distance between the midpoints of the bins, or can it be defined as the distance between the centroids of the corresponding bins? In my case, I am using a colored point cloud: I build the histogram from the intensity (color) of the points, and I would like to use the spatial information as well. Can I use the Euclidean distance between the centroids of the points in corresponding bins, i.e. sqrt((x1-y1)^2 + (x2-y2)^2 + (x3-y3)^2)?
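The construction described in this question could be sketched as follows, using numpy only. This is an illustration of the idea, not an answer from the maintainers: the function names are hypothetical, and whether such a matrix is a suitable ground distance is left open. In particular, a matrix of distances between the centroids of one cloud and the centroids of another is generally not symmetric and need not have a zero diagonal, and bins with no points have no centroid at all (marked `NaN` below).

```python
import numpy as np


def bin_centroids(points, intensity, edges):
    """Spatial centroid of the points falling in each intensity bin.

    points:    (N, 3) array of coordinates
    intensity: (N,) array used to assign each point to a bin
    edges:     bin edges, as returned by np.histogram
    Returns an (n_bins, 3) array; rows for empty bins are NaN.
    """
    points = np.asarray(points, dtype=np.float64)
    intensity = np.asarray(intensity, dtype=np.float64)
    n_bins = len(edges) - 1
    # np.digitize is 1-based; clip so out-of-range values and the top edge
    # fall into the first or last bin.
    idx = np.clip(np.digitize(intensity, edges) - 1, 0, n_bins - 1)
    centroids = np.full((n_bins, points.shape[1]), np.nan)
    for b in range(n_bins):
        members = points[idx == b]
        if len(members):
            centroids[b] = members.mean(axis=0)
    return centroids


def centroid_distance_matrix(centroids_a, centroids_b):
    """D[i, j] = Euclidean distance from centroid i of A to centroid j of B."""
    diff = centroids_a[:, np.newaxis, :] - centroids_b[np.newaxis, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))
```

Any `NaN` rows would need to be handled (for example by dropping or merging empty bins) before the matrix is used as a distance matrix.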