yzhao062 / pyod

A Python Library for Outlier and Anomaly Detection, Integrating Classical and Deep Learning Techniques
http://pyod.readthedocs.io
BSD 2-Clause "Simplified" License

outlier score highly correlated with overall distance to the origin #64

Closed flycloudking closed 5 years ago

flycloudking commented 5 years ago

I calculated the distance of each data point to the origin using np.linalg.norm(x), where x is a single multivariate sample, then normalized all these values to 0-1; I call this the 'global_score'. When I compare the global score to the scores from different methods, it turns out to be highly correlated (0.99) with PCA, autoencoder, CBLOF, and KNN. So it seems all these methods are just calculating the overall distance of the samples, instead of finding anomalies relative to multiple clusters. I was very troubled by this fact and hope you can confirm whether it is true and, if it is, what the reason for it is.

Thanks

yzhao062 commented 5 years ago

Could you share the code and how to reproduce it? I guess it is not how you view it, as the autoencoder should not even behave that way... Thanks, Yue

flycloudking commented 5 years ago

Hi Yue, my code is as follows. The input data frame is just multivariate data, such as spending amounts in 20 different categories. I standardized it to mean 0 and variance 1; overall_dist is the distance from each data point to the origin, min-max normalized to 0-1.

    import numpy as np
    from sklearn.preprocessing import StandardScaler

    # remove zero-variance columns; this can happen in a subset of the data
    df = df.loc[:, df.var() > 0.0]

    scaler = StandardScaler()
    df_scaled = scaler.fit(df).transform(df)

    # distance of each row to the origin, min-max normalized to [0, 1]
    overall_dist = np.array([np.linalg.norm(x) for x in df_scaled])
    overall_dist = (overall_dist - overall_dist.min()) / (overall_dist.max() - overall_dist.min())

Then df_scaled is used for scoring with all the methods in PyOD. As it turns out, this overall_dist is highly correlated with the scores of several methods. I was surprised, as I thought all the methods should be able to detect local anomalies; instead, they all seem to simply identify global anomalies. I hope this is clear. I haven't tested this on the sample data in PyOD. Thanks, Sean


yzhao062 commented 5 years ago

I still cannot reproduce the result. What about this?

    from pyod.utils.data import generate_data

    contamination = 0.1  # percentage of outliers
    n_train = 1000  # number of training points
    n_test = 100  # number of testing points

    # Generate sample data
    X_train, y_train, X_test, y_test = \
        generate_data(n_train=n_train,
                      n_test=n_test,
                      n_features=20,
                      contamination=contamination,
                      random_state=42)

The code above will create some numerical data for you; X_train has shape [1000, 20]. Then you could add your code below it to show how you process the data. Please respond on GitHub, otherwise the formatting is lost.

flycloudking commented 5 years ago

Hi Yue,

Please see the code below; I just added the library imports, and I use seaborn for the plot. This time the scores are negatively correlated, I'm not sure why, but the correlation is strong.

    import numpy as np
    import pandas as pd
    import seaborn as sns
    from pyod.models.pca import PCA
    from pyod.utils.data import generate_data

    contamination = 0.1  # percentage of outliers
    n_train = 1000  # number of training points
    n_test = 100  # number of testing points

    # Generate sample data
    X_train, y_train, X_test, y_test = \
        generate_data(n_train=n_train,
                      n_test=n_test,
                      n_features=20,
                      contamination=contamination,
                      random_state=42)

    print(X_train.shape)

    # global score: distance to the origin, min-max normalized to [0, 1]
    overall_dist = np.array([np.linalg.norm(x) for x in X_train])
    overall_dist = (overall_dist - overall_dist.min()) / (overall_dist.max() - overall_dist.min())

    clf_name = 'PCA'
    clf = PCA()
    clf.fit(X_train)

    # normalize the score to 0-1
    mm_score = (clf.decision_scores_ - clf.decision_scores_.min()) / \
        (clf.decision_scores_.max() - clf.decision_scores_.min())

    # get the prediction labels and outlier scores of the training data
    score_df = pd.DataFrame({clf_name + '_outlier': clf.labels_,
                             clf_name + '_score': mm_score.flatten()})
    score_df['global_score'] = overall_dist

    print(score_df[clf_name + '_outlier'].value_counts(dropna=False))
    display(score_df.head())  # display() is available in Jupyter/IPython

    print(score_df.global_score.corr(score_df[clf_name + '_score']))
    sns.jointplot(x='global_score', y=clf_name + '_score', data=score_df)

yzhao062 commented 5 years ago

I just created an example for you. Download it here.

In this example, I checked the Pearson correlation between the distance and the scores of PCA, IForest, and KNN. The correlations are not that high when the dimension is low (d == 2); when the dimension is high, there can be a high correlation. This is an interesting observation, but I guess it is caused by the nature of the dataset: it is a simple dataset and the outlier pattern is clear.
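
For concreteness, here is a minimal sketch of that kind of check (this is not the linked example; the model list and dimensions here are just illustrative assumptions):

    import numpy as np
    from pyod.models.iforest import IForest
    from pyod.models.knn import KNN
    from pyod.models.pca import PCA
    from pyod.utils.data import generate_data

    for n_features in (2, 20):
        X_train, y_train, X_test, y_test = \
            generate_data(n_train=1000, n_test=100,
                          n_features=n_features,
                          contamination=0.1, random_state=42)

        # distance of each training point to the origin (the "global score";
        # min-max scaling does not change Pearson correlation, so it is skipped)
        overall_dist = np.linalg.norm(X_train, axis=1)

        for name, clf in (('PCA', PCA()), ('IForest', IForest()), ('KNN', KNN())):
            clf.fit(X_train)
            # Pearson correlation between global distance and raw outlier scores
            corr = np.corrcoef(overall_dist, clf.decision_scores_)[0, 1]
            print('d=%d %s: corr=%.3f' % (n_features, name, corr))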

So I provided another example (lines 41-67 in that file) with real data; you can comment out lines 22-37 to run it. You can see that this high correlation is not that serious on real-world datasets, so the phenomenon is data dependent.

Hope this helps.

flycloudking commented 5 years ago

I also realized that standardizing the data has a big impact on the results. Simply adding

    from sklearn.preprocessing import StandardScaler

    X_train = StandardScaler().fit_transform(X_train)

before the anomaly detection makes the correlation even higher; maybe the simulated data are not supposed to be standardized.
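
A quick way to see this effect (a minimal sketch reusing the X_train from the generated data above; the before/after comparison loop is my own, not from the thread):

    import numpy as np
    from pyod.models.pca import PCA
    from sklearn.preprocessing import StandardScaler

    for scaled in (False, True):
        # fit on the raw data, then on the standardized data
        X = StandardScaler().fit_transform(X_train) if scaled else X_train
        clf = PCA().fit(X)
        dist = np.linalg.norm(X, axis=1)
        print('scaled=%s corr=%.3f'
              % (scaled, np.corrcoef(dist, clf.decision_scores_)[0, 1]))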

yzhao062 commented 5 years ago

Yeah. So I believe there is no need to worry about the high correlation, which is data dependent. I will close this issue if you are happy with it :)

flycloudking commented 5 years ago

Not totally. I understand it may depend on the data. I have tested this on several of my data sets; these are real spending data from different countries. Some countries have reasonable clusters, some have no good clusters at all. All of them show a high correlation between the global distance and the PCA and autoencoder scores. I guess I was a bit disappointed, as I was hoping the autoencoder would provide a better method for anomaly detection, but it turns out to be just the same as a distance measure. Anyway, thanks for spending the time to investigate this. I appreciate your help.

flycloudking commented 5 years ago

Can I ask another favor? I have used the HDBSCAN method (https://hdbscan.readthedocs.io/en/latest/index.html) for anomaly detection. Do you think you could add the method to PyOD? Thanks, Sean
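
For context, the hdbscan library exposes GLOSH outlier scores after clustering; a minimal sketch of that usage (the min_cluster_size value and the 90% threshold are arbitrary choices, not recommendations):

    import hdbscan
    import numpy as np

    # fit the clusterer; outlier_scores_ holds GLOSH scores (higher = more anomalous)
    clusterer = hdbscan.HDBSCAN(min_cluster_size=15).fit(X_train)
    scores = clusterer.outlier_scores_

    # flag the top 10% of scores as outliers
    threshold = np.quantile(scores, 0.90)
    outlier_labels = (scores >= threshold).astype(int)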


yzhao062 commented 5 years ago

Cool. I will close this thread, but feel free to open a separate one for the feature request :)