flycloudking closed this issue 5 years ago
Could you share the code and how to reproduce it? I suspect it is not what it appears to be, as an autoencoder should not behave that way... Thanks, Yue
Hi Yue, my code is as follows. The input data frame is just multivariate data, such as spending amounts in 20 different categories. I standardized it to zero mean and unit variance; overall_dist is the distance from each data point to the origin, min-max normalized to [0, 1].

scaler = StandardScaler()
df_scaled = scaler.fit(df).transform(df)
overall_dist = np.array([np.linalg.norm(x) for x in df_scaled])
overall_dist = (overall_dist - overall_dist.min()) / (overall_dist.max() - overall_dist.min())

Then df_scaled is used as the input to all the PyOD methods for scoring. As it turns out, overall_dist is highly correlated with the scores of several methods. I was surprised, since I thought all the methods should be able to detect local anomalies; instead, they all seem to simply identify global anomalies. I hope this is clear. I haven't tested this on the sample data in PyOD. Thanks, Sean
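To make this concrete, here is a minimal runnable sketch of the check described above; the random stand-in DataFrame and the choice of KNN as the detector are illustrative assumptions, not the exact original setup:

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from pyod.models.knn import KNN

# stand-in for real spending data across 20 categories
rng = np.random.RandomState(0)
df = pd.DataFrame(rng.lognormal(size=(1000, 20)))

df_scaled = StandardScaler().fit_transform(df)

# Euclidean distance of each standardized sample from the origin,
# min-max normalized to [0, 1]
overall_dist = np.linalg.norm(df_scaled, axis=1)
overall_dist = (overall_dist - overall_dist.min()) / (overall_dist.max() - overall_dist.min())

# fit a PyOD detector on the same data and correlate its scores
# with the global distance
clf = KNN().fit(df_scaled)
print(np.corrcoef(overall_dist, clf.decision_scores_)[0, 1])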
It is still impossible to reproduce the result. What about this?
from pyod.utils.data import generate_data

contamination = 0.1  # percentage of outliers
n_train = 1000  # number of training points
n_test = 100  # number of testing points

# Generate sample data
X_train, y_train, X_test, y_test = \
    generate_data(n_train=n_train,
                  n_test=n_test,
                  n_features=20,
                  contamination=contamination,
                  random_state=42)
The code above will create some numerical data for you; X_train has shape [1000, 20]. Then you could add your code below to show how you process it. Please respond on GitHub; otherwise the formatting is lost.
Hi Yue,
Please see the code below; just add the library imports. I use seaborn for the plot. This time the scores are negatively correlated, and I'm not sure why, but the correlation is strong.
import numpy as np
import pandas as pd
import seaborn as sns
from pyod.models.pca import PCA
from pyod.utils.data import generate_data

contamination = 0.1  # percentage of outliers
n_train = 1000  # number of training points
n_test = 100  # number of testing points

X_train, y_train, X_test, y_test = \
    generate_data(n_train=n_train, n_test=n_test, n_features=20,
                  contamination=contamination, random_state=42)
print(X_train.shape)

# global distance score, min-max normalized to [0, 1]
overall_dist = np.array([np.linalg.norm(x) for x in X_train])
overall_dist = (overall_dist - overall_dist.min()) / (overall_dist.max() - overall_dist.min())

clf_name = 'PCA'
clf = PCA()
clf.fit(X_train)
mm_score = (clf.decision_scores_ - clf.decision_scores_.min()) / \
    (clf.decision_scores_.max() - clf.decision_scores_.min())

score_df = pd.DataFrame({clf_name + '_outlier': clf.labels_,
                         clf_name + '_score': mm_score.flatten()})
score_df['global_score'] = overall_dist
display(score_df.head())
print(score_df.global_score.corr(score_df[clf_name + '_score']))
sns.jointplot(x='global_score', y=clf_name + '_score', data=score_df)
I just created an example for you. Download it here.
In this example, I checked the Pearson correlation between the distance and the scores of PCA, IForest, and KNN. The correlations are not that high when the dimension is low (d == 2), but when the dimension is high there can be a high correlation. This is an interesting observation, but I guess it is caused by the nature of the dataset... it is a simple dataset and the outlier pattern is clear.
So I have provided another example (lines 41-67) with real data; you could comment out lines 22-37 to run it. You can see that the high correlation is much less pronounced on real-world datasets, so the phenomenon is data dependent.
Hope this helps.
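As a rough sketch of that dimension experiment (the exact detectors and settings in the downloadable example may differ):

import numpy as np
from scipy.stats import pearsonr
from pyod.utils.data import generate_data
from pyod.models.pca import PCA
from pyod.models.iforest import IForest
from pyod.models.knn import KNN

# compare low- vs high-dimensional data
for n_features in (2, 20):
    X_train, y_train, X_test, y_test = generate_data(
        n_train=1000, n_test=100, n_features=n_features,
        contamination=0.1, random_state=42)
    dist = np.linalg.norm(X_train, axis=1)  # global distance from the origin
    for name, clf in (('PCA', PCA()), ('IForest', IForest()), ('KNN', KNN())):
        clf.fit(X_train)
        r, _ = pearsonr(dist, clf.decision_scores_)
        print(n_features, name, round(r, 3))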
I also realized that standardizing the data has a big impact on the results. Simply adding
X_train = StandardScaler().fit_transform(X_train)
before anomaly detection makes the correlation even higher; maybe the simulated data are not supposed to be standardized.
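A small self-contained sketch of that before/after comparison (PCA as the detector is just an example):

import numpy as np
from scipy.stats import pearsonr
from sklearn.preprocessing import StandardScaler
from pyod.utils.data import generate_data
from pyod.models.pca import PCA

X_train, _, _, _ = generate_data(n_train=1000, n_test=100, n_features=20,
                                 contamination=0.1, random_state=42)

# correlate distance-from-origin with PCA scores, raw vs standardized
for label, X in (('raw', X_train),
                 ('standardized', StandardScaler().fit_transform(X_train))):
    clf = PCA().fit(X)
    r, _ = pearsonr(np.linalg.norm(X, axis=1), clf.decision_scores_)
    print(label, round(r, 3))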
Yeah. So I believe there is no need to worry about the high correlation, which is data dependent. I will close this issue if you are happy with it :)
Not totally. I understand it may depend on the data. I have tested on several of my own datasets, which are real spending data from different countries: some countries have reasonable clusters, some have no good clusters at all. All of them show a high correlation between the global distance and the PCA and autoencoder scores. I guess I was a bit disappointed, as I was hoping the autoencoder might provide a better method for anomaly detection, and it turns out to be just the same as a distance measure. Anyway, thanks for spending time investigating this. I appreciate your help.
Can I ask another favor? I have used the HDBSCAN method (https://hdbscan.readthedocs.io/en/latest/index.html) for anomaly detection. Do you think you could add it to PyOD? Thanks, Sean
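For reference, a minimal sketch of how the hdbscan library exposes its built-in GLOSH outlier scores through the documented outlier_scores_ attribute; the blob data and the 10% threshold are arbitrary illustrations, and this is the standalone hdbscan package, not a PyOD API:

import numpy as np
import hdbscan
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=3, random_state=42)
clusterer = hdbscan.HDBSCAN(min_cluster_size=15).fit(X)

# outlier_scores_ gives a per-sample GLOSH outlier score;
# higher means more anomalous
scores = clusterer.outlier_scores_
threshold = np.quantile(scores, 0.9)  # e.g. flag the top 10%
outliers = np.where(scores > threshold)[0]
print(len(outliers))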
Cool, I will close this thread, but feel free to open a separate one for the feature request :)
I calculated the distance of each data point to the origin at 0 using np.linalg.norm(x), where x is a single multivariate sample, then normalized all these values to [0, 1]; I call this global_score. When I compare the global score to the scores from the different methods, it turns out to be highly correlated (0.99) with PCA, autoencoder, CBLOF, and KNN. So it seems all these methods are just calculating the overall distance of the samples, instead of finding anomalies relative to multiple clusters. I am very troubled by this and hope you can confirm whether it is true and, if so, what the reason is.
Thanks