yzhao062 / pyod

A Python Library for Outlier and Anomaly Detection, Integrating Classical and Deep Learning Techniques
http://pyod.readthedocs.io
BSD 2-Clause "Simplified" License
8.55k stars 1.37k forks source link

ECOD mixes train set and test set #401

Open mbongaerts opened 2 years ago

mbongaerts commented 2 years ago

Dear contributor(s),

I have a question with respect to the implementation of the ECOD algorithm.

It seems like (if I am not mistaken) that this implementation is mixing train set and test set, when decision_function(X) is called. So, during fitting the train set is nicely stored: https://github.com/yzhao062/pyod/blob/5f5064ef9145130a31ce5510cee44dee520cb119/pyod/models/ecod.py#L123

When a test set in used to obtain the outlier scores, X_train gets concatenated: https://github.com/yzhao062/pyod/blob/5f5064ef9145130a31ce5510cee44dee520cb119/pyod/models/ecod.py#L145

In the next steps, it looks like previously fitted parameters (when calling fit()) are overwritten by newly obtained parameters based on the concatenated X (train set+ test set): https://github.com/yzhao062/pyod/blob/5f5064ef9145130a31ce5510cee44dee520cb119/pyod/models/ecod.py#L127-L160

This behavior seems to be wrong, since now test set and train set are not nicely separated, which in general should be the case. I would be happy to receive some clarification about his.

Kind regards

yzhao062 commented 2 years ago

No, the behaviror is expected. Outlier detection (OD) is indeed a special ML field that train/test split is not always possible.

For instance, given there is no label for evaluation, the detection process is always one shot, which is transductive. There is no test data.

Under this circumstances, it is general idea that the "prediction on the new data" should be combined with the "so-called" train data and detect again. That is how ECOD works.

However, for some algorithms, it is possible to do cheaper "prediction" by keeping this key information for prediction. Whether to retain the train data depends on the algorithm itself.

Hope this helps.

mbongaerts commented 2 years ago

Similar to my comment in https://github.com/yzhao062/pyod/issues/395 I would suggest to change the docsstring: https://github.com/yzhao062/pyod/blob/5f5064ef9145130a31ce5510cee44dee520cb119/pyod/models/ecod.py#L128

where 'fitted detector' should be removed, since this could mislead the user by thinking learning parameters were previously learned from the train set.

Another question/concerns involves the following line: https://github.com/yzhao062/pyod/blob/5f5064ef9145130a31ce5510cee44dee520cb119/pyod/models/ecod.py#L145

What will be the behavior of this method, when you pass X_train also via decision_function()? In other words, you concatenate the train set twice, which results in duplicated rows/samples. I am note sure if this is a type of behavior you want to allow. I am happy to hear from you.

Kind regards.