Open mbongaerts opened 2 years ago
No, the behaviror is expected. Outlier detection (OD) is indeed a special ML field that train/test split is not always possible.
For instance, given there is no label for evaluation, the detection process is always one shot, which is transductive. There is no test data.
Under this circumstances, it is general idea that the "prediction on the new data" should be combined with the "so-called" train data and detect again. That is how ECOD works.
However, for some algorithms, it is possible to do cheaper "prediction" by keeping this key information for prediction. Whether to retain the train data depends on the algorithm itself.
Hope this helps.
Similar to my comment in https://github.com/yzhao062/pyod/issues/395 I would suggest to change the docsstring: https://github.com/yzhao062/pyod/blob/5f5064ef9145130a31ce5510cee44dee520cb119/pyod/models/ecod.py#L128
where 'fitted detector' should be removed, since this could mislead the user by thinking learning parameters were previously learned from the train set.
Another question/concerns involves the following line: https://github.com/yzhao062/pyod/blob/5f5064ef9145130a31ce5510cee44dee520cb119/pyod/models/ecod.py#L145
What will be the behavior of this method, when you pass X_train
also via decision_function()
? In other words, you concatenate the train set twice, which results in duplicated rows/samples. I am note sure if this is a type of behavior you want to allow. I am happy to hear from you.
Kind regards.
Dear contributor(s),
I have a question with respect to the implementation of the ECOD algorithm.
It seems like (if I am not mistaken) that this implementation is mixing train set and test set, when
decision_function(X)
is called. So, during fitting the train set is nicely stored: https://github.com/yzhao062/pyod/blob/5f5064ef9145130a31ce5510cee44dee520cb119/pyod/models/ecod.py#L123When a test set in used to obtain the outlier scores,
X_train
gets concatenated: https://github.com/yzhao062/pyod/blob/5f5064ef9145130a31ce5510cee44dee520cb119/pyod/models/ecod.py#L145In the next steps, it looks like previously fitted parameters (when calling
fit()
) are overwritten by newly obtained parameters based on the concatenatedX
(train set+ test set): https://github.com/yzhao062/pyod/blob/5f5064ef9145130a31ce5510cee44dee520cb119/pyod/models/ecod.py#L127-L160This behavior seems to be wrong, since now test set and train set are not nicely separated, which in general should be the case. I would be happy to receive some clarification about his.
Kind regards