yzhao062 / pyod

A Python Library for Outlier and Anomaly Detection, Integrating Classical and Deep Learning Techniques
http://pyod.readthedocs.io
BSD 2-Clause "Simplified" License

Using IForest in situations where training set does not contain any anomalies #482

Open SaVoAMP opened 1 year ago

SaVoAMP commented 1 year ago

Hey,

I was reading the original paper on Isolation Forests. There the authors state that

iForest also works well in high dimensional problems ... and in situations where training set does not contain any anomalies.

Also they write in the "Empirical Evaluation" part that

It is assumed that anomaly labels are unavailable in the training stage. Anomaly labels are only available in the evaluation stage to compute the performance measure, AUC.

and the paper contains a section called "Training using normal instances only".

Since I was also trying to train an Isolation Forest without any anomalies, I was wondering why the contamination parameter of the IForest model needs to be in the interval (0., 0.5], which excludes the case of zero anomalies. I tried to work around the problem by setting a very small value (close to 0) for the contamination, but then another problem arose:

[attached screenshot: error traceback]

Is there a way around the problem so that it is possible to train exclusively on normal data that does not contain any anomalies?

yzhao062 commented 1 year ago

The error appears because your ground-truth labels for evaluation are all 0 or all 1... there is nothing to evaluate. It has nothing to do with iforest here. If you remove the evaluation function, it should run.
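For reference, a minimal sketch of such a run on synthetic data (the data and the parameter values are purely illustrative):

```python
import numpy as np
from pyod.models.iforest import IForest

rng = np.random.RandomState(0)
X_train = rng.randn(1000, 3)                      # training set with normal points only
X_test = np.vstack([rng.randn(200, 3),            # normal test points
                    rng.uniform(4, 6, (10, 3))])  # a few obvious outliers

clf = IForest(contamination=0.01, random_state=42)  # small placeholder contamination
clf.fit(X_train)                                     # fitting never looks at any labels

scores = clf.decision_function(X_test)  # raw outlier scores, no threshold involved
labels = clf.predict(X_test)            # 0/1 labels via the contamination-based threshold

# Only call an evaluation helper (e.g. ROC-AUC) if the ground truth contains both
# classes; with all-0 or all-1 labels there is nothing to rank against.
```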

SaVoAMP commented 1 year ago

All right, makes sense, thank you!

But then why isn't it allowed to set contamination=0 if I'm training without any anomalies?

yzhao062 commented 1 year ago

All right, makes sense, thank you!

But then why isn't it allowed to set contamination=0 if I'm training without any anomalies?

It is just for convenience. contamination=0 would confuse the subsequent outlier label generation process. Again, contamination is not used in running the detection algorithm.
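For context, a rough sketch of what that label generation amounts to: after fitting, a cut is placed on the training scores so that roughly a contamination fraction of samples end up flagged (a simplified illustration, not PyOD's exact code):

```python
import numpy as np

def labels_from_contamination(decision_scores, contamination):
    """Simplified view of post-fit label generation: cut the training scores so
    that about `contamination` of the samples are labeled as outliers."""
    threshold = np.percentile(decision_scores, 100 * (1 - contamination))
    return (decision_scores > threshold).astype(int), threshold

scores = np.random.RandomState(0).randn(1000)   # stand-in for clf.decision_scores_
labels, thr = labels_from_contamination(scores, contamination=0.05)
print(labels.sum())  # roughly 50 samples flagged
# With contamination=0 the cut would sit at the maximum score and nothing
# could ever be flagged, which is why the value 0 is disallowed.
```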

SaVoAMP commented 1 year ago

I'm very sorry, but I'm still confused. Would you mind explaining it to me?

I thought that contamination is used to define the threshold on the decision function, and that this threshold is then used to generate the binary outlier labels. Doesn't that automatically mean that the choice of contamination determines the number of anomalies to be found? I have also read in many tutorials on Isolation Forests that the performance of the algorithm depends very much on the choice of the contamination parameter, yet I could not find anything of the sort in the original paper.

Besides the training data, there are only two input parameters to the IForest algorithm: the subsampling size (which corresponds to the max_samples parameter of pyod, I guess) and the number of trees (probably n_estimators). The isolation trees are built from randomly selected sub-samples of the training data by recursively dividing them, each time picking a random attribute and a random split value, until either a node holds only one instance or all data at the node have the same values. At the end of the training process, a collection of trees - the forest - is returned. In the evaluation stage, test instances are then passed through the isolation trees to obtain an anomaly score for each instance. This anomaly score is calculated from the average path length over a number of trees, so when a forest of random trees collectively produces shorter path lengths for particular samples, those samples are probably anomalies.

In this context, however, I don't understand what you mean by

The anomaly score of an input sample is computed based on different detector algorithms.

in your explanation of decision_function(X). What are the "different detector algorithms" and why is the contamination then not used in running the detection algorithm?

Sorry, these are probably pretty stupid questions, but I'm quite confused right now.
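As a side note to the path-length description above: the paper normalizes the average path length E(h(x)) by c(n), the average path length of an unsuccessful binary-search-tree lookup for a subsample of size n, giving the score s(x, n) = 2^(-E(h(x)) / c(n)). A tiny numeric sketch (the path lengths used below are illustrative values only):

```python
import numpy as np

EULER_GAMMA = 0.5772156649  # Euler-Mascheroni constant, used in the paper's c(n)

def c(n):
    """Average path length of an unsuccessful BST search; normalizes the score."""
    if n <= 1:
        return 0.0
    return 2.0 * (np.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n

def anomaly_score(avg_path_length, subsample_size):
    """s(x, n) = 2 ** (-E(h(x)) / c(n)); scores near 1 indicate likely anomalies."""
    return 2.0 ** (-avg_path_length / c(subsample_size))

print(anomaly_score(avg_path_length=4.0, subsample_size=256))   # short path  -> higher score
print(anomaly_score(avg_path_length=12.0, subsample_size=256))  # longer path -> lower score
```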

Lucew commented 1 year ago

If I may also share some thoughts on this:

IMHO it is important to distinguish between training, scoring, and decision in the case of outlier detectors and this case, especially for isolation forests.

  1. In the first stage of training the classifier learns some parameters from the given data to fit its internal model.

  2. In the second stage, the classifier uses these parameters to compute an outlier score (could be a probability but could also be depth in the tree as in this case).

  3. In the third stage, the classifier needs to make a binary decision about whether something is an outlier or not. In most cases, this is done using a threshold after which a sample with a corresponding score higher/lower than the threshold is termed an outlier.

In most algorithms in this package, this threshold for the last stage is automatically generated from the training data as well, using the contamination parameter. Essentially, this is done by setting the threshold to a value such that (number of samples) * contamination samples become outliers.

In the case of isolation forests, setting the contamination does not change the forest learned during training (stage one) at all, nor does it change the scoring (stage two). It only affects the threshold for outlier prediction. One can also infer this threshold with the same tactic on new data, or tell the algorithm an absolute value it should use to consider a sample an outlier.

You can get the score of stage 2 using the decision_function and then do thresholding yourself. If I'm not mistaken this is also the varying parameter along which the ROC-AUC is computed for Isolation-Forests.

The definition of the contamination being in (0, 0.5] is in line with the sklearn documentation of isolation forests, which PyOD is using under the hood.

Setting this contamination level to zero would not make sense if a user also wants to use the model for inference, since the decision (stage 3) would then never detect outliers because the threshold would be infinite.

TL;DR: Contamination affects neither the training nor the scoring of Isolation Forests. It only affects the binary decision, which can also be made with a user-defined threshold.
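To make the TL;DR concrete, here is a hedged sketch of doing the thresholding yourself and bypassing contamination entirely (the data and the chosen percentile are illustrative):

```python
import numpy as np
from pyod.models.iforest import IForest

rng = np.random.RandomState(1)
X_train = rng.randn(1000, 2)                          # anomaly-free training set
X_new = np.vstack([rng.randn(50, 2), [[6.0, 6.0]]])   # new data with one planted outlier

clf = IForest(random_state=0)  # contamination left at its default; it is not used below
clf.fit(X_train)

scores = clf.decision_function(X_new)                    # stage 2: raw outlier scores
my_threshold = np.percentile(clf.decision_scores_, 99)   # or any domain-driven cut-off
my_labels = (scores > my_threshold).astype(int)          # stage 3, done by hand
print(my_labels.sum())
```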

SaVoAMP commented 1 year ago

Hey, thank you very much first of all for the answer!

In the case of isolation forests, setting the contamination does not change the forest learned during training (stage one) at all, nor does it change the scoring (stage two). It only affects the threshold for outlier prediction. One can also infer this threshold with the same tactic on new data, or tell the algorithm an absolute value it should use to consider a sample an outlier.

Now I understand what yzhao062 meant!

TL;DR: Contamination affects neither the training nor the scoring of Isolation Forests. It only affects the binary decision, which can also be made with a user-defined threshold.

All right. When the Isolation Forest authors report that their algorithm works even if the training set contains no anomalies at all, they don't mean that you don't need the contamination at all, right? The first two stages work without the contamination parameter, but for the binary decision of what to count as an outlier and what to count as a normal instance (in the 3rd stage), you need prior knowledge about the contamination? And this is where (if I have understood everything correctly now) the problem arises that the algorithm is so sensitive to this parameter?

Lucew commented 1 year ago

Hey, thank you very much first of all for the answer!

Glad I can help!

All right. When the Isolation Forest authors report that their algorithm works even if the training set contains no anomalies at all, they don't mean that you don't need the contamination at all, right?

At least that's how I think about it, yes. That is also in line with them only reporting AUC for their results.

The first two stages work without the contamination parameter, but for the binary decision of what to count as an outlier and what to count as a normal instance (in the 3rd stage), you need prior knowledge about the contamination?

Exactly! This is something I'm pretty sure about, as it can be verified by looking at the sklearn code (a very reliable package in terms of implementations). But keep in mind that you could also define your own threshold. This can even be some kind of adaptive threshold. E.g. you have samples coming in over time and you decide to label every k sequential samples as outliers if their combined score is higher than a threshold. You could even apply random noise in every dimension of a sample, check how stable the score is, and then apply the threshold, etc. Thresholding can be arbitrarily complex. There are even packages like PyThresh for that.
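As an illustration of the "k sequential samples" idea, here is a hypothetical sliding-window rule over a stream of scores (the function name, window length, and threshold value are made up for the example):

```python
import numpy as np

def flag_score_windows(scores, k=3, window_threshold=2.5):
    """Hypothetical adaptive rule: flag every run of k consecutive samples
    whose summed outlier score exceeds window_threshold."""
    scores = np.asarray(scores, dtype=float)
    flags = np.zeros(len(scores), dtype=bool)
    for start in range(len(scores) - k + 1):
        if scores[start:start + k].sum() > window_threshold:
            flags[start:start + k] = True
    return flags

# toy stream of decision_function outputs (values are illustrative only)
stream = [0.1, 0.2, 0.1, 0.9, 1.1, 1.0, 0.2, 0.1]
print(flag_score_windows(stream))  # the middle run of high scores gets flagged
```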

And this is where (if I have understood everything correctly now) the problem arises that the algorithm is so sensitive to this parameter?

I imagine so, yes. If you think of adjusting the threshold as walking along the ROC curve, there might be huge jumps along the curve.
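Relatedly, this is why the authors can report AUC without ever picking a contamination: the metric only needs the raw scores plus ground-truth labels that contain both classes. A small sketch (synthetic data, illustrative values):

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from pyod.models.iforest import IForest

rng = np.random.RandomState(2)
X_train = rng.randn(500, 2)                                       # normal-only training data
X_test = np.vstack([rng.randn(95, 2), rng.uniform(5, 7, (5, 2))])
y_test = np.array([0] * 95 + [1] * 5)                             # both classes present

clf = IForest(random_state=0)
clf.fit(X_train)
scores = clf.decision_function(X_test)
print(roc_auc_score(y_test, scores))  # threshold-free; contamination never enters
```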

SaVoAMP commented 1 year ago

Thank you very much, that helped me a lot! :)

samuel01028 commented 10 months ago

If I may also share some thoughts on this:

IMHO it is important to distinguish between training, scoring, and decision in the case of outlier detectors and this case, especially for isolation forests.

  1. In the first stage of training the classifier learns some parameters from the given data to fit its internal model.
  2. In the second stage, the classifier uses these parameters to compute an outlier score (could be a probability but could also be depth in the tree as in this case).
  3. In the third stage, the classifier needs to make a binary decision about whether something is an outlier or not. In most cases, this is done using a threshold after which a sample with a corresponding score higher/lower than the threshold is termed an outlier.

In most algorithms in this package, this threshold for the last stage is automatically generated from the training data as well, using the contamination parameter. Essentially, this is done by setting the threshold to a value such that (number of samples) * contamination samples become outliers.

In the case of isolation forests, setting the contamination does not change the forest learned during training (stage one) at all, nor does it change the scoring (stage two). It only affects the threshold for outlier prediction. One can also infer this threshold with the same tactic on new data, or tell the algorithm an absolute value it should use to consider a sample an outlier.

You can get the score of stage 2 using the decision_function and then do thresholding yourself. If I'm not mistaken this is also the varying parameter along which the ROC-AUC is computed for Isolation-Forests.

The definition of the contamination being in (0, 0.5] is in line with the sklearn documentation of isolation forests, which PyOD is using under the hood.

Setting this contamination level to zero would not make sense if a user also wants to use the model for inference, since the decision (stage 3) would then never detect outliers because the threshold would be infinite.

TL;DR: Contamination affects neither the training nor the scoring of Isolation Forests. It only affects the binary decision, which can also be made with a user-defined threshold.

However, I tested this and found that different values of contamination produced different scores from the decision function.

samuel01028 commented 10 months ago

Still confused; maybe it would work better with other algorithms.

yzhao062 commented 10 months ago

I believe that is due to the built-in randomness of iforest. It has nothing to do with contamination. The score changes every time.

Lucew commented 10 months ago

To answer this properly, it depends on what you see as your output and how you change the contamination value to acquire different output scores.

1) If you create a model multiple times with different contamination: Randomness is the source of your variations. The trees split the data at random locations every time. The theory says that an outlier will, with high probability, take fewer splits to be separated from the others and therefore end up higher in the tree (this path length is also the outlier score). But this is only a probabilistic statement; the actual path length will vary per training run.

2) If you see the binary output [outlier, not outlier] as your result: This is dependent on the contamination parameter, as I mentioned in the post you referenced. It looks at the distribution of path depths produced by the fitted forest and sets a threshold so that the ratio outliers / overall_points matches the value you defined with contamination.

You can also see this in the scikit-learn source code of the underlying algorithm, which is used here internally: "self.contamination" is only used after the trees have been computed, to create the binary output [outlier, not outlier]. By the code itself (!), it is not part of the training.
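A quick way to check this empirically (a sketch; the data are synthetic and the exact label counts will vary): fit two models that differ only in contamination but share a random_state, then compare the raw scores and the binary predictions.

```python
import numpy as np
from pyod.models.iforest import IForest

rng = np.random.RandomState(0)
X = rng.randn(500, 2)  # synthetic, mostly "normal" data

low = IForest(contamination=0.01, random_state=42)
high = IForest(contamination=0.20, random_state=42)
low.fit(X)
high.fit(X)

# raw outlier scores should match: contamination plays no role in fitting or scoring
print(np.allclose(low.decision_function(X), high.decision_function(X)))  # expected: True

# binary labels differ: contamination only moves the threshold
print(low.predict(X).sum(), high.predict(X).sum())  # roughly 5 vs. 100 flagged points
```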

samuel01028 commented 10 months ago

I believe that is due to the built-in randomness of iforest. It has nothing to do with contamination. The score changes every time.

For me, the iforest model was built with the same random_state each time, so it shouldn't have any randomness, right?

samuel01028 commented 10 months ago

To answer this properly, it depends on what you see as your output and how you change the contamination value to acquire different output scores.

  1. If you create a model multiple times with different contamination: Randomness is the source of your variations. The trees split the data at random locations every time. The theory says that an outlier will, with high probability, take fewer splits to be separated from the others and therefore end up higher in the tree (this path length is also the outlier score). But this is only a probabilistic statement; the actual path length will vary per training run.
  2. If you see the binary output [outlier, not outlier] as your result: This is dependent on the contamination parameter, as I mentioned in the post you referenced. It looks at the distribution of path depths produced by the fitted forest and sets a threshold so that the ratio outliers / overall_points matches the value you defined with contamination.

You can also see this in the scikit-learn source code of the underlying algorithm, which is used here internally: "self.contamination" is only used after the trees have been computed, to create the binary output [outlier, not outlier]. By the code itself (!), it is not part of the training.

Thank you so much for your reply. I used a consistent random_state for each model to eliminate the randomness of the trees, and with that, changing the contamination only changed the binary output [outlier, not outlier]. So this is a bit like supervised learning, in the sense that I have to know the outlier proportion in advance to get binary outlier labels.