xuhongzuo / DeepOD

Deep learning-based outlier/anomaly detection
https://deepod.readthedocs.io/
BSD 2-Clause "Simplified" License
434 stars 50 forks

Training process #51

Open Hu1-Li opened 8 months ago

Hu1-Li commented 8 months ago
  1. About the training process: why is there no validation dataset in DeepOD?

  2. For the decision function:

    clf = ...
    clf.fit(X_train)
    scores = clf.decision_function(X_test)

I then use roc_curve(y_test, scores) to get the best threshold and keep that threshold as a parameter for later use. Is this right?

xuhongzuo commented 8 months ago
  1. DeepOD does not support using a validation set for now. If you are interested, you can submit a pull request; I would be very happy to merge your contribution.

  2. A higher score indicates a higher likelihood of being an anomaly, but we do not have automatic threshold setting. The method you mention can give a threshold for the best F1, so it can be a solution. PyOD has an API for threshold setting; you can also have a look at PyOD.
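The threshold selection discussed above can be sketched as follows. This is only an illustration under stated assumptions, not part of DeepOD: the helper names `best_f1_threshold` and `youden_threshold` are hypothetical, the scores are synthetic stand-ins for the output of `clf.decision_function`, and the labels are assumed to come from a labeled held-out set. The calls `precision_recall_curve` and `roc_curve` are from scikit-learn.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, roc_curve


def best_f1_threshold(y_true, scores):
    """Return (threshold, f1) maximizing F1 for the rule `score >= threshold`."""
    precision, recall, thresholds = precision_recall_curve(y_true, scores)
    # precision and recall carry one extra trailing point (1, 0) with no threshold.
    precision, recall = precision[:-1], recall[:-1]
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom,
                   out=np.zeros_like(denom), where=denom > 0)
    best = int(np.argmax(f1))
    return float(thresholds[best]), float(f1[best])


def youden_threshold(y_true, scores):
    """Return the ROC threshold maximizing Youden's J = TPR - FPR."""
    fpr, tpr, thresholds = roc_curve(y_true, scores)
    best = int(np.argmax(tpr - fpr))
    return float(thresholds[best])


# Synthetic stand-in for `scores = clf.decision_function(X_val)` (assumption):
# 950 normal points with low scores, 50 anomalies with higher scores.
rng = np.random.default_rng(0)
y_val = np.r_[np.zeros(950), np.ones(50)].astype(int)
scores_val = np.r_[rng.normal(0.0, 1.0, 950), rng.normal(3.0, 1.0, 50)]

thr_f1, f1 = best_f1_threshold(y_val, scores_val)
thr_j = youden_threshold(y_val, scores_val)

# Apply the chosen threshold to turn scores into binary predictions.
y_pred = (scores_val >= thr_f1).astype(int)
```

If the threshold is tuned on the same labeled data that is later used to report metrics, those metrics will be optimistic, so a separate labeled split for choosing the threshold is the safer choice when one is available.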



asha24choudhary commented 6 months ago

Hello. I would like to know how the threshold is set. I can see what the threshold is, but I want to understand the math behind how it is selected, for instance when using TranAD. Also, for training TranAD, what is the best way to format time series data? Should time be a separate column, should it be the index, or can we just pass an array of values and ignore the timing information? Finally, is some kind of preprocessing, such as min-max scaling, needed before training?
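The thread does not state whether DeepOD models require scaling, so the following is only a sketch of one common convention for the min-max preprocessing mentioned in the question: compute the per-feature minimum and range on the training data alone, then reuse them for the test data. The helper names `fit_min_max` and `apply_min_max` and the toy arrays are hypothetical.

```python
import numpy as np


def fit_min_max(X_train):
    """Compute per-feature minimum and range from the training data only."""
    lo = X_train.min(axis=0)
    hi = X_train.max(axis=0)
    # Use a range of 1.0 for constant features to avoid division by zero.
    span = np.where(hi > lo, hi - lo, 1.0)
    return lo, span


def apply_min_max(X, lo, span):
    """Scale X with statistics fitted on the training data."""
    return (X - lo) / span


# Toy data with shape (n_timestamps, n_features); values are made up.
X_train = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]])
X_test = np.array([[2.5, 40.0]])

lo, span = fit_min_max(X_train)
X_train_scaled = apply_min_max(X_train, lo, span)
X_test_scaled = apply_min_max(X_test, lo, span)
```

Test values outside the training range map outside [0, 1], as the second feature of `X_test` does here; whether to clip them is a separate design choice.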