yahoo / egads

A Java package to automatically detect anomalies in large scale time-series data

Usage question: Working with streaming data sets #53

Closed by sheldonkreger 5 years ago

sheldonkreger commented 6 years ago

I'm considering this library for a project where I periodically fetch metric data and want to test each new data point against previous values to see if the latest value is an anomaly.

However, the configuration files and tests seem to use static data sets (CSV files). The entire data set is loaded upfront, and then the analysis is performed. For example: https://github.com/yahoo/egads/blob/master/src/test/java/com/yahoo/egads/TestOlympicModel.java#L85

On the other hand, it appears that a TimeSeries object can be instantiated and values added without using a CSV. https://github.com/yahoo/egads/blob/master/src/test/java/com/yahoo/egads/TestTimeSeries.java#L23

I'm wondering whether it makes sense to instantiate a TimeSeries object and then rebuild the model and rerun the analysis each time I add a new data point, or whether there is a preferred method for handling streaming data.

Thank you.

sheldonkreger commented 5 years ago

For those curious, I ended up rebuilding the time-series model (TSM) and anomaly detection model (ADM) each time a new data point arrives. It is also possible to update them in place, but that raises questions such as "How many new data points do I have, and which are new?" and "Should I simply append new data points, or remove old ones as well?" Furthermore, because the algorithms must run against every data point again whenever a new point is added, I don't think updating in place would offer a performance gain.
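
For anyone who wants to see the shape of the rebuild-on-every-point approach, here is a minimal sketch. It does not use the EGADS API: the class name `StreamingRebuildSketch`, the bounded window, and the mean/standard-deviation check are all hypothetical stand-ins for a real TSM and ADM. The point is only the control flow: keep a bounded history, rebuild the model from that history when a new value arrives, test the new value, then append it and drop the oldest one.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical illustration; not part of EGADS. */
class StreamingRebuildSketch {
    private final int maxWindow;      // how many past points to keep
    private final double threshold;   // allowed deviation, in standard deviations
    private final Deque<Double> window = new ArrayDeque<>();

    StreamingRebuildSketch(int maxWindow, double threshold) {
        this.maxWindow = maxWindow;
        this.threshold = threshold;
    }

    /**
     * Rebuilds a stand-in model (mean and standard deviation) from the
     * current window, tests the new value against it, then stores the value.
     * Returns true if the value looks anomalous.
     */
    boolean observe(double value) {
        boolean anomaly = false;
        if (window.size() >= 2) {
            // "Rebuild the model": recompute everything from the stored history.
            double mean = 0.0;
            for (double v : window) {
                mean += v;
            }
            mean /= window.size();

            double variance = 0.0;
            for (double v : window) {
                variance += (v - mean) * (v - mean);
            }
            variance /= window.size();
            double std = Math.sqrt(variance);

            // "Run the detector" on the newest point only.
            double deviation = Math.abs(value - mean);
            // If the history is perfectly flat, any change counts as an anomaly.
            anomaly = std > 0.0 ? deviation > threshold * std : deviation > 0.0;
        }

        // Append the new point and evict the oldest once the window is full.
        window.addLast(value);
        if (window.size() > maxWindow) {
            window.removeFirst();
        }
        return anomaly;
    }

    int size() {
        return window.size();
    }
}
```

Calling `observe(value)` once per fetched metric sidesteps the "which points are new?" question, because each call handles exactly one point. The fixed `maxWindow` is one possible answer to "append or remove old ones as well?", and whether it suits a given model (for example one that needs several full seasonal periods of history) is a separate decision.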