ydataai / ydata-profiling

1 Line of code data quality profiling & exploratory data analysis for Pandas and Spark DataFrames.
https://docs.profiling.ydata.ai
MIT License
12.51k stars 1.68k forks

Wrong stationarity alert in time series #1636

Open Blackandwhite23 opened 2 months ago

Blackandwhite23 commented 2 months ago

Current Behaviour

I made a report of a time series and then used the following code:

description = profile.get_description()

for col in df:
  var1 = description.variables.get(col)
  stat = var1.get('stationary')
  p = var1.get('addfuller')
  display("Column: " + col + " ; Stationary: " + str(stat) +" ; P: " + str(p))

I analysed a data set with some columns and get the following result:

Column: Column 1 ; Stationary: False ; P: 8.367848162951153e-15
Column: Column 2 ; Stationary: False ; P: 1.0170622187220445e-11
Column: Column 3 ; Stationary: False ; P: 2.555609761088582e-05
Column: Column 4 ; Stationary: False ; P: 7.172269761903138e-08
Column: Column 5 ; Stationary: False ; P: 9.321131415426812e-18
Column: Column 6 ; Stationary: False ; P: 9.027089348108759e-15
Column: Column 7 ; Stationary: False ; P: 0.02133819126759494
Column: Column 8 ; Stationary: False ; P: 4.406120572138344e-12
Column: Column 9 ; Stationary: False ; P: 0.0028888647417244155
Column: Column 10 ; Stationary: False ; P: 0.00044090523969600784
Column: Column 11 ; Stationary: False ; P: 0.00286260675205775
Column: Column 12 ; Stationary: False ; P: 0.0001708455587419074
Column: Column 13 ; Stationary: False ; P: 9.472249697294651e-30
Column: Column 14 ; Stationary: False ; P: 2.526552913384979e-12
Column: Column 15 ; Stationary: False ; P: 0.000455609981090904
Column: Column 16 ; Stationary: False ; P: 0.0004254554235795494
Column: Column 17 ; Stationary: None ; P: None
Column: Column 18 ; Stationary: None ; P: None
Column: Column 19 ; Stationary: False ; P: 1.2239118466383953e-16
Column: Column 20 ; Stationary: True ; P: 9.06748511005521e-29
Column: Column 21 ; Stationary: True ; P: 0.005396832629069178
Column: Column 22 ; Stationary: True ; P: 1.850847639853015e-11

Expected Behaviour

I would expect every column with a p-value < 0.05 to be marked as "stationary".
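As a minimal sketch of that expectation, the snippet below applies a plain p-value threshold to a few of the values reported above. The helper name expected_stationary and the ALPHA constant are hypothetical and only illustrate the rule I had in mind; they are not part of ydata-profiling.

```python
ALPHA = 0.05  # assumed significance level, taken from the expectation above


def expected_stationary(p_value, alpha=ALPHA):
    """Flag expected from the ADF p-value alone (None if no test was run)."""
    if p_value is None:
        return None
    return p_value < alpha


# A few p-values copied from the reported output.
reported = {
    "Column 1": 8.367848162951153e-15,
    "Column 7": 0.02133819126759494,
    "Column 17": None,
    "Column 20": 9.06748511005521e-29,
}

for col, p in reported.items():
    print(f"Column: {col} ; Expected stationary: {expected_stationary(p)} ; P: {p}")
```

Under this rule, every column with a reported p-value would be flagged as stationary, which is not what the report shows.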

Data Description

I used a private dataset

Code that reproduces the bug

description = profile.get_description()

for col in df:
  var1 = description.variables.get(col)
  stat = var1.get('stationary')
  p = var1.get('addfuller')
  display("Column: " + col + " ; Stationary: " + str(stat) +" ; P: " + str(p))

pandas-profiling version

v4.9.0

Dependencies

Package        Version
0  ydata_profiling         v4.9.0
1           pandas          2.1.4
2            numpy         1.26.4
3       matplotlib          3.7.1
4      statsmodels         0.14.2
5           Python        3.10.12
6               OS  Linux 6.1.85+

OS

Linux 6.1.85+


Blackandwhite23 commented 2 months ago

Here is some data in the attachment, if needed: ADF_Test.csv

# Load dataset to dataframe
import pandas as pd
filename = 'ADF_Test.csv'
data = pd.read_csv(filename, sep=',', decimal='.', parse_dates=["time"], index_col="time")
display(data)

import ydata_profiling

# Create a profile report
profile = ydata_profiling.ProfileReport(data, title="ADF Test", explorative=True, tsmode=True)
profile.to_notebook_iframe()
profile.to_file("ADF_Test.html")

description = profile.get_description()

for col in data:
  var1 = description.variables.get(col)
  stat = var1.get('stationary')
  p = var1.get('addfuller')
  display("Column: " + col + " ; Stationary: " + str(stat) +" ; P: " + str(p))

And then I get: Column: Col1 ; Stationary: False ; P: 8.367848162944989e-15

quant12345 commented 1 month ago

Hi @Blackandwhite23! I looked at the file describe_timeseries_pandas.py, specifically the function pandas_describe_timeseries_1d, which returns the stationary flag and the p-value. The function also checks for seasonality: if the series is seasonal, stationary is set to False (line 214):

stats["stationary"] = is_stationary and not stats["seasonal"]

My knowledge of statistics is modest. From everything I've seen and read, I know that the trend gets removed. Here it is written that stationary series have neither a trend nor seasonality. And there is also a discussion here.
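To illustrate how the line quoted above can give Stationary: False together with a tiny p-value, here is a minimal sketch of the combined rule. The function name combined_flag and the 0.05 threshold are assumptions for illustration, and the seasonal values passed in are hypothetical, since the issue does not report them; this is not the library's actual code.

```python
def combined_flag(p_value, seasonal, alpha=0.05):
    """Stationary only if the ADF test rejects a unit root AND no seasonality was detected."""
    is_stationary = p_value < alpha  # assumed threshold
    return is_stationary and not seasonal


# p-value reported for Col1, with a hypothetical seasonal=True
print(combined_flag(8.367848162944989e-15, seasonal=True))

# p-value reported for Column 20, with a hypothetical seasonal=False
print(combined_flag(9.06748511005521e-29, seasonal=False))
```

If that reading is right, checking var1.get('seasonal') for the affected columns should show whether the seasonality check is what flips the flag.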