ydataai / ydata-profiling

1 Line of code data quality profiling & exploratory data analysis for Pandas and Spark DataFrames.
https://docs.profiling.ydata.ai
MIT License
Bug Report: While running tsmode=True, some of the time series variables are identified as real numbers #1292

Closed knwyne20 closed 1 year ago

knwyne20 commented 1 year ago

Current Behaviour

I am running a dataset through pandas profiling with tsmode=True, but some of the time-dependent variables are showing up as real numbers in the report. Under the "Variables" tab, I see histograms for these variables instead of line graphs. Also, because these variables are automatically identified as real numbers, I don't get their autocorrelation graphs with the ACF and PACF information. According to this:

"To enable a time-series report to be generated ts_mode needs to be set to “True”. If “True” the variables that have temporal dependence will be automatically identified based on the presence of autocorrelation."

Is there any way I can tell pandas profiling which variables are time dependent, rather than having it identify them automatically?

Expected Behaviour

I would need pandas profiling to identify all the time-dependent variables as time series, not as real numbers.

Data Description

My dataset is not publicly available.

Code that reproduces the bug

No response

pandas-profiling version

ydata profiling 4.1.0

Dependencies

pandas==1.5.3
numpy==1.23.5

OS

windows 10

knwyne20 commented 1 year ago

Hi! I was looking for an update on this.

fabclmnt commented 1 year ago

Hi @knwyne20,

At the moment there is no way to tell the profiler which variables are time dependent. The automation decides based on each variable's autocorrelation level: if it is below a certain threshold, the variable is considered numerical.

Unfortunately this logic is causing some variables to be misidentified in your case.

I've updated this issue as a feature request.
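
To make the threshold idea above concrete, here is a minimal sketch of an autocorrelation-based check. It is not the library's actual implementation: the helper name `looks_like_timeseries`, the single lag, and the 0.4 threshold are all assumptions chosen for illustration. It only shows why a column with weak autocorrelation would fall back to a plain numeric type.

```python
import numpy as np
import pandas as pd


def looks_like_timeseries(series, lag=1, threshold=0.4):
    """Hypothetical heuristic, not ydata-profiling's code.

    Flags a numeric column as time dependent when the absolute
    autocorrelation at the given lag reaches the threshold.
    """
    autocorr = series.autocorr(lag=lag)
    # A constant column has undefined autocorrelation (NaN): treat it as not a time series.
    if np.isnan(autocorr):
        return False
    return bool(abs(autocorr) >= threshold)


rng = np.random.default_rng(0)
steps = np.arange(1000)
df = pd.DataFrame(
    {
        "sin": np.sin(np.deg2rad(steps)),           # smooth signal, strongly autocorrelated
        "noise": rng.normal(size=steps.size),       # white noise, almost no autocorrelation
    }
)

flags = {col: looks_like_timeseries(df[col]) for col in df.columns}
print(flags)
```

A noisy but genuinely time-dependent variable can land below whatever threshold is used, which matches the misidentification described in this issue.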

alexbarros commented 1 year ago

@knwyne20, version 4.1.0 (PR https://github.com/ydataai/ydata-profiling/pull/1274) introduced a new feature that lets you define the data types manually, bypassing the type inference:

import numpy as np
import pandas as pd
from ydata_profiling import ProfileReport


def create_dataframe(size=1000, alt=False):
    # Build a small demo frame; `alt` is unused in this snippet.
    time_steps = np.arange(size)
    return pd.DataFrame(
        {
            "ascending_sequence": time_steps,
            # Sine and cosine of the step index (in degrees), rounded to 2 decimals.
            "sin": np.round(np.sin(np.deg2rad(time_steps)), 2),
            "cos": np.round(np.cos(np.deg2rad(time_steps)), 2),
            "cat": np.random.choice([0, 1, 2], size=size, replace=True),
        }
    )


df = create_dataframe()
prof = ProfileReport(
    df,
    tsmode=True,
    # type_schema sets the type of each listed column instead of inferring it.
    type_schema={
        "ascending_sequence": "categorical",
        "sin": "timeseries",
        "cos": "numeric",
        "cat": "numeric",
    },
)
prof.to_file("profile.html")
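
For a dataset with many time-dependent columns, the schema can be built from a list of names rather than typed by hand. The sketch below assumes a hypothetical helper, `build_type_schema`, which is not part of ydata-profiling; it only produces the dictionary that would be passed as `type_schema`.

```python
import pandas as pd


def build_type_schema(df, timeseries_columns):
    """Hypothetical helper: map each named column to the "timeseries" type."""
    missing = [col for col in timeseries_columns if col not in df.columns]
    if missing:
        # Fail early on typos instead of passing unknown names to the profiler.
        raise KeyError(f"Columns not found in DataFrame: {missing}")
    return {col: "timeseries" for col in timeseries_columns}


# Tiny stand-in frame; the real dataset would be used here.
df = pd.DataFrame(
    {
        "sin": [0.0, 0.02, 0.03],
        "cos": [1.0, 1.0, 1.0],
        "cat": [0, 2, 1],
    }
)

schema = build_type_schema(df, ["sin", "cos"])
print(schema)
```

The result would then be passed as `ProfileReport(df, tsmode=True, type_schema=schema)`. Columns left out of the schema are presumably still inferred automatically, which is worth verifying on the installed version.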

aquemy commented 1 year ago

Hi @knwyne20,

As the feature that you requested is already available (see https://github.com/ydataai/ydata-profiling/issues/1292#issuecomment-1544552358), I am closing this issue.

Feel free to re-open in case the solution is not satisfying.