ydataai / ydata-profiling

1 Line of code data quality profiling & exploratory data analysis for Pandas and Spark DataFrames.
https://docs.profiling.ydata.ai
MIT License

Correlations stopped working #1527

Open hakan-77 opened 5 months ago

hakan-77 commented 5 months ago

Current Behaviour

When creating a profile with default settings, correlations fail for some relatively simple data sets. I think this issue started with 4.6.3 and still occurs in 4.6.4. EDIT: I can confirm that downgrading to 4.6.2 solves the issue.

The warning emitted is:

/home/ubuntu/.local/lib/python3.10/site-packages/ydata_profiling/model/correlations.py:66: UserWarning: There was an attempt to calculate the auto correlation, but this failed.
To hide this warning, disable the calculation
(using `df.profile_report(correlations={"auto": {"calculate": False}})`
If this is problematic for your use case, please report this as an issue:
https://github.com/ydataai/ydata-profiling/issues
(include the error message: 'Function <code object pandas_auto_compute at 0x7f34bdca3260, file "/home/ubuntu/.local/lib/python3.10/site-packages/ydata_profiling/model/pandas/correlations_pandas.py", line 164>')

Expected Behaviour

Correlations work

Data Description

Standard boston data set

Code that reproduces the bug

import pandas as pd
df = pd.read_csv("boston.csv")

from ydata_profiling import ProfileReport

profile_report = ProfileReport(df, title="profile")
profile_report.to_file("test.html")

pandas-profiling version

v4.6.4

Dependencies

pandas==2.1.4

OS

Ubuntu 22

driscoll42 commented 5 months ago

I looked into this a bit because I was running into the same issue. It appears to be caused by the pandas upgrade from 2.0.3 to 2.1.x: ydata-profiling v4.6.4 works fine with pandas v2.0.3, but once you upgrade to pandas v2.1.x the auto correlation stops working. I don't know what in pandas causes the break, but downgrading to pandas 2.0.3 makes it work again.

hakan-77 commented 5 months ago

@driscoll42 good catch. I can confirm that 4.6.2 works because it pins pandas < 2.1. The PR below relaxed the pandas pin and thus broke correlations.

https://github.com/ydataai/ydata-profiling/pull/1512

@aquemy @ricardodcpereira any idea what could be wrong?

hakan-77 commented 4 months ago

@aquemy @ricardodcpereira is there anything I can help with?

SilasK commented 3 months ago

Could it be due to the newer pandas data types? There are now nullable data types for string, float, etc., with pandas.NA as the missing value.

I see many errors where the code attempts to convert a string to float:

include the error message: 'could not convert string to float: 'positive''

jtsekine commented 3 months ago

include the error message: 'could not convert string to float: 'positive''

Maybe after pandas 2.0 we need to pass numeric_only=True to pandas.DataFrame.corr(). From the pandas documentation: "Changed in version 2.0.0: The default value of numeric_only is now False."
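
To illustrate the suggestion, here is a minimal sketch on made-up data (the column names and values are hypothetical, not taken from the Boston data set). It assumes a pandas version in which DataFrame.corr() accepts numeric_only, and passes it explicitly so the result does not depend on the version-specific default:

```python
import pandas as pd

# Hypothetical toy data: two numeric columns and one string column.
df = pd.DataFrame(
    {
        "a": [1.0, 2.0, 3.0, 4.0],
        "b": [2.0, 4.0, 6.0, 8.0],
        "label": ["positive", "negative", "positive", "negative"],
    }
)

# Restrict the correlation to numeric columns explicitly, so the string
# column "label" is skipped instead of being converted to float.
corr = df.corr(numeric_only=True)
print(corr)
```

Only the numeric columns appear in the resulting matrix; the string column is left out rather than triggering a conversion error. Whether this is the right fix inside ydata-profiling itself is not confirmed in this thread.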

eamander commented 3 weeks ago

I believe this line will also have to be updated to this or its equivalent:

        method = (
            _pairwise_spearman
            if col_1_name not in categorical_columns and col_2_name not in categorical_columns
            else _pairwise_cramers
        )

Setting numeric_only=True and making the above change ensures the report renders with both categorical and numerical features; otherwise it throws a TypeError on categorical columns if they show up as col_1_name.