ydataai / ydata-profiling

1 Line of code data quality profiling & exploratory data analysis for Pandas and Spark DataFrames.
https://docs.profiling.ydata.ai
MIT License

Calculation of the cramers correlation fails when cross tabulation cannot be created. #417

Closed marisakamozz closed 4 years ago

marisakamozz commented 4 years ago

Description:

When a cross tabulation table cannot be created, calculation of the cramers correlation fails, and no cramers correlation coefficient is displayed at all, including those that can be calculated correctly.

To Reproduce:

import pandas as pd
import pandas_profiling

df = pd.DataFrame({'A': [1, 2, None, None], 'B': [None, None, 8, 9]})
df.profile_report()

Warning message:

/usr/local/Caskroom/miniconda/base/envs/pandas-profiling/lib/python3.7/site-packages/pandas_profiling/model/correlations.py:135: UserWarning: There was an attempt to calculate the cramers correlation, but this failed.
To hide this warning, disable the calculation
(using df.profile_report(correlations={"cramers": {"calculate": False}})
If this is problematic for your use case, please report this as an issue:
https://github.com/pandas-profiling/pandas-profiling/issues
(include the error message: 'No data; observed has size 0.')
  correlation_name=correlation_name, error=error

Version information:

python==3.7.7 pandas==0.25.3 pandas-profiling==2.5.0

Work around:

Fill the missing values with a placeholder value.

df = pd.DataFrame({'A': [1, 2, None, None], 'B': [None, None, 8, 9]})
df.fillna('NA', inplace=True)  # filling missing values.
df.profile_report()

sbrugman commented 4 years ago

Thanks for reporting this. Purely on the basis of the example you are giving, a warning would be expected the way I see it. There are no pairs of A and B that do not have missing values. Filling the columns with 'NA' values changes the semantics of the data. It might be that you encountered an example where this isn't the case. In that case, please let us know.

Providing a warning is not perfect; we should look for a way to improve the handling of this case.

marisakamozz commented 4 years ago

Warning messages aren't the only problem. An additional problem is that no cramers correlation coefficients will be output, including those that can be calculated correctly.

df = pd.DataFrame({
    'A': [1, 2, None, None],
    'B': [None, None, 8, 9],
    'C': [3, 4, None, None]
})

In the above example, the cramers correlation coefficient of A and C can be calculated, but even that will not be output.
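To make the difference between the pairs concrete, here is a minimal sketch that counts the rows where both columns are present. The helper name `complete_pairs` is made up for illustration and is not part of pandas-profiling; it relies only on `DataFrame.dropna`.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, None, None],
    "B": [None, None, 8, 9],
    "C": [3, 4, None, None],
})


def complete_pairs(frame, left, right):
    """Hypothetical helper: keep only rows where both columns are present."""
    return frame[[left, right]].dropna()


n_ab = len(complete_pairs(df, "A", "B"))  # no row has both A and B
n_ac = len(complete_pairs(df, "A", "C"))  # rows 0 and 1 have both A and C
print(n_ab, n_ac)  # prints: 0 2
```

A and B share no complete rows, so their cross tabulation is empty, while A and C share two complete rows and can be cross tabulated.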

sbrugman commented 4 years ago

There are multiple strategies to deal with missing values in correlations. For the Cramer's V corrected stat it currently drops the pairs of variables where there is at least one observation that is missing for both variables. This choice should be documented.

Moving forward, let's implement one or multiple other strategies (such as complete case analysis). At least one feature that is missing is the propagation of missing values in the correlation matrix.
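As an illustration of propagating missing values into the correlation matrix, the sketch below computes Cramér's V per pair on pairwise-complete observations and stores NaN where a pair cannot be evaluated, instead of failing for the whole matrix. This is not the pandas-profiling implementation: the function names are hypothetical, it uses pairwise deletion rather than complete case analysis, and it computes the plain (uncorrected) Cramér's V rather than the corrected statistic mentioned above.

```python
import math
from collections import Counter
from itertools import combinations


def cramers_v(xs, ys):
    """Uncorrected Cramér's V on pairwise-complete observations.

    Returns NaN when the statistic is undefined, for example when
    no row has both values present (assumption: None marks missing).
    """
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n == 0:
        return math.nan
    row_totals = Counter(x for x, _ in pairs)
    col_totals = Counter(y for _, y in pairs)
    k = min(len(row_totals), len(col_totals))
    if k < 2:
        return math.nan  # a single category gives a zero denominator
    observed = Counter(pairs)
    chi2 = 0.0
    for r, r_total in row_totals.items():
        for c, c_total in col_totals.items():
            expected = r_total * c_total / n
            chi2 += (observed.get((r, c), 0) - expected) ** 2 / expected
    return math.sqrt(chi2 / (n * (k - 1)))


def cramers_matrix(columns):
    """Build a symmetric matrix; pairs that cannot be computed stay NaN."""
    names = list(columns)
    matrix = {a: {b: math.nan for b in names} for a in names}
    for a in names:
        matrix[a][a] = cramers_v(columns[a], columns[a])
    for a, b in combinations(names, 2):
        matrix[a][b] = matrix[b][a] = cramers_v(columns[a], columns[b])
    return matrix


data = {
    "A": [1, 2, None, None],
    "B": [None, None, 8, 9],
    "C": [3, 4, None, None],
}
result = cramers_matrix(data)
print(result["A"]["C"])  # 1.0
print(result["A"]["B"])  # nan
```

With this approach the A/C coefficient is still reported, and only the pairs involving B against A or C are marked as missing.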

sbrugman commented 4 years ago

Update: in the next release, columns are no longer dropped when one correlation coefficient could not be calculated. Instead, it is included in the plot. Note that this does not fully resolve this issue yet.

github-actions[bot] commented 4 years ago

Stale issue

tugcekonuklar commented 3 years ago

I still have the same issue using pandas-profiling v3.0.0.

pandas_profiling/model/correlations.py:152: UserWarning: There was an attempt to calculate the cramers correlation, but this failed.
To hide this warning, disable the calculation
(using `df.profile_report(correlations={"cramers": {"calculate": False}})`
If this is problematic for your use case, please report this as an issue:
https://github.com/pandas-profiling/pandas-profiling/issues
(include the error message: 'No data; `observed` has size 0.')
  (include the error message: '{error}')"""

xiaoyangxuoo021 commented 2 years ago

Same issue here:

There was an attempt to generate the Count missing values diagrams, but this failed.
To hide this warning, disable the calculation
(using df.profile_report(missing_diagrams={"Count": False})
If this is problematic for your use case, please report this as an issue:
https://github.com/pandas-profiling/pandas-profiling/issues
(include the error message: 'The number of FixedLocator locations (7), usually from a call to set_ticks, does not match the number of ticklabels (60).')