ydataai / ydata-profiling

1 Line of code data quality profiling & exploratory data analysis for Pandas and Spark DataFrames.
https://docs.profiling.ydata.ai
MIT License

Incorporate Categorical Variables in Interactions #1087

Open LyndonFan opened 1 year ago

LyndonFan commented 1 year ago

Missing functionality

I often work with product data that is grouped and categorised in different ways. Having recently found this library, I would like to quickly see how these categories interact with each other.

I think this would be useful for datasets that can be grouped in different ways. For example, in the Titanic dataset, we can look at the gender of passengers and whether they survived. As another example, for a dataset of products, we may want to look at how the distribution of sales varies across product categories.
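To make the two cases above concrete, here is a minimal sketch in plain pandas of the tables such plots would be built from. The data frame and its column names (`sex`, `survived`, `fare`) are invented for illustration; this is not the real Titanic data and not part of the ydata-profiling API.

```python
import pandas as pd

# Toy data, invented for illustration only (not the real Titanic dataset).
df = pd.DataFrame({
    "sex": ["male", "female", "female", "male", "male", "female"],
    "survived": ["no", "yes", "yes", "no", "yes", "no"],
    "fare": [7.25, 71.28, 7.92, 8.05, 53.10, 8.46],
})

# Categorical x categorical: joint frequency table (the data behind a heatmap).
counts = pd.crosstab(df["sex"], df["survived"])

# Row-normalised version: share of each outcome within each category.
shares = pd.crosstab(df["sex"], df["survived"], normalize="index")

# Categorical x continuous: summary of a numeric column per category
# (the data behind a box plot or violin plot).
fare_by_sex = df.groupby("sex")["fare"].describe()

print(counts)
print(shares)
print(fare_by_sex)
```

The frequency table covers the "gender vs. survived" case, and the grouped summary covers the "sales per category" case.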

Proposed feature

In the interactions section of the report, we can allow categorical variables to be chosen alongside continuous variables, with new types of plots to match.

Alternatives considered

The correlations give some indication of how the categories overlap, but that is not always sufficient.

Additional context

I am happy to work on the implementation of this feature.

fabclmnt commented 1 year ago

Hi @LyndonFan, this is definitely something missing. We would be happy to have your contribution to the development of this feature.

aquemy commented 1 year ago

Hi @LyndonFan,

This would be very useful. However, note that with high-cardinality variables, it might be difficult to visualize a frequency heatmap. How would you solve this problem?
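For illustration, one common workaround (an assumption here, not something the library is known to do) is to keep only the most frequent levels and lump the rest into a single bucket before building the frequency table. The helper `collapse_rare`, the `top_n` parameter, and the toy columns are hypothetical names.

```python
import pandas as pd

def collapse_rare(series: pd.Series, top_n: int = 10, other: str = "Other") -> pd.Series:
    """Keep the top_n most frequent values and replace everything else with `other`.

    Missing values are also mapped to `other`, since they are never in the top list.
    """
    top = series.value_counts().nlargest(top_n).index
    return series.where(series.isin(top), other)

# Toy data, invented for illustration: five brands, three of them rare.
df = pd.DataFrame({
    "brand": ["a"] * 5 + ["b"] * 4 + ["c", "d", "e"],
    "region": ["north", "south"] * 6,
})

brand_small = collapse_rare(df["brand"], top_n=2)

# Frequency table with a bounded number of rows, suitable for a heatmap.
heatmap_counts = pd.crosstab(brand_small, df["region"])
print(heatmap_counts)
```

The trade-off is that detail inside the lumped bucket is lost, so the cut-off would probably need to be configurable.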

Would you consider something like Weight of Evidence (WoE) and Information Value (IV), for instance? The report could display only the categories with the lowest and highest WoE, along with the IV, since I suppose those are mostly what we are interested in.
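As a rough sketch of what that could look like for a binary target: WoE for a category is taken here as the log of its share of events over its share of non-events, and IV as the sum of (share of events minus share of non-events) times WoE. Sign conventions differ between references, and the additive smoothing term `eps` is an assumption added to avoid dividing by zero. The function name `woe_iv` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def woe_iv(feature: pd.Series, target: pd.Series, eps: float = 0.5):
    """Return a per-category WoE table and the total IV for a 0/1 target.

    `eps` is added to every cell count (an assumed smoothing choice) so that
    categories with no events or no non-events do not produce infinities.
    """
    counts = pd.crosstab(feature, target).reindex(columns=[0, 1], fill_value=0)
    events = counts[1] + eps
    non_events = counts[0] + eps
    dist_events = events / events.sum()
    dist_non_events = non_events / non_events.sum()
    woe = np.log(dist_events / dist_non_events)
    iv = float(((dist_events - dist_non_events) * woe).sum())
    table = pd.DataFrame(
        {"events": counts[1], "non_events": counts[0], "woe": woe}
    ).sort_values("woe")
    return table, iv

# Toy data, invented for illustration only.
sex = pd.Series(["male", "female", "female", "male", "male", "female"], name="sex")
survived = pd.Series([0, 1, 1, 0, 1, 0], name="survived")

table, iv = woe_iv(sex, survived)
print(table)  # sorted by WoE, so the first and last rows are the extremes
print(iv)
```

Because the table is sorted by WoE, showing only its first and last few rows would give the "low and high" view even for high-cardinality variables.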