ydataai / ydata-profiling

1 Line of code data quality profiling & exploratory data analysis for Pandas and Spark DataFrames.
https://docs.profiling.ydata.ai
MIT License

bivar category data heatmap #344

Open cqcn1991 opened 4 years ago

cqcn1991 commented 4 years ago

Missing functionality

When I analyze the relationship between categorical features, I always need to know what percentage of each category of one feature falls into each category of another.

Proposed feature

From https://nbviewer.jupyter.org/github/kaveio/phik/blob/master/python/phik/notebooks/phik_tutorial_basic.ipynb

I think a heatmap for categorical data would be very helpful and not very difficult to integrate into pandas profiling. Something like this:

image

One thing to note: the number of categories shown per feature should be capped, otherwise the heatmap would become extremely long.
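
To make the request concrete, here is a minimal sketch of the table such a heatmap could be drawn from. It is only an illustration, not something that exists in pandas profiling: `capped_crosstab`, `max_categories`, and `other_label` are hypothetical names, and lumping the rare categories into a single "Other" bucket is just one possible way to cap the size.

```python
import pandas as pd


def capped_crosstab(df, row, col, max_categories=10, other_label="Other"):
    """Row-normalized contingency table; rare categories are lumped together.

    Hypothetical helper for illustration only. Assumes object/string columns
    (a pandas Categorical column would need the new label added first).
    """
    def cap(series):
        # Keep the most frequent categories, replace everything else
        # (including missing values) with a single catch-all label.
        top = series.value_counts().index[:max_categories]
        return series.where(series.isin(top), other_label)

    # normalize="index" turns counts into per-row fractions.
    return pd.crosstab(cap(df[row]), cap(df[col]), normalize="index")


df = pd.DataFrame({
    "color": ["red", "red", "blue", "blue", "green", "red"],
    "size": ["S", "M", "S", "S", "L", "M"],
})
table = capped_crosstab(df, "color", "size", max_categories=2)
print(table)
```

Each row of `table` sums to 1, so a cell reads as "fraction of this row category that falls into this column category". The result can be passed to any heatmap plotting function, for example `seaborn.heatmap(table, annot=True)`.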

sbrugman commented 4 years ago

This is an excellent suggestion. In fact, the next diagram in the notebook, the outlier significance, is even more informative. The pairwise correlations in the diagram are not necessarily significant.

The values displayed in the matrix are the significances of the outlier frequencies, i.e., a large value means that the measured frequency for that bin is significantly different from the expected frequency in that bin.
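
As a rough illustration of the "observed versus expected frequency" idea, the sketch below computes Pearson residuals for a contingency table. This is a simplified stand-in and not the outlier significance calculation used in the phik notebook; `pearson_residuals` is a hypothetical helper, and the expected counts assume the two features are independent.

```python
import numpy as np
import pandas as pd


def pearson_residuals(df, row, col):
    """(observed - expected) / sqrt(expected) per cell.

    Hypothetical helper; a simplified stand-in, not phik's method.
    """
    observed = pd.crosstab(df[row], df[col])
    n = observed.to_numpy().sum()
    # Expected counts if the two features were independent:
    # row total * column total / grand total.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    return (observed - expected) / np.sqrt(expected)


df = pd.DataFrame({
    "a": ["x", "x", "x", "y", "y", "y"],
    "b": ["p", "p", "p", "q", "q", "q"],
})
residuals = pearson_residuals(df, "a", "b")
print(residuals)
```

Cells with a large positive value occur more often than independence would predict, and cells with a large negative value occur less often.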

We are planning to integrate both in the next release.

sbrugman commented 4 years ago

Pending the https://github.com/dylan-profiler/heatmaps release.