ydataai / ydata-profiling

1 Line of code data quality profiling & exploratory data analysis for Pandas and Spark DataFrames.
https://docs.profiling.ydata.ai
MIT License

Feature : Correlations #847

Open chanedwin opened 2 years ago

chanedwin commented 2 years ago

Overview : Spark Development Strategy

Branch : spark-branch

Feature :

Three types of correlations (Cramér's V, Kendall's tau and Phi-K) are implemented in pandas-profiling but not in spark-profiling. We need to implement them in Spark in an optimised manner.
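To make the target concrete, here is a minimal single-machine sketch of Cramér's V computed from a table of pair counts. It is the plain textbook statistic with no bias correction, so it may differ from the variant pandas-profiling actually reports. The function name `cramers_v` and the `{(a, b): count}` input shape are assumptions for illustration. In Spark, such pair counts could plausibly come from something like `df.groupBy(col_a, col_b).count()`, leaving only this small final step on the driver.

```python
import math
from collections import Counter


def cramers_v(pair_counts):
    """Plain (uncorrected) Cramér's V from a {(a, b): count} contingency table.

    Hypothetical helper for illustration; not the project's implementation.
    """
    n = 0
    row_totals = Counter()
    col_totals = Counter()
    for (a, b), count in pair_counts.items():
        if count > 0:  # ignore empty cells so totals stay positive
            n += count
            row_totals[a] += count
            col_totals[b] += count

    k = min(len(row_totals), len(col_totals))
    if n == 0 or k < 2:
        return float("nan")  # undefined for a constant column

    # Pearson chi-squared over every cell, including unobserved ones.
    chi2 = 0.0
    for a, row_total in row_totals.items():
        for b, col_total in col_totals.items():
            expected = row_total * col_total / n
            observed = pair_counts.get((a, b), 0)
            chi2 += (observed - expected) ** 2 / expected

    return math.sqrt(chi2 / (n * (k - 1)))


# Perfectly associated columns give 1.0, independent ones give 0.0.
associated = Counter([("a", "x"), ("a", "x"), ("b", "y"), ("b", "y")])
independent = Counter([("a", "x"), ("a", "y"), ("b", "x"), ("b", "y")])
print(cramers_v(associated), cramers_v(independent))
```

The point of this shape is that only the counting has to touch the full dataset; the table itself is small whenever the columns have modest cardinality.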

Tips to Get Started :

chanedwin commented 2 years ago

phik - done in https://github.com/pandas-profiling/pandas-profiling/commit/b3b41cc0d127ac3dac3480cd94a55f9556b671dc

rishabsinghh commented 2 years ago

@chanedwin Hi, I would like to get started on this issue. Could you point me to some relevant code?

chanedwin commented 2 years ago

Hi @rishabsinghh! Sure! I'll update this post with more details in a bit. You can also DM me on the pandas-profiling Slack if you have any further questions!

rishabsinghh commented 2 years ago

Sure, I'll be waiting. I'm not on the pandas-profiling Slack, though. How can I reach you there?

chanedwin commented 2 years ago

Code : take a look at this! https://github.com/chanedwin/pandas-profiling/blob/d9ee4a8a589e075cfced9fc71ca500a20e2a3e73/src/pandas_profiling/model/correlations.py#L140

This was my original implementation using vectorized pandas UDFs for Kendall and Cramér's V. I think we should do this in native Spark if possible, because we should see significant speed improvements, although that might not be trivial. We can continue the discussion on Slack!

You can join the slack here!
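As a possible correctness baseline for any Spark port of Kendall's correlation, here is a naive single-machine sketch. It assumes the tau-b variant is the target and counts concordant, discordant and tied pairs directly. It is O(n²) and is not the code linked above; `kendall_tau_b` is a hypothetical name used only for illustration.

```python
import math


def kendall_tau_b(xs, ys):
    """Naive O(n^2) Kendall's tau-b; a reference sketch, not a Spark implementation."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")

    concordant = discordant = ties_x_only = ties_y_only = 0
    n = len(xs)
    for i in range(n):
        for j in range(i + 1, n):
            dx = xs[i] - xs[j]
            dy = ys[i] - ys[j]
            if dx == 0 and dy == 0:
                continue  # tied in both variables: counted in neither tie total
            if dx == 0:
                ties_x_only += 1
            elif dy == 0:
                ties_y_only += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1

    untied = concordant + discordant
    denom = math.sqrt((untied + ties_x_only) * (untied + ties_y_only))
    if denom == 0:
        return float("nan")  # undefined when either variable is constant
    return (concordant - discordant) / denom


# Identical rankings give 1.0, fully reversed rankings give -1.0.
print(kendall_tau_b([1, 2, 3, 4], [1, 2, 3, 4]))
print(kendall_tau_b([1, 2, 3, 4], [4, 3, 2, 1]))
```

Comparing a distributed version against a baseline like this on small samples is one way to check that a native rewrite still computes the same statistic.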