Open chanedwin opened 2 years ago
@chanedwin Hi, I would like to get started on this bug. Could you guide me with some code?
Hi @rishabsinghh! Sure, I'll update this post with more details in a bit. You can also DM me on the pandas-profiling (PP) Slack if you have any further questions!
Sure, I'll be waiting. I'm not sure what the PP Slack is, though. How can I reach you through it?
Code: take a look at this! https://github.com/chanedwin/pandas-profiling/blob/d9ee4a8a589e075cfced9fc71ca500a20e2a3e73/src/pandas_profiling/model/correlations.py#L140
This was my original implementation, which uses vectorized pandas UDFs for Kendall and Cramer's V. I think we should do this in native Spark if possible, since that should give significant speed improvements (although it might not be trivial). We can continue the discussion on Slack!
You can join the Slack here!
Overview : Spark Development Strategy
Branch : spark-branch
Feature :
Three types of correlation (Cramer's V, Kendall's tau and Phi-K) are implemented in pandas-profiling but not yet in spark-profiling. We need to implement them in Spark in an optimised manner.
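For Cramer's V, one possible native-Spark route is to let Spark do only the `groupBy(...).count()` over the two columns and finish the statistic on the driver from the resulting contingency counts. The sketch below is an assumption, not the existing pandas-profiling code: the function names `cramers_v_from_counts` and `spark_cramers_v` are hypothetical, it computes the plain (uncorrected) Cramer's V, and it assumes the number of distinct value pairs is small enough to collect to the driver.

```python
import math
from collections import defaultdict


def cramers_v_from_counts(cell_counts):
    """Uncorrected Cramer's V from (value_a, value_b, count) triples.

    Hypothetical helper: the triples are what a Spark
    groupBy(col_a, col_b).count() would yield after collect().
    """
    observed = defaultdict(int)
    row_totals = defaultdict(int)
    col_totals = defaultdict(int)
    for a, b, n in cell_counts:
        if n <= 0:
            continue  # ignore empty cells so no category has a zero total
        observed[(a, b)] += n
        row_totals[a] += n
        col_totals[b] += n

    total = sum(row_totals.values())
    k = min(len(row_totals), len(col_totals)) - 1
    if total == 0 or k < 1:
        # Undefined when a column has fewer than two distinct values.
        return float("nan")

    # Pearson chi-squared over the full table, including unobserved cells.
    chi2 = 0.0
    for a, row_n in row_totals.items():
        for b, col_n in col_totals.items():
            expected = row_n * col_n / total
            chi2 += (observed.get((a, b), 0) - expected) ** 2 / expected

    return math.sqrt(chi2 / (total * k))


def spark_cramers_v(df, col_a, col_b):
    """Aggregate in Spark, finish on the driver.

    Assumes `df` is a pyspark.sql.DataFrame and both columns are categorical.
    """
    rows = df.groupBy(col_a, col_b).count().collect()
    return cramers_v_from_counts((r[col_a], r[col_b], r["count"]) for r in rows)
```

With this split, only the pairwise counts leave the cluster, so the driver-side arithmetic stays cheap as long as the columns have modest cardinality. Kendall's tau and Phi-K would need a different approach and are not covered by this sketch.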
Tips to Get Started :