Open danaack opened 1 year ago
Hi, I second the experience of issue poster @danaack here.
I think there could be a quick fix: change line 31 in src/ydata_profiling/model/spark/describe_supported_spark.py
From: summary["p_unique"] = n_unique / count
To: summary["p_unique"] = n_unique / count if count > 0 else 0.0
Or something along those lines; I am just not sure what type the 'p_unique' key in the summary dict is expected to have. Since this looks like a relatively easy fix, do you know how quickly something like this could be changed, @azory-ydata @fabclmnt?
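For clarity, here is a small self-contained sketch of that change; the names n_unique, count, and summary mirror the snippet above, and the 0.0 fallback value for an all-null column is my assumption:

```python
# Minimal illustration of the proposed guard, using made-up values for an
# entirely null column (the non-null count is 0, and so is the unique count).
n_unique, count = 0, 0

summary = {}
# The current line 31 would raise ZeroDivisionError here:
# summary["p_unique"] = n_unique / count
# Proposed change (falling back to 0.0 is an assumption on my part):
summary["p_unique"] = n_unique / count if count > 0 else 0.0
print(summary)  # {'p_unique': 0.0}
```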
Hi @daanlute93,
we have a release process, and we can integrate this issue into our June release. Nevertheless, it could go into next week's release if you are willing to contribute the fix yourself.
Does that sound good?
Hi @fabclmnt, I would like to contribute the fix to help out, but contributing to an open source repository is quite new to me, so I am not sure how to set up my contribution. If there is a guide / how-to, I would gladly follow it. Can you help me out with this? If not, I will have to wait for the June release.
Current Behaviour
When attempting to profile a Spark dataframe that contains an entirely null column, the process errors.
When the null column is of type integer, the error message is
KeyError: '50%'
as thrown by ydata_profiling/model/spark/describe_numeric_spark.py:102, in describe_numeric_1d_spark(config, df, summary).
When the null column is a string, the error message is
ZeroDivisionError: division by zero
as thrown by ydata_profiling/model/spark/describe_supported_spark.py:31, in describe_supported_spark(config, series, summary).
Expected Behaviour
A profile should be produced for the Spark dataframe even with null value columns. The profiler works as expected for the same data when passed as a Pandas dataframe.
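For reference, a small sketch of the pandas path that the report above describes as working; this is my reconstruction with placeholder column names, not the original code:

```python
import pandas as pd
from ydata_profiling import ProfileReport

# Same shape of data as in the Spark case: one normal column, one entirely null column.
pdf = pd.DataFrame({"a": [1, 2, 3], "all_null": [None, None, None]})

# According to the report, the pandas path completes without error.
ProfileReport(pdf, title="pandas profile").to_file("pandas_profile.html")
```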
Data Description
Any Spark dataframe with an entirely null column:
Code that reproduces the bug
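The original snippet was not preserved here; below is a minimal sketch of a reproduction, assuming a local SparkSession, with column and app names being placeholders of my own choosing:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType, StringType, StructField, StructType
from ydata_profiling import ProfileReport

spark = SparkSession.builder.master("local[1]").appName("null-column-repro").getOrCreate()

schema = StructType(
    [
        StructField("a", IntegerType(), True),
        StructField("null_int", IntegerType(), True),  # entirely null integer column -> KeyError: '50%'
        StructField("null_str", StringType(), True),   # entirely null string column  -> ZeroDivisionError
    ]
)
sdf = spark.createDataFrame([(1, None, None), (2, None, None), (3, None, None)], schema=schema)

# Profiling the Spark dataframe raises the errors described above.
ProfileReport(sdf, title="spark profile").to_file("spark_profile.html")
```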
pandas-profiling version
v4.1.2
Dependencies
OS
No response
Checklist