ydataai / ydata-profiling

1 Line of code data quality profiling & exploratory data analysis for Pandas and Spark DataFrames.
https://docs.profiling.ydata.ai
MIT License

Bug Report: Null columns not supported for Spark dataframe #1305

Open danaack opened 1 year ago

danaack commented 1 year ago

Current Behaviour

When attempting to profile a Spark dataframe that contains an entirely null column, the process errors.

When the null column is of type integer, the error message is KeyError: '50%' as thrown by ydata_profiling/model/spark/describe_numeric_spark.py:102, in describe_numeric_1d_spark(config, df, summary).

When the null column is a string, the error message is ZeroDivisionError: division by zero as thrown by ydata_profiling/model/spark/describe_supported_spark.py:31, in describe_supported_spark(config, series, summary).
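A minimal sketch of how a KeyError: '50%' could arise, assuming (not confirmed in this report) that the numeric summary is built by zipping quantile labels with a list of quantile values, and that this list comes back empty for an all-null column. The helper name quantile_summary is hypothetical and is not part of ydata-profiling.

```python
from typing import Dict, Optional, Sequence


def quantile_summary(
    quantiles: Sequence[float], values: Sequence[float]
) -> Dict[str, Optional[float]]:
    """Map quantile levels (e.g. 0.5 -> '50%') to values.

    Hypothetical helper: if the number of values does not match the number
    of requested quantiles (for instance an empty result for an all-null
    column), every label is filled with None instead of being left out.
    """
    labels = [f"{q:.0%}" for q in quantiles]
    if len(values) != len(labels):
        return {label: None for label in labels}
    return dict(zip(labels, values))


quantiles = [0.05, 0.25, 0.5, 0.75, 0.95]

# Unguarded: zipping with an empty result silently produces an empty dict,
# so a later lookup of "50%" would raise KeyError.
unguarded = dict(zip([f"{q:.0%}" for q in quantiles], []))

# Guarded: the key exists and carries None for the all-null case.
guarded = quantile_summary(quantiles, [])
```

With a guard like this, downstream code that reads the "50%" entry would need to accept None for an all-null column.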

Expected Behaviour

A profile should be produced for the Spark dataframe even with null value columns. The profiler works as expected for the same data when passed as a Pandas dataframe.

Data Description

Any Spark dataframe with an entirely null column:

df.withColumn('empty1', lit(None).cast('string')).withColumn('empty2', lit(None).cast('integer'))

Code that reproduces the bug

# Follow the Spark Databricks example code: https://github.com/ydataai/ydata-profiling/blob/master/examples/integrations/databricks/ydata-profiling%20in%20Databricks.ipynb

# Add the following lines to df before running ProfileReport
from pyspark.sql.functions import lit

df = (
  df
  .withColumn('empty1', lit(None).cast('string'))
  .withColumn('empty2', lit(None).cast('integer'))
)

pandas-profiling version

v4.1.2

Dependencies

numpy==1.21.5
pandas==1.4.2
ydata-profiling==4.1.2

OS

No response

daanlute93 commented 1 year ago

Hi, I second the experience of issue poster @danaack here.

I think there will be a quick fix by changing line 31 in file: src/ydata_profiling/model/spark/describe_supported_spark.py

From: summary["p_unique"] = n_unique / count
To: summary["p_unique"] = n_unique / count if count > 0 else 0.0

Or something like that, but I am not sure what the specific type of the 'p_unique' key in the summary dict should be. Since this is a relatively easy fix, do you know how quickly a change like this could be released, @azory-ydata @fabclmnt?
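The guard proposed above can be sketched in isolation as follows. The function name unique_ratio is hypothetical, and returning the float 0.0 for an empty column is an assumption about what the 'p_unique' entry should hold, not something confirmed by the maintainers.

```python
def unique_ratio(n_unique: int, count: int) -> float:
    """Return the share of distinct values among the non-null values.

    Hypothetical helper: when count is 0 (an all-null column), return 0.0
    instead of dividing by zero.
    """
    return n_unique / count if count > 0 else 0.0


all_null = unique_ratio(0, 0)   # no non-null values, so the guard applies
regular = unique_ratio(3, 4)    # ordinary column: 3 distinct out of 4
```

Because true division in Python already yields a float, both branches return the same type.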

fabclmnt commented 1 year ago

Hi @daanlute93,

We have a release process, and we can include a fix for this issue in our June release. However, it could go into next week's release if you are willing to contribute the fix.

Does that sound good?

daanlute93 commented 1 year ago

Hi @fabclmnt, I would like to contribute the fix to help out, but contributing to an open source repository is quite new for me, so I would not know how to set up my contribution. If there is a guide or how-to, I would be happy to contribute. Can you help me out with this? If not, I will have to wait for the June release.