ydataai / ydata-profiling

1 Line of code data quality profiling & exploratory data analysis for Pandas and Spark DataFrames.
https://docs.profiling.ydata.ai
MIT License

Bug Report: Kurtosis at constant columns values #1323

Open pedro-tofani opened 1 year ago

pedro-tofani commented 1 year ago

Current Behaviour

I am trying to generate a report, but it throws an error.

    187 descriptive_statistics = Table(
    188     [
    189         {
    190             "name": "Standard deviation",
    191             "value": fmt_numeric(summary["std"], precision=config.report.precision),
    192         },
    193         {
    194             "name": "Coefficient of variation (CV)",
    195             "value": fmt_numeric(summary["cv"], precision=config.report.precision),
    196         },
    197         {
    198             "name": "Kurtosis",
--> 199             "value": fmt_numeric(
    200                 summary["kurtosis"], precision=config.report.precision
    201             ),
    202         },

File /opt/conda/lib/python3.10/site-packages/ydata_profiling/report/formatters.py:232, in fmt_numeric(value, precision)
    221 @list_args
    222 def fmt_numeric(value: float, precision: int = 10) -> str:
    223     """Format any numeric value.
    224 
    225     Args:
   (...)
    230         The numeric value with the given precision.
    231     """
--> 232     fmtted = f"{{:.{precision}g}}".format(value)
    233     for v in ["e+", "e-"]:
    234         if v in fmtted:

TypeError: unsupported format string passed to NoneType.__format__
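
The failure can be reproduced without Spark, since it comes down to applying a numeric format spec to None. The sketch below shows the unguarded expression from the traceback failing, next to a None-tolerant variant. The name `fmt_numeric_safe` and the `missing` placeholder are hypothetical and are not part of ydata-profiling; this is only one way the value could be guarded, not the library's fix.

```python
from typing import Optional


def fmt_numeric_safe(
    value: Optional[float], precision: int = 10, missing: str = "N/A"
) -> str:
    """Hypothetical None-tolerant variant of the formatting step in the traceback."""
    if value is None:
        # A null statistic arrives in Python as None; return a placeholder
        # instead of calling format() on it.
        return missing
    return f"{{:.{precision}g}}".format(value)


# The unguarded expression from the traceback raises TypeError on None.
try:
    f"{{:.{10}g}}".format(None)
    error_message = ""
except TypeError as exc:
    error_message = str(exc)

formatted_missing = fmt_numeric_safe(None)
formatted_number = fmt_numeric_safe(3.14159, precision=3)
```

With a guard of this kind, a missing statistic would be rendered as a placeholder string rather than aborting the whole report.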

I think this happens because the pyspark.sql.functions.kurtosis function returns None (null) for constant columns:

from pyspark.sql.functions import kurtosis

df.select(kurtosis(df.column_name)).show()
+---------------------+
|kurtosis(column_name)|
+---------------------+
|                 null|
+---------------------+
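
A null result for a constant column is plausible, because kurtosis divides by the squared second central moment, which is zero when every value is identical. The pure-Python sketch below uses the population excess-kurtosis formula and returns None in that case. That Spark computes exactly this formula is an assumption; the function name `excess_kurtosis` is illustrative only.

```python
from typing import Optional, Sequence


def excess_kurtosis(values: Sequence[float]) -> Optional[float]:
    """Population excess kurtosis; None when it is undefined."""
    n = len(values)
    if n == 0:
        return None
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values)  # sum of squared deviations
    m4 = sum((x - mean) ** 4 for x in values)  # sum of fourth-power deviations
    if m2 == 0:
        # Constant column: the ratio would be 0/0, so no value is defined.
        return None
    return n * m4 / (m2 * m2) - 3.0


constant_result = excess_kurtosis([5.0, 5.0, 5.0, 5.0])
varying_result = excess_kurtosis([1.0, 2.0, 3.0, 4.0])
```

Here `constant_result` is None, which matches the null shown above, while the non-constant input yields an ordinary float.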

Expected Behaviour

The report should be generated without errors.

Data Description

My data has two columns in which every value is the same (constant columns).

Code that reproduces the bug

from ydata_profiling import ProfileReport
report_df = ProfileReport(df)  # df is a PySpark DataFrame with constant columns

pandas-profiling version

4.1.2

Dependencies

pyspark==3.3.2

OS

Linux

fabclmnt commented 1 year ago

Hi @pedro-tofani,

thank you for your issue. Can you please provide more details?

Based on your dependencies, I'm assuming you are using ydata-profiling with PySpark DataFrames. Can you confirm?

We will double-check whether this is related to the kurtosis of constant variables, as you suggest, and let you know!

pedro-tofani commented 1 year ago

Hello @fabclmnt. Yes, I am using it with PySpark DataFrames. When I run ProfileReport on a DataFrame that doesn't have constant columns, it works. I searched the issues and found these topics:

fabclmnt commented 1 year ago

Thanks for confirming! I'll add this to the backlog.