ydataai / ydata-profiling

1 Line of code data quality profiling & exploratory data analysis for Pandas and Spark DataFrames.
https://docs.profiling.ydata.ai
MIT License

vars.num.quantiles does not behave as one would naturally expect #646

Open pedrosan opened 3 years ago

pedrosan commented 3 years ago

Describe the bug

Profiling breaks if the list passed to vars.num.quantiles does not include all five default values ([0.05, 0.25, 0.5, 0.75, 0.95]), because the report renderer looks those quantiles up by hard-coded keys: https://github.com/pandas-profiling/pandas-profiling/blob/develop/src/pandas_profiling/report/structure/variables/render_real.py#L126-L133

The documentation of vars.num.quantiles does not imply such a requirement:

vars.num.quantiles: "The quantiles to calculate. Note that .25, .5 and .75 are required for other metrics median and IQR."

Additional comments

  1. The widgets/HTML report does not show additional user-defined quantiles, although they are included in the JSON export. It would be desirable to show them in the widgets/HTML reports as well.
  2. I would suggest removing the requirement that the user pass 0.25, 0.5 and 0.75, and instead always computing them internally, so that median and IQR are available irrespective of user input.
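
The two suggestions above could look roughly like the following. This is only a sketch, not the library's code: `REQUIRED_QUANTILES`, `with_required_quantiles` and `quantile_rows` are hypothetical names, and the percent-style key format (`f"{q:.0%}"`) is an assumption inferred from the `"5%"`, `"25%"` and `"50%"` keys in the traceback.

```python
# Hypothetical helpers illustrating the two suggestions; not pandas-profiling API.

# Quantiles the documentation says are needed for median and IQR.
REQUIRED_QUANTILES = (0.25, 0.5, 0.75)


def with_required_quantiles(user_quantiles):
    """Return the sorted union of the user's quantiles and the required ones."""
    return sorted(set(user_quantiles) | set(REQUIRED_QUANTILES))


def quantile_rows(summary, quantiles):
    """Build one report row per configured quantile instead of fixed keys.

    Assumes summary keys are percent labels such as "5%" (an assumption).
    Quantiles that are missing from the summary are skipped rather than
    raising KeyError.
    """
    rows = []
    for q in quantiles:
        key = f"{q:.0%}"
        if key in summary:
            rows.append(
                {"name": f"{key} quantile", "value": summary[key], "fmt": "fmt_numeric"}
            )
    return rows


# The user asks only for the tails; the required quantiles are added internally.
quantiles = with_required_quantiles([0.01, 0.99])

# Toy summary values for illustration only.
summary = {"1%": 0.02, "25%": 0.24, "50%": 0.51, "75%": 0.77, "99%": 0.98}
for row in quantile_rows(summary, quantiles):
    print(row["name"], row["value"])
```

With this approach the rendered rows follow whatever quantiles are configured, so neither a missing default nor an extra user-defined quantile needs special handling in the renderer.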

Error

Running with vars={'num': {'quantiles': [0.01, 0.25, 0.5, 0.75, 0.99]}} produces:

~/venv2/lib64/python3.7/site-packages/pandas_profiling/report/structure/report.py in render_variables_section(dataframe_summary)
    136 
    137         # Per type template variables
--> 138         template_variables.update(type_to_func[summary["type"]](template_variables))
    139 
    140         # Ignore these

~/venv2/lib64/python3.7/site-packages/pandas_profiling/report/structure/variables/render_real.py in render_real(summary)
    125         [
    126             {"name": "Minimum", "value": summary["min"], "fmt": "fmt_numeric"},
--> 127             {"name": "5-th percentile", "value": summary["5%"], "fmt": "fmt_numeric"},
    128             {"name": "Q1", "value": summary["25%"], "fmt": "fmt_numeric"},
    129             {"name": "median", "value": summary["50%"], "fmt": "fmt_numeric"},

KeyError: '5%'
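
The mechanism behind this KeyError can be illustrated without the library. The sketch below assumes that summary keys are derived from the configured quantiles with a percent format such as `f"{q:.0%}"`, and that the renderer looks up the five default keys; both are inferred from the traceback and the default list above rather than confirmed. `quantile_key` and `build_summary` are hypothetical names.

```python
def quantile_key(q):
    """Format a quantile in [0, 1] as a percent label, e.g. 0.05 -> '5%'."""
    return f"{q:.0%}"


def build_summary(quantiles):
    """Stand-in for a per-column summary; only the keys matter here."""
    return {quantile_key(q): None for q in quantiles}


# Quantiles from the failing call in this issue.
user_quantiles = [0.01, 0.25, 0.5, 0.75, 0.99]
summary = build_summary(user_quantiles)

# Keys assumed to be hard-coded in the renderer (the five defaults).
hard_coded_keys = ["5%", "25%", "50%", "75%", "95%"]

# Any key listed here would raise KeyError when the renderer indexes the summary.
missing = [key for key in hard_coded_keys if key not in summary]
print(missing)
```

The first missing key is `"5%"`, which matches the exception raised at line 127 of render_real.py in the traceback.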

To Reproduce

Data: It fails on any float column

Code:

"""
Test for issue XXX:
https://github.com/pandas-profiling/pandas-profiling/issues/646
"""
import numpy as np
import pandas as pd
from pandas_profiling import ProfileReport

df_test = pd.DataFrame(
    np.random.rand(100, 5),
    columns=["a", "b", "c", "d", "e"]
)

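# Custom quantiles that omit the defaults 0.05 and 0.95
profile = ProfileReport(
    df_test,
    title="Pandas Profiling Report",
    minimal=True,
    vars={"num": {"quantiles": [0.01, 0.25, 0.5, 0.75, 0.99]}},
)

# Rendering the report raises KeyError: '5%'
profile.to_widgets()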

Version information:

* _Python version_: 3.7.9
* _Environment_: Jupyter Notebook

sbrugman commented 3 years ago

@pedrosan I agree that we could improve upon this behaviour following your suggestions. Code contributions are very much welcome.