ydataai / ydata-profiling

1 Line of code data quality profiling & exploratory data analysis for Pandas and Spark DataFrames.
https://docs.profiling.ydata.ai
MIT License

bin problem in histograms #478

Closed pvojnisek closed 4 years ago

pvojnisek commented 4 years ago

Describe the bug

I have a data table of 79 observations and 50 variables, and I generate the profiling report regularly. I used pandas-profiling 2.1 for quite a long time. Some variables take only the discrete values (0, 1, 2, 3, 4, 5). In version 2.1 the histogram looked like this: [image]

The same report in version 2.8 looks like this: [image]

Version 2.8 creates 10 bins, which is a poor fit for a variable with only six distinct values. I tried to set the number of bins manually in the YAML config file, but without success. I am not sure whether this is a bug or whether I have misunderstood the configuration and its parameters. Please help me solve this problem. Thanks a lot!
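To illustrate why 10 equal-width bins suit this kind of variable poorly, here is a minimal pure-Python sketch. It is not pandas-profiling code: the helper `equal_width_counts` and the sample counts are made up for illustration, and only the total of 79 observations and the value set 0 to 5 come from the report above.

```python
def equal_width_counts(values, bins):
    """Count values in `bins` equal-width bins spanning min(values)..max(values)."""
    lo, hi = min(values), max(values)
    if lo == hi:
        # Degenerate range: everything falls into the first bin.
        return [len(values)] + [0] * (bins - 1)
    counts = [0] * bins
    for v in values:
        # Scale the value onto 0..bins; the maximum lands in the last bin.
        index = min(int((v - lo) * bins / (hi - lo)), bins - 1)
        counts[index] += 1
    return counts


# Hypothetical sample: 79 observations taking only the values 0..5.
sample = [0] * 10 + [1] * 12 + [2] * 15 + [3] * 14 + [4] * 16 + [5] * 12

ten_bins = equal_width_counts(sample, 10)
six_bins = equal_width_counts(sample, 6)

print(ten_bins)  # [10, 0, 12, 0, 15, 0, 14, 0, 16, 12]
print(six_bins)  # [10, 12, 15, 14, 16, 12]
```

With 10 bins, four bins stay empty and the bars are unevenly spaced, while 6 bins give exactly one bar per distinct value.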

To Reproduce

We would need to reproduce your scenario before being able to resolve it.

Data:

 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 31  OK75_num  79 non-null     int64

The column takes only the values (0, 1, 2, 3, 4, 5).

Code: Preferably, use this code format:

from pandas_profiling import ProfileReport

profile = ProfileReport(df, config_file='profiler_settings.yml')
profile.to_file("profiler_report.html")

profiler_settings.yml:

pool_size: 7
plot:
    histogram:
        bins: 6
        bayesian_blocks_bins: no
    image_format: svg
    dpi: 800
    scatter_threshold: 1000
    correlation:
        cmap: RdBu
        bad: '#000000'
    missing:
        cmap: RdBu
        force_labels: yes
title: "Report title"
progress_bar: yes
variables:
    descriptions: {}
vars:
    num:
        quantiles:
        - 0.05
        - 0.25
        - 0.5
        - 0.75
        - 0.95
        skewness_threshold: 20
        low_categorical_threshold: 5
        chi_squared_threshold: 0
    cat:
        length: yes
        unicode: no
        cardinality_threshold: 50
        n_obs: 6
        chi_squared_threshold: 0
        coerce_str_to_date: no
    bool:
        n_obs: 3
    file:
        active: no
    image:
        active: no
        exif: yes
        hash: yes
sort: None
missing_diagrams:
    bar: no
    matrix: no
    heatmap: no
    dendrogram: no
correlations:
    pearson:
        calculate: no
        warn_high_correlations: yes
        threshold: 0.9
    spearman:
        calculate: no
        warn_high_correlations: no
    kendall:
        calculate: no
        warn_high_correlations: no
    phi_k:
        calculate: no
        warn_high_correlations: no
    cramers:
        calculate: no
        warn_high_correlations: yes
        threshold: 0.9
    recoded:
        calculate: no
        warn_high_correlations: no
        threshold: 0.0
interactions:
    targets: []
    continuous: yes
categorical_maximum_correlation_distinct: 6
n_obs_unique: 5
n_extreme_obs: 5
n_freq_table_max: 6
#n_freq_table_max: 10
memory_deep: no
duplicates:
    head: 10
samples:
    head: 5
    tail: 5
reject_variables: no
notebook:
    iframe:
        height: 800px
        width: 100%
        attribute: srcdoc
html:
    minify_html: yes
    use_local_assets: yes
    inline: yes
    navbar_show: yes
    file_name: None
    style:
        theme: None
        #theme: "flatly"
        logo: ''
        primary_color: '#337ab7'
        #primary_color: "#2c3e50"
        full_width: yes

Version information:

alabaster==0.7.12 altgraph==0.16.1 anaconda-client==1.7.2 anaconda-navigator==1.9.7 anaconda-project==0.8.2 appdirs==1.4.3 asn1crypto==0.24.0 astroid==2.2.5 astropy==4.0.1.post1 atomicwrites==1.3.0 attrs==19.3.0 Babel==2.6.0 backcall==0.1.0 backports.os==0.1.1 backports.shutil-get-terminal-size==1.0.0 beautifulsoup4==4.7.1 bitarray==0.8.3 bkcharts==0.2 bleach==3.1.0 bokeh==1.0.4 boto==2.49.0 Bottleneck==1.2.1 cached-property==1.5.1 certifi==2019.3.9 cffi==1.12.2 chardet==3.0.4 Click==7.0 cloudpickle==0.8.0 clyent==1.2.2 colorama==0.4.1 conda==4.6.11 conda-build==3.17.8 conda-verify==3.1.1 confuse==1.0.0 contextlib2==0.5.5 cryptography==2.6.1 cycler==0.10.0 Cython==0.29.6 cytoolz==0.9.0.1 dask==2.5.2 decorator==4.4.0 defusedxml==0.5.0 distributed==2.5.2 docutils==0.14 entrypoints==0.3 et-xmlfile==1.0.1 fastcache==1.0.2 filelock==3.0.10 Flask==1.0.2 funcsigs==1.0.2 future==0.17.1 gevent==1.4.0 glob2==0.6 gmpy2==2.0.8 greenlet==0.4.15 h5py==2.9.0 heapdict==1.0.0 html5lib==1.0.1 htmlmin==0.1.12 idna==2.8 ImageHash==4.1.0 imageio==2.5.0 imagesize==1.1.0 importlib-metadata==0.0.0 ipykernel==5.1.0 ipython==7.4.0 ipython-genutils==0.2.0 ipywidgets==7.5.1 isodate==0.6.0 isort==4.3.16 itsdangerous==1.1.0 jdcal==1.4 jedi==0.13.3 jeepney==0.4 Jinja2==2.11.2 joblib==0.15.1 jsonschema==3.0.1 jupyter==1.0.0 jupyter-client==5.2.4 jupyter-console==6.0.0 jupyter-core==4.4.0 jupyterlab==0.35.4 jupyterlab-server==0.2.0 keyring==18.0.0 kiwisolver==1.0.1 lazy-object-proxy==1.3.1 libarchive-c==2.8 lief==0.9.0 llvmlite==0.28.0 locket==0.2.0 lxml==4.3.2 MarkupSafe==1.1.1 matplotlib==3.2.1 mccabe==0.6.1 missingno==0.4.2 mistune==0.8.4 mkl-fft==1.0.10 mkl-random==1.0.2 modin==0.6.1 more-itertools==6.0.0 mpmath==1.1.0 msgpack==0.6.1 multipledispatch==0.6.0 navigator-updater==0.2.1 nbconvert==5.4.1 nbformat==4.4.0 networkx==2.4 nltk==3.4 nose==1.3.7 notebook==5.7.8 numba==0.43.1 numexpr==2.6.9 numpy==1.16.2 numpydoc==0.8.0 olefile==0.46 openpyxl==2.6.1
packaging==19.0 pandas==1.0.3 pandas-profiling==2.8.0 pandocfilters==1.4.2 parso==0.3.4 partd==0.3.10 path.py==11.5.0 pathlib2==2.3.3 patsy==0.5.1 pep8==1.7.1 pexpect==4.6.0 phik==0.9.12 pickleshare==0.7.5 Pillow==5.4.1 pkginfo==1.5.0.1 pluggy==0.9.0 ply==3.11 prometheus-client==0.6.0 prompt-toolkit==2.0.9 protobuf==3.10.0 psutil==5.6.1 ptyprocess==0.6.0 py==1.8.0 pycodestyle==2.5.0 pycosat==0.6.3 pycparser==2.19 pycrypto==2.6.1 pycurl==7.43.0.2 pyflakes==2.1.1 Pygments==2.3.1 PyInstaller==3.5 pylint==2.3.1 pyodbc==4.0.26 pyOpenSSL==19.0.0 pyparsing==2.3.1 pyrsistent==0.14.11 pyserial==3.4 PySimpleGUI==4.1.0 PySocks==1.6.8 pytest==4.3.1 pytest-arraydiff==0.3 pytest-astropy==0.5.0 pytest-doctestplus==0.3.0 pytest-openfiles==0.3.2 pytest-pylint==0.14.0 pytest-remotedata==0.3.1 python-dateutil==2.8.0 pytz==2018.9 PyWavelets==1.0.2 PyYAML==5.1 pyzmq==18.0.0 QtAwesome==0.5.7 qtconsole==4.4.3 QtPy==1.7.0 ray==0.7.3 redis==3.3.8 requests==2.23.0 requests-toolbelt==0.9.1 retrying==1.3.3 rope==0.12.0 ruamel-yaml==0.15.46 scikit-image==0.14.2 scikit-learn==0.20.3 scipy==1.4.1 seaborn==0.9.0 SecretStorage==3.1.1 Send2Trash==1.5.0 simplegeneric==0.8.1 singledispatch==3.4.0.3 six==1.12.0 snowballstemmer==1.2.1 sortedcollections==1.1.2 sortedcontainers==2.1.0 soupsieve==1.8 Sphinx==1.8.5 sphinxcontrib-websupport==1.1.0 spyder==3.3.3 spyder-kernels==0.4.2 SQLAlchemy==1.3.1 statsmodels==0.9.0 sympy==1.3 tables==3.5.1 tangled-up-in-unicode==0.0.6 tblib==1.3.2 terminado==0.8.1 testpath==0.4.2 toolz==0.9.0 tornado==6.0.2 tqdm==4.46.0 traitlets==4.3.2 typed-ast==1.4.0 unicodecsv==0.14.1 urllib3==1.24.1 virtualenv==16.7.9 visions==0.4.4 wcwidth==0.1.7 webencodings==0.5.1 Werkzeug==0.14.1 widgetsnbextension==3.5.1 wrapt==1.11.1 wurlitzer==1.0.2 xlrd==1.2.0 XlsxWriter==1.1.5 xlwt==1.3.0 zeep==3.4.0 zict==0.1.4 zipp==0.3.3

loopyme commented 4 years ago

My analysis here was wrong, so I have deleted it.

pvojnisek commented 4 years ago

Thank you for your reply, @loopyme. I have tried these settings:

plot:
    histogram:
        bins: 6
        bayesian_blocks_bins: yes

The result is not what I expected. It looks like this: [image]

The histogram should be built from this table: [image]

I think your idea about plot.histogram.bins is great and straightforward. I can imagine someone needing to define a different number of bins for each variable. Is it possible to define that somehow?

loopyme commented 4 years ago

After some tests, I found my error, and I am very sorry that I pointed to the wrong location for the bug. I think the cause of this problem is complex; the most direct cause (though perhaps not the root cause) is here:

https://github.com/pandas-profiling/pandas-profiling/blob/58d4a5408e6064a22c4155466de3deac6c894b9a/src/pandas_profiling/report/structure/variables/render_real.py#L100

bayesian_blocks_bins does affect the histogram, but it doesn't seem to work the way @pvojnisek wants either. I believe the problem lies not in the default config values but in the fact that plot.histogram.bins does not control the rendering behavior.

A different config for each variable is currently not supported, however.
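Until something like that exists, per-variable bins can be computed outside the report. The following is only a sketch of that idea and not a pandas-profiling feature: the helpers `integer_bin_edges` and `bin_edges_per_column` and the column `other_score` are hypothetical, and it assumes integer-valued columns.

```python
def integer_bin_edges(values):
    """Return bin edges centred on every integer from min(values) to max(values)."""
    lo, hi = int(min(values)), int(max(values))
    # One edge half a unit below each integer, plus a closing edge above the maximum.
    return [k - 0.5 for k in range(lo, hi + 2)]


def bin_edges_per_column(columns):
    """Map each column name to its own list of bin edges."""
    return {name: integer_bin_edges(vals) for name, vals in columns.items()}


# Hypothetical data: OK75_num mirrors the 0..5 variable from the report.
columns = {
    "OK75_num": [0, 1, 2, 3, 4, 5, 3, 2],
    "other_score": [1, 2, 2, 3],
}

edges = bin_edges_per_column(columns)
print(edges["OK75_num"])     # [-0.5, 0.5, 1.5, 2.5, 3.5, 4.5, 5.5]
print(edges["other_score"])  # [0.5, 1.5, 2.5, 3.5]
```

Edge lists like these can be passed to plotting functions that accept explicit bin edges, such as the `bins` argument of matplotlib's `hist`, to draw one bar per integer value.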


@sbrugman, do you have any comments or an existing solution for this bug? I will continue working on this issue if necessary.

sbrugman commented 4 years ago

Thanks for reporting this, @pvojnisek. As @loopyme points out, the bin size used to be (unintentionally) hard-coded in render_real.py. The next release will pre-compute histograms earlier in the process anyway, which, in addition to allowing more efficient parallelization, will include a fix for this problem.

sbrugman commented 4 years ago

The v2.9.0rc1 release is out, and should resolve this issue. Until this version is fully released, you can install it via pip in the following way:

pip install --pre -U pandas-profiling

It would be very helpful to know if the release candidate adequately solves the issue.
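Before re-running the report, it may help to confirm which version is actually installed in the active environment. This is a small sketch using the standard library's `importlib.metadata` (Python 3.8+); the helper name `installed_version` is made up for illustration.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("pandas-profiling")
if version is None:
    print("pandas-profiling is not installed in this environment")
else:
    print("pandas-profiling version:", version)
```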