ydataai / ydata-profiling

1 Line of code data quality profiling & exploratory data analysis for Pandas and Spark DataFrames.
https://docs.profiling.ydata.ai
MIT License

Config Descriptions ValidationError #1108

Open ktavabi opened 2 years ago

ktavabi commented 2 years ago

Current Behaviour

When adding a description to the profiler configuration via **kwargs, I get the following error:

ValidationError: 1 validation error for Settings
descriptions
  extra fields not permitted (type=value_error.extra)

Expected Behaviour

I am expecting a profile report with said description, as seen here

Data Description

I am reproducing the error with the example meteorite dataset

Code that reproduces the bug

import pandas as pd
import pandas_profiling as pp

description = "Disclaimer: this profiling report was generated using a sample of 5% of the original dataset."

edf = pd.read_csv('Meteorite_Landings.csv')
prfl = pp.ProfileReport(edf.sample(frac=.05), **{"descriptions": description})

pandas-profiling version

3.2.0

Dependencies

Python version: Python 3.9.13 (clang-1316.0.21.2.5) on darwin installed via pyenv

Environment: ipython

appnope==0.1.3
argon2-cffi==21.3.0
argon2-cffi-bindings==21.2.0
asttokens==2.0.8
attrs==22.1.0
backcall==0.2.0
basemap-data==1.3.2
basemap-data-hires==1.3.2
beautifulsoup4==4.11.1
black==22.10.0
bleach==5.0.1
certifi==2022.6.15
cffi==1.15.1
charset-normalizer==2.1.1
click==8.1.3
cycler==0.11.0
debugpy==1.6.3
decorator==5.1.1
defusedxml==0.7.1
entrypoints==0.4
executing==0.10.0
fastjsonschema==2.16.1
flake8==5.0.4
Flask==2.2.2
fonttools==4.37.1
geographiclib==1.52
geopy==2.2.0
gevent==21.12.0
greenlet==1.1.3
htmlmin==0.1.12
idna==3.3
ImageHash==4.2.1
importlib-metadata==4.12.0
ipykernel==6.15.1
ipython==8.4.0
ipython-genutils==0.2.0
ipywidgets==8.0.1
itsdangerous==2.1.2
jedi==0.18.1
Jinja2==3.1.2
joblib==1.1.0
jsonschema==4.14.0
jupyter==1.0.0
jupyter-console==6.4.4
jupyter-core==4.11.1
jupyter_client==7.3.5
jupyterlab-pygments==0.2.2
jupyterlab-widgets==3.0.2
kiwisolver==1.4.4
lazy_loader==0.1rc2
lxml==4.9.1
MarkupSafe==2.1.1
matplotlib==3.5.3
matplotlib-inline==0.1.6
mccabe==0.7.0
missingno==0.5.1
mistune==2.0.4
multimethod==1.8
multipledispatch==0.6.0
mypy-extensions==0.4.3
natsort==8.1.0
nbclient==0.6.7
nbconvert==7.0.0
nbformat==5.4.0
nest-asyncio==1.5.5
networkx==2.8.6
notebook==6.4.12
numpy==1.23.2
packaging==21.3
pandas==1.4.3
pandas-flavor==0.3.0
pandas-profiling==3.2.0
pandocfilters==1.5.0
parso==0.8.3
pathspec==0.10.1
patsy==0.5.2
pexpect==4.8.0
phik==0.12.2
pickleshare==0.7.5
Pillow==9.2.0
platformdirs==2.5.2
prometheus-client==0.14.1
prompt-toolkit==3.0.30
psutil==5.9.1
ptyprocess==0.7.0
pure-eval==0.2.2
pycodestyle==2.9.1
pycparser==2.21
pydantic==1.9.2
pyflakes==2.5.0
Pygments==2.13.0
pyjanitor==0.23.1
pyparsing==3.0.9
pyproj==3.3.1
pyrsistent==0.18.1
python-dateutil==2.8.2
pytz==2022.2.1
PyWavelets==1.3.0
PyYAML==6.0
pyzmq==23.2.1
qtconsole==5.3.2
QtPy==2.2.0
requests==2.28.1
scipy==1.9.1
seaborn==0.11.2
Send2Trash==1.8.0
Shapely==1.8.4
six==1.16.0
soupsieve==2.3.2.post1
stack-data==0.5.0
statsmodels==0.13.2
tangled-up-in-unicode==0.2.0
terminado==0.15.0
tinycss2==1.1.1
tomli==2.0.1
tornado==6.2
tqdm==4.64.0
traitlets==5.3.0
typing_extensions==4.4.0
urllib3==1.26.12
visions==0.7.4
wcwidth==0.2.5
webencodings==0.5.1
Werkzeug==2.2.2
widgetsnbextension==4.0.2
wsgi-oauth2==0.2.2
xarray==2022.6.0
zipp==3.8.1
zope.event==4.5.0
zope.interface==5.4.0

OS

macos


fabclmnt commented 1 year ago

@ktavabi can you please validate whether the behavior remains with the latest version, 3.4.0?

aquemy commented 1 year ago

> @ktavabi can you please validate whether the behavior remains with the latest version, 3.4.0?

I was able to reproduce in 3.5.0.

aquemy commented 1 year ago

@ktavabi @fabclmnt there are two separate problems here: one in the documentation and one in the code posted by @ktavabi.

The issue in the documentation:

profile = sample.profile_report(description=description, minimal=True)

The problem is that description is not a top-level setting; it is nested under the dataset category.

The issue with the code snippet:

prfl = pp.ProfileReport(edf.sample(frac=.05), **{"descriptions": description})

The field descriptions (with an s) is used to describe the columns of the dataset.

The proper behavior can be achieved using:

prfl = pp.ProfileReport(edf.sample(frac=.05), **{"dataset":{"description": description}})

alternatively:

prfl = pp.ProfileReport(edf.sample(frac=.05), dataset={"description": description})
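
To illustrate why the nested form is accepted while the top-level keyword is rejected, here is a minimal stdlib-only sketch of a settings model that forbids unknown top-level fields. It is not the library's actual implementation: the Settings and Dataset classes below are simplified, hypothetical stand-ins, and build_settings is an illustrative helper whose error text only mimics the message reported above.

```python
from dataclasses import dataclass, field, fields


@dataclass
class Dataset:
    # Hypothetical stand-in: dataset-level metadata lives in its own category.
    description: str = ""


@dataclass
class Settings:
    # Hypothetical stand-in for the top-level configuration model.
    title: str = "Profiling Report"
    minimal: bool = False
    dataset: Dataset = field(default_factory=Dataset)


def build_settings(**kwargs):
    """Build Settings from keyword arguments, rejecting unknown top-level keys."""
    allowed = {f.name for f in fields(Settings)}
    extra = sorted(set(kwargs) - allowed)
    if extra:
        # Mimics the "extra fields not permitted" error; not the real validator.
        raise ValueError(f"{', '.join(extra)}: extra fields not permitted")
    if isinstance(kwargs.get("dataset"), dict):
        # Convert the nested dict into the nested model.
        kwargs["dataset"] = Dataset(**kwargs["dataset"])
    return Settings(**kwargs)


description = (
    "Disclaimer: this profiling report was generated using a sample "
    "of 5% of the original dataset."
)

# A top-level "descriptions" key is not a field of Settings, so it is rejected.
try:
    build_settings(descriptions=description)
    rejected = False
except ValueError as exc:
    rejected = True
    print(exc)

# Nesting the text under "dataset" matches the model and is accepted.
settings = build_settings(dataset={"description": description})
print(settings.dataset.description)
```

In this sketch a top-level description= keyword (the form shown in the outdated documentation snippet) would be rejected for the same reason, since only title, minimal, and dataset are fields of the top-level model.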


The documentation page on dataset metadata (https://pandas-profiling.ydata.ai/docs/master/pages/use_cases/metadata.html#dataset-metadata) is up to date, so only this section needs to be fixed: https://pandas-profiling.ydata.ai/docs/master/pages/use_cases/big_data.html#sample-the-dataset

I will only modify the documentation.