ydataai / ydata-profiling

1 Line of code data quality profiling & exploratory data analysis for Pandas and Spark DataFrames.
https://docs.profiling.ydata.ai
MIT License

Decimals are not supported properly #1602

Open ibobak opened 5 months ago

ibobak commented 5 months ago

Current Behaviour

Spark Dataframe structure:

root
 |-- device_id: string (nullable = true)
 |-- device_install_date: date (nullable = true)
 |-- max_device_event_date: date (nullable = true)
 |-- distinct_play_days: long (nullable = true)
 |-- sessions: long (nullable = true)
 |-- playtime_sec_total: decimal(38,6) (nullable = true)
 |-- intersession_sec_sum: decimal(38,6) (nullable = true)
 |-- playtime_sec_per_session: decimal(38,6) (nullable = true)
 |-- playtime_sec_per_playing_day: decimal(38,6) (nullable = true)
 |-- days_since_install: long (nullable = true)
 |-- avg_ses_between_sessions: decimal(38,6) (nullable = true)
 |-- loyalty_index: double (nullable = true)
 |-- install_date: date (nullable = true)

code:

from ydata_profiling import ProfileReport

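# df_basic_features_3 is the Spark DataFrame with the schema shown above;
# app_code is a string defined elsewhere in the notebook.
report = ProfileReport(df_basic_features_3, minimal=True, title=app_code)
report.to_file(f"profiling/{app_code}_features_3.html")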

Look at the distribution it produced for playtime_sec_total: [image]

I then converted this DataFrame to a Pandas DataFrame, and here is what the data actually looks like: [image]

So the conclusion is this: the product is totally buggy with this type of field, and I don't trust it any more.

Expected Behaviour

You need to fix the handling of decimal fields.

Data Description

see above

Code that reproduces the bug

see above

pandas-profiling version

ydata-profiling==4.8.3

Dependencies

a2wsgi==1.10.4
aiohttp==3.9.5
aiosignal==1.3.1
alembic==1.13.1
altair==5.3.0
annotated-types==0.6.0
anyio==4.3.0
apache-airflow==2.7.1
apache-airflow-providers-common-sql==1.13.0
apache-airflow-providers-ftp==3.9.0
apache-airflow-providers-http==4.11.0
apache-airflow-providers-imap==3.6.0
apache-airflow-providers-sqlite==3.8.0
apispec==6.6.1
argcomplete==3.3.0
argon2-cffi==23.1.0
argon2-cffi-bindings==21.2.0
arrow==1.3.0
arviz==0.16.1
asgiref==3.8.1
asn1crypto==1.5.1
asttokens==2.4.1
async-lru==2.0.4
async-timeout==4.0.3
attrs==23.2.0
Babel==2.15.0
backcall==0.2.0
backoff==2.2.1
bcrypt==4.1.3
beautifulsoup4==4.12.3
bleach==6.1.0
blinker==1.8.2
boto3==1.28.29
botocore==1.31.85
build==1.2.1
cachelib==0.9.0
cachetools==5.3.3
cattrs==23.2.3
certifi==2024.2.2
cffi==1.16.0
charset-normalizer==3.3.2
chroma-hnswlib==0.7.3
chromadb==0.4.24
click==8.1.7
cloudpickle==2.2.1
colorama==0.4.6
coloredlogs==15.0.1
colorlog==4.8.0
comm==0.2.2
ConfigUpdater==3.2
connexion==3.0.6
cons==0.4.6
contourpy==1.2.1
cron-descriptor==1.4.3
croniter==2.0.5
cryptography==42.0.7
cycler==0.12.1
dacite==1.8.1
databricks-cli==0.18.0
dataclasses-json==0.6.6
debugpy==1.8.1
decorator==5.1.1
defusedxml==0.7.1
Deprecated==1.2.14
dill==0.3.8
dnspython==2.6.1
docker==6.1.3
docutils==0.21.2
email-validator==1.3.1
entrypoints==0.4
et-xmlfile==1.1.0
etuples==0.3.9
exceptiongroup==1.2.0
executing==2.0.1
fastapi==0.111.0
fastapi-cli==0.0.4
fastjsonschema==2.19.1
fastprogress==1.0.3
filelock==3.14.0
Flask==2.2.5
Flask-AppBuilder==4.3.6
Flask-Babel==2.0.0
Flask-Caching==2.3.0
Flask-JWT-Extended==4.6.0
Flask-Limiter==3.7.0
Flask-Login==0.6.3
Flask-Session==0.8.0
Flask-SQLAlchemy==2.5.1
Flask-WTF==1.2.1
flatbuffers==24.3.25
fonttools==4.51.0
fqdn==1.5.1
frozenlist==1.4.1
fsspec==2024.5.0
gitdb==4.0.11
GitPython==3.1.43
google-auth==2.29.0
google-re2==1.1.20240501
googleapis-common-protos==1.63.0
graphviz==0.20.3
greenlet==3.0.3
grpcio==1.64.0
gunicorn==20.1.0
h11==0.14.0
h5netcdf==1.3.0
h5py==3.11.0
htmlmin==0.1.12
httpcore==1.0.5
httptools==0.6.1
httpx==0.27.0
huggingface-hub==0.23.2
humanfriendly==10.0
idna==3.7
ImageHash==4.3.1
importlib-metadata==6.11.0
importlib_resources==6.4.0
inflection==0.5.1
ipykernel==6.19.2
ipynb-py-convert==0.4.6
ipython==8.10.0
ipython-genutils==0.2.0
ipywidgets==7.6.5
isoduration==20.11.0
itsdangerous==2.2.0
jedi==0.19.1
Jinja2==3.1.4
jmespath==1.0.1
joblib==1.4.2
json5==0.9.25
jsonpatch==1.33
jsonpointer==2.4
jsonschema==4.22.0
jsonschema-specifications==2023.12.1
jupyter-contrib-core==0.4.2
jupyter-contrib-nbextensions==0.7.0
jupyter-events==0.10.0
jupyter-highlight-selected-word==0.2.0
jupyter-lsp==2.2.5
jupyter-nbextensions-configurator==0.6.3
jupyter_client==7.4.4
jupyter_core==5.7.2
jupyter_server==2.14.0
jupyter_server_terminals==0.5.3
jupyterlab==4.2.1
jupyterlab-execute-time==3.1.2
jupyterlab_pygments==0.3.0
jupyterlab_server==2.27.2
jupyterlab_widgets==3.0.10
kiwisolver==1.4.5
kubernetes==29.0.0
langchain==0.1.13
langchain-community==0.0.38
langchain-core==0.1.52
langchain-text-splitters==0.0.2
langsmith==0.1.67
lazy-object-proxy==1.10.0
lazyprofiler==0.1.1
limits==3.12.0
linkify-it-py==2.0.3
llvmlite==0.42.0
lockfile==0.12.2
logical-unification==0.4.6
lxml==5.2.2
Mako==1.3.5
Markdown==3.6
markdown-it-py==3.0.0
MarkupSafe==2.1.5
marshmallow==3.21.2
marshmallow-oneofschema==3.1.1
marshmallow-sqlalchemy==0.26.1
matplotlib==3.8.4
matplotlib-inline==0.1.7
mdit-py-plugins==0.4.1
mdurl==0.1.2
miniKanren==1.0.3
mistune==3.0.2
mlflow==2.5.0
mmh3==4.1.0
more-itertools==10.2.0
mpmath==1.3.0
msgspec==0.18.6
multidict==6.0.5
multimethod==1.11.2
multipledispatch==1.0.0
mypy-extensions==1.0.0
nbclassic==1.0.0
nbclient==0.10.0
nbconvert==7.16.4
nbformat==5.10.4
nest-asyncio==1.6.0
networkx==3.3
notebook==7.2.0
notebook_shim==0.2.4
numba==0.59.1
numpy==1.23.5
oauthlib==3.2.2
onnx==1.15.0
onnxconverter-common==1.14.0
onnxmltools==1.12.0
onnxruntime==1.17.1
openai==1.22.0
openpyxl==3.1.2
opentelemetry-api==1.24.0
opentelemetry-exporter-otlp==1.24.0
opentelemetry-exporter-otlp-proto-common==1.24.0
opentelemetry-exporter-otlp-proto-grpc==1.24.0
opentelemetry-exporter-otlp-proto-http==1.24.0
opentelemetry-instrumentation==0.46b0
opentelemetry-instrumentation-asgi==0.46b0
opentelemetry-instrumentation-fastapi==0.46b0
opentelemetry-proto==1.24.0
opentelemetry-sdk==1.24.0
opentelemetry-semantic-conventions==0.45b0
opentelemetry-util-http==0.46b0
optuna==3.5.0
optuna-fast-fanova==0.0.4
ordered-set==4.1.0
orjson==3.10.3
overrides==7.7.0
packaging==23.2
pandas==1.5.3
pandas-datareader==0.10.0
pandasql==0.7.3
pandocfilters==1.5.1
parso==0.8.4
pathspec==0.12.1
patsy==0.5.6
pendulum==2.1.2
pexpect==4.9.0
pgcopy==1.6.0
phik==0.12.4
pickleshare==0.7.5
pillow==10.3.0
platformdirs==4.2.2
plotly==5.22.0
pluggy==1.5.0
posthog==3.5.0
prison==0.2.1
prometheus_client==0.20.0
prompt-toolkit==3.0.43
protobuf==3.20.2
psutil==5.9.8
psycopg2==2.9.9
psycopg2-binary==2.9.7
ptyprocess==0.7.0
pulsar-client==3.5.0
pure-eval==0.2.2
pyarrow==12.0.1
pyasn1_modules==0.4.0
pycountry==23.12.11
pycparser==2.22
pydantic==2.7.0
pydantic_core==2.18.1
pydeck==0.9.1
Pygments==2.18.0
PyJWT==2.8.0
pymc==5.6.0
pyparsing==3.1.2
pypdf==4.1.0
PyPika==0.48.9
pyproject_hooks==1.1.0
pytensor==2.12.3
python-daemon==3.0.1
python-dateutil==2.9.0.post0
python-dotenv==1.0.1
python-json-logger==2.0.7
python-multipart==0.0.9
python-nvd3==0.16.0
python-slugify==8.0.4
pytz==2023.4
pytzdata==2020.1
PyWavelets==1.6.0
PyYAML==6.0.1
pyzmq==26.0.3
querystring-parser==1.2.4
redshift-connector==2.0.911
referencing==0.35.1
requests==2.32.3
requests-oauthlib==2.0.0
requests-toolbelt==1.0.0
rfc3339-validator==0.1.4
rfc3986-validator==0.1.1
rich==13.7.1
rich-argparse==1.4.0
rpds-py==0.18.1
s3transfer==0.6.2
scikit-learn==1.3.2
scipy==1.12.0
scramp==1.4.5
seaborn==0.12.2
Send2Trash==1.8.3
setproctitle==1.3.3
shap==0.42.1
shellingham==1.5.4
six==1.16.0
skl2onnx==1.16.0
slicer==0.0.7
smart-open==6.3.0
smmap==5.0.1
sniffio==1.3.1
soupsieve==2.5
spark_framework @ git+https://github.com/ibobak/spark_framework.git@8dcf0f5b29e71721d4d6069a76ae4fde1e7e7bde
SQLAlchemy==1.4.49
SQLAlchemy-JSONField==1.0.2
SQLAlchemy-Utils==0.41.2
sqlparse==0.5.0
stack-data==0.6.3
starlette==0.37.2
statsmodels==0.14.2
streamlit==1.32.2
sympy==1.12
tabulate==0.9.0
tenacity==8.0.1
termcolor==2.4.0
terminado==0.18.1
text-unidecode==1.3
threadpoolctl==3.5.0
tinycss2==1.3.0
tokenizers==0.19.1
tomli==2.0.1
toolz==0.12.1
tornado==6.2
tqdm==4.66.2
traitlets==5.9.0
typeguard==4.3.0
typer==0.12.3
types-python-dateutil==2.9.0.20240316
typing-inspect==0.9.0
typing_extensions==4.12.0
tzdata==2024.1
uc-micro-py==1.0.3
ujson==5.10.0
unicodecsv==0.14.1
uri-template==1.3.0
urllib3==2.0.7
uvicorn==0.30.0
uvloop==0.19.0
visions==0.7.6
watchdog==4.0.1
watchfiles==0.22.0
wcwidth==0.2.13
webcolors==1.13
webencodings==0.5.1
websocket-client==1.8.0
websockets==12.0
Werkzeug==3.0.3
widgetsnbextension==3.5.2
wordcloud==1.9.3
wrapt==1.16.0
WTForms==3.1.2
xarray==2024.3.0
xarray-einstats==0.7.0
xgboost==2.0.2
XlsxWriter==3.2.0
yarl==1.9.4
ydata-profiling==4.8.3
zipp==3.18.2

OS

Ubuntu 22.04

fabclmnt commented 4 months ago

Hi @ibobak ,

Thank you for reporting the issue. Regarding ydata-profiling for Spark, it is clear that we have so far launched only an initial version, which not only covers a small set of functionality but also has some known issues.

We are looking for contributors who are willing to keep evolving the Spark integration, as this was something initiated by the community. If you're open to it, feel free to check the issues labelled with the tag spark.
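
Until decimal handling is improved, one possible workaround is to cast the decimal(38,6) columns to double before building the report. This is only a sketch and has not been verified against ydata-profiling 4.8.3 or confirmed in this thread. The helper names decimal_columns and cast_decimals_to_double are hypothetical, and casting to double can lose precision for very large values.

```python
import re

# Matches Spark type strings such as "decimal(38,6)" as reported by DataFrame.dtypes.
DECIMAL_TYPE = re.compile(r"^decimal(\(\d+,\s*\d+\))?$")


def decimal_columns(dtypes):
    """Return the names of columns whose Spark type string is a decimal.

    `dtypes` is a list of (column_name, type_string) pairs, the shape
    returned by a Spark DataFrame's `dtypes` attribute.
    """
    return [name for name, type_name in dtypes if DECIMAL_TYPE.match(type_name)]


def cast_decimals_to_double(df):
    """Return a Spark DataFrame with every decimal column cast to double.

    Hypothetical helper, not part of ydata-profiling. Double has less
    precision than decimal(38,6), so very large values may be rounded.
    """
    # Imported lazily so decimal_columns() can be used without Spark installed.
    from pyspark.sql import functions as F

    for name in decimal_columns(df.dtypes):
        df = df.withColumn(name, F.col(name).cast("double"))
    return df
```

With the DataFrame from the report above, this would be used as ProfileReport(cast_decimals_to_double(df_basic_features_3), minimal=True, title=app_code). Whether the resulting histogram is then correct still needs to be checked against the Pandas output.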