ydataai / ydata-profiling

1 Line of code data quality profiling & exploratory data analysis for Pandas and Spark DataFrames.
https://docs.profiling.ydata.ai
MIT License

Hidden index conversion with ProfileReport #620

Open lionettis opened 3 years ago

lionettis commented 3 years ago

A pandas.DataFrame with columns indexed by an Int64Index object gets the column index converted to Index(..., dtype='object') upon calling ProfileReport.

To Reproduce

import pandas as pd
from pandas_profiling import ProfileReport

df = pd.DataFrame([{0: "a", 1: "b"}])
print(df.columns)  # integer column labels before profiling
profile = ProfileReport(df)
print(df.columns)  # column labels have been converted to strings

Output:

Int64Index([0, 1], dtype='int64')
Index(['0', '1'], dtype='object')

Version information:

Running on jupyter-lab, Python 3.8.5.

pip freeze

absl-py==0.10.0 alembic==1.4.1 argon2-cffi==20.1.0 astor==0.8.1 astunparse==1.6.3 async-generator==1.10 attrs==20.2.0 azure-core==1.8.2 azure-storage-blob==12.5.0 backcall==0.2.0 bleach==3.2.1 blis==0.4.1 cachetools==4.1.1 catalogue==1.0.0 catboost==0.24.2 certifi==2020.6.20 cffi==1.14.3 chardet==3.0.4 click==7.1.2 cloudpickle==1.6.0 colorlover==0.3.0 combo==0.1.1 confuse==1.3.0 cryptography==3.2 cufflinks==0.17.3 cvxopt==1.2.5 cycler==0.10.0 cymem==2.0.3 databricks-cli==0.13.0 datefinder==0.7.1 dbus-python==1.2.16 decorator==4.4.2 defusedxml==0.6.0 docker==4.3.1 dtw-python==1.1.6 entrypoints==0.3 et-xmlfile==1.0.1 Flask==1.1.2 funcy==1.15 future==0.18.2 gast==0.3.3 gensim==3.8.3 gitdb==4.0.5 GitPython==3.1.11 google-auth==1.22.1 google-auth-oauthlib==0.4.1 google-pasta==0.2.0 gorilla==0.3.0 graphviz==0.14.2 grpcio==1.33.1 gunicorn==20.0.4 h5py==2.10.0 htmlmin==0.1.12 hyperopt==0.2.5 idna==2.8 ImageHash==4.1.0 imbalanced-learn==0.7.0 iniconfig==1.1.1 ipykernel==5.3.4 ipython==7.18.1 ipython-genutils==0.2.0 ipywidgets==7.5.1 isodate==0.6.0 itsdangerous==1.1.0 jdcal==1.4.1 jedi==0.17.2 Jinja2==2.11.2 joblib==0.17.0 json5==0.9.5 jsonschema==3.2.0 jupyter-client==6.1.7 jupyter-console==6.2.0 jupyter-core==4.6.3 jupyterlab==2.2.9 jupyterlab-pygments==0.1.2 jupyterlab-server==1.2.0 jupyterlab-templates==0.2.5 kaleido==0.0.3.post1 Keras-Preprocessing==1.1.2 kiwisolver==1.2.0 kmodes==0.10.2 lightgbm==3.0.0 llvmlite==0.34.0 Mako==1.1.3 Markdown==3.3.3 MarkupSafe==1.1.1 matplotlib==3.3.2 missingno==0.4.2 mistune==0.8.4 mlflow==1.11.0 mlxtend==0.17.3 msrest==0.6.19 murmurhash==1.0.2 nbclient==0.5.1 nbconvert==6.0.7 nbformat==5.0.8 nest-asyncio==1.4.2 networkx==2.5 nltk==3.5 notebook==6.1.4 numba==0.51.2 numexpr==2.7.1 numpy==1.19.2 oauthlib==3.1.0 openpyxl==3.0.5 opt-einsum==3.3.0 packaging==20.4 pandas==1.1.3 pandas-profiling==2.9.0 pandocfilters==1.4.3 parso==0.7.1 patsy==0.5.1 pexpect==4.8.0 phik==0.10.0 pickleshare==0.7.5 Pillow==8.0.1 plac==1.1.3 plotly==4.9.0 
pluggy==0.13.1 preshed==3.0.2 prometheus-client==0.8.0 prometheus-flask-exporter==0.18.1 prompt-toolkit==3.0.8 protobuf==3.13.0 ptyprocess==0.6.0 py==1.9.0 pyasn1==0.4.8 pyasn1-modules==0.2.8 pycaret==2.1.2 pycparser==2.20 Pygments==2.7.2 PyGObject==3.36.0 pyLDAvis==2.1.2 pyod==0.8.3 pyparsing==2.4.7 pyrsistent==0.17.3 pytest==6.1.1 python-apt==2.0.0+ubuntu0.20.4.1 python-dateutil==2.8.1 python-editor==1.0.4 pytz==2020.1 PyWavelets==1.1.1 PyYAML==5.3.1 pyzmq==19.0.2 querystring-parser==1.2.4 regex==2020.10.23 requests==2.24.0 requests-oauthlib==1.3.0 requests-unixsocket==0.2.0 retrying==1.3.3 rsa==4.6 scikit-learn==0.23.2 scikit-plot==0.3.7 scipy==1.5.3 seaborn==0.11.0 Send2Trash==1.5.0 shap==0.36.0 six==1.15.0 sklearn==0.0 slicer==0.0.4 smart-open==3.0.0 smmap==3.0.4 spacy==2.3.2 SQLAlchemy==1.3.13 sqlparse==0.4.1 srsly==1.0.2 statsmodels==0.12.0 suod==0.0.4 tabulate==0.8.7 tangled-up-in-unicode==0.0.6 tensorboard==2.2.2 tensorboard-plugin-wit==1.7.0 tensorflow==2.2.0 tensorflow-docs===0.0.0ba943b2a740157625a1e2ec4fc59c9a6171eb44f- tensorflow-estimator==2.2.0 termcolor==1.1.0 terminado==0.9.1 testpath==0.4.4 textblob==0.15.3 thinc==7.4.1 threadpoolctl==2.1.0 toml==0.10.1 tornado==6.0.4 tqdm==4.51.0 traitlets==5.0.5 umap-learn==0.4.6 urllib3==1.25.8 visions==0.5.0 wasabi==0.8.0 wcwidth==0.2.5 webencodings==0.5.1 websocket-client==0.57.0 Werkzeug==1.0.1 widgetsnbextension==3.5.1 wordcloud==1.8.0 wrapt==1.12.1 xgboost==1.2.1 xlrd==1.2.0 yellowbrick==1.2

sbrugman commented 3 years ago

Great catch, we should create a test for this and fix it :)
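A regression test for this could look roughly like the sketch below. It is only an illustration, not the project's actual test: `columns_preserved` and its `build_report` parameter are hypothetical names, and the helper accepts any callable so the check can be exercised without the library installed.

```python
import pandas as pd


def columns_preserved(build_report, df: pd.DataFrame) -> bool:
    """Return True if calling build_report(df) leaves df.columns unchanged.

    build_report is any callable that takes a DataFrame (hypothetical
    parameter; in a real test it would be the report constructor).
    """
    before = df.columns.copy()  # snapshot of the labels and their dtype
    build_report(df)
    # Index.equals compares the labels; the dtype check catches a silent
    # conversion such as int64 -> object.
    return df.columns.equals(before) and df.columns.dtype == before.dtype


def stringify_columns(df: pd.DataFrame) -> None:
    # Simulates the reported behaviour: labels are cast to strings in place.
    df.columns = df.columns.astype(str)
```

With pandas-profiling installed, the same helper could presumably be called as `columns_preserved(ProfileReport, df)` with the DataFrame from the report above; on the affected version that call would be expected to return `False`.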

DeepOde commented 3 years ago

> Great catch, we should create a test for this and fix it :)

I want to fix it :) but I have never contributed to an open-source project. I found that the constructor sets self.df = df. If it is never intended to mutate the original df, shall I change it so that it creates a new DataFrame object (provided the time required to copy doesn't prohibit us from doing so)? Sorry, I'd have asked this on Slack, but the link is broken!
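The defensive-copy idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the library's code: `HypotheticalReport` is a made-up stand-in for `ProfileReport`, the `copy` flag is invented for the example, and the in-place cast of column labels to strings is only an assumed cause of the behaviour reported above.

```python
import pandas as pd


class HypotheticalReport:
    """Made-up stand-in for ProfileReport; not the real class."""

    def __init__(self, df: pd.DataFrame, copy: bool = True):
        # Store a copy so later changes do not reach the caller's frame.
        # With copy=False the same object is stored, as in `self.df = df`.
        self.df = df.copy() if copy else df
        # Assumed source of the bug: relabel the stored frame's columns
        # as strings. This only affects the caller when no copy was made.
        self.df.columns = self.df.columns.astype(str)


original = pd.DataFrame([{0: "a", 1: "b"}])
report = HypotheticalReport(original)
print(original.columns)   # the caller's labels are left as integers
print(report.df.columns)  # the stored copy carries the string labels
```

The trade-off is the one raised in the comment: `DataFrame.copy()` duplicates the data by default, which costs time and memory on large frames, so whether copying is acceptable depends on typical input sizes.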

sbrugman commented 3 years ago

@DeepOde feel free to have a go at it (I've updated the Slack link).