ydataai / ydata-profiling

1 Line of code data quality profiling & exploratory data analysis for Pandas and Spark DataFrames.
https://docs.profiling.ydata.ai
MIT License

The "unique (%)" field appears to be using "distinct" #539

Closed jmadison222 closed 4 years ago

jmadison222 commented 4 years ago

The "Unique (%)" field appears to be just a percentage restatement of the notion of "Distinct".

Using the famous Iris data set, the Sepal Length field has 22 distinct values and 9 unique values out of 150 observations, where "distinct" means the number of different values present and "unique" means the number of values that occur exactly once.
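For readers less familiar with the distinction, here is a minimal sketch in plain Python. It assumes "distinct" counts the different values present and "unique" counts the values that occur exactly once. The toy list and the function name distinct_and_unique are illustrative and not part of the profiler.

```python
from collections import Counter


def distinct_and_unique(values):
    """Return (distinct_count, unique_count) for an iterable of hashable values."""
    counts = Counter(values)
    distinct = len(counts)  # how many different values appear at all
    unique = sum(1 for c in counts.values() if c == 1)  # values seen exactly once
    return distinct, unique


# Toy data (hypothetical, not taken from Iris): 5.1 and 4.9 repeat, 4.7 and 5.0 do not.
sample = [5.1, 4.9, 4.7, 5.1, 5.0, 4.9, 5.1]
print(distinct_and_unique(sample))  # (4, 2)
```

A unique value is always also a distinct value, so the unique count can never exceed the distinct count.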

This is NOT what appears in the Pandas Profiler. Instead, what appears to be happening is this:

Now, as the authors of this product, you may not want all four of these, but the proper values, I believe, should be:

If you're okay taking up the real estate, do add all four. If you can only fit two, show the counts. Either way, don't restate the distinct count as a percentage and call it unique.

If I'm missing something, I apologize, but things do seem out of whack.

To reproduce, run the profiler on the Iris data set and review the Sepal Length field. To verify my unique count, hack it the old-fashioned way: cat iris.csv | cut -d, -f1 | sort | head -150 > unique, then open the file in vim and dd away the rows that aren't unique. I'm sure there's a slicker way, but it only takes a minute. That is how I got my count of 9.
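A slicker way might look like the following sketch, which counts the values in one CSV column with the standard library instead of editing the file by hand. It assumes the column of interest is the first one, mirroring the cut -d, -f1 call above. The function name column_counts and its skip_header parameter are hypothetical.

```python
import csv
from collections import Counter


def column_counts(path, column=0, skip_header=False):
    """Return (rows, distinct, unique) for one column of a CSV file.

    Values are compared as raw strings, so "5.1" and "5.10" count as different.
    """
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        if skip_header:
            next(reader, None)  # drop the header row if the file has one
        counts = Counter(row[column] for row in reader if row)
    rows = sum(counts.values())  # non-empty rows read
    distinct = len(counts)  # different values present
    unique = sum(1 for c in counts.values() if c == 1)  # values seen exactly once
    return rows, distinct, unique


# Example call (the file name is an assumption):
# rows, distinct, unique = column_counts("iris.csv")
```

Pass skip_header=True if the copy of the file being checked starts with a header row, otherwise the header text is counted as one more value.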

But pick your own favorite data set, and just see that Unique (%) is nothing more than Distinct Count over Number of Observations, which I believe is the wrong definition.
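To make the difference concrete, this sketch plugs the counts reported above (22 distinct, 9 unique, 150 observations) into both ratios. The function name percent_of is illustrative only. The first ratio is what the report appears to label Unique (%), and the second is the unique count over the number of observations.

```python
def percent_of(count, n_observations):
    """Express a count as a percentage of the number of observations."""
    return 100.0 * count / n_observations


n = 150        # observations, as reported in this issue
distinct = 22  # distinct count, as reported in this issue
unique = 9     # unique count, as reported in this issue

distinct_pct = percent_of(distinct, n)  # roughly 14.67
unique_pct = percent_of(unique, n)      # 6.0

print(f"Distinct (%): {distinct_pct:.2f}")
print(f"Unique (%):   {unique_pct:.2f}")
```

With these counts the two percentages differ by more than a factor of two, so the mislabeling is easy to spot on any column that contains repeated values.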

Environment: Python 3.6.9 :: Anaconda, Inc. on RHEL 7.

Version information:

absl-py==0.7.1 addressable==1.4.2 aenum==2.1.2 affine==2.3.0 alabaster==0.7.12 alembic==1.0.8 altair==4.0.1 altgraph==0.16.1 aniso8601==8.0.0 appdirs==1.4.3 asn1crypto==0.24.0 astor==0.7.1 astroid==2.2.5 astropy==4.0.1.post1 astunparse==1.6.2 async-generator==1.10 atomicwrites==1.3.0 attrs==19.3.0 autograd==1.2 Automat==0.7.0 autopep8==1.4.4 avro-python3==1.9.0 aws==0.2.5 awscli==1.16.158 azure-common==1.1.24 azure-storage-blob==2.1.0 azure-storage-common==2.1.0 Babel==2.7.0 backcall==0.1.0 backports.functools-lru-cache==1.6.1 bcolz==1.2.1 bcrypt==3.1.6 beautifulsoup4==4.7.1 bitarray==0.8.3 black==19.3b0 bleach==3.1.0 blessed==1.16.1 blis==0.2.4 bokeh==1.0.4 boto==2.49.0 boto3==1.9.187 botocore==1.12.148 Bottleneck==1.2.1 branca==0.3.1 bs4==0.0.1 cached-property==1.5.1 cachetools==3.1.1 catboost==0.17.5 category-encoders==1.3.0 CensusData==1.4 censusgeocode==0.4.3.post1 certifi==2020.6.20 cffi==1.12.2 chardet==3.0.4 ciso8601==2.1.3 Click==7.0 click-plugins==1.0.4 cligj==0.5.0 cloudpickle==0.8.0 cnn-finetune==0.6.0 colorama==0.3.9 colorlover==0.3.0 colour==0.1.5 configparser==3.7.4 confuse==1.3.0 constantly==15.1.0 convertdate==2.2.0 cryptography==2.6.1 cssselect==1.0.3 cufflinks==0.15 cursor==1.2.0 cvxopt==1.2.3 cvxpy==1.0.24 cx-Oracle==7.1.2 cycler==0.10.0 cymem==2.0.2 Cython==0.29.10 cytoolz==0.9.0.1 dash==1.0.0 dash-core-components==1.0.0 dash-html-components==1.0.0 dash-renderer==1.0.0 dash-table==4.0.0 dask==1.1.4 databricks-cli==0.8.7 dataclasses==0.6 datacompy==0.6.0 datatable==0.9.0 deap==1.3.1 decorator==4.3.2 defusedxml==0.5.0 descartes==1.1.0 dfply==0.3.3 dill==0.2.9 dimod==0.8.18 distlib==0.3.0 distributed==1.26.0 distro==1.4.0 Django==2.2.6 dnspython==1.16.0 docker==4.0.2 docker-compose==1.25.0 dockerpty==0.4.1 docopt==0.6.2 docutils==0.14 dwave-cloud-client==0.6.2 dwave-hybrid==0.4.1 dwave-neal==0.5.2 dwave-networkx==0.8.3 dwave-ocean-sdk==1.5.0 dwave-qbsolv==0.2.10 dwave-system==0.8.0 dwave-tabu==0.2.2 dwavebinarycsp==0.0.12 ecos==2.0.7.post1 
entrypoints==0.3 enum34==1.1.6 et-xmlfile==1.0.1 eventlet==0.24.1 ExifRead==2.1.2 extract-msg==0.23.2 fabric==2.4.0 fake-useragent==0.1.11 Faker==1.0.4 fastai==1.0.52 fastavro==0.23.5 fastcluster==1.1.25 fastprogress==0.1.21 fbprophet==0.5 ffmpeg==1.4 filelock==3.0.12 Fiona==1.8.9.post2 Flask==1.0.2 Flask-Compress==1.4.0 Flask-RESTful==0.3.8 Flask-Session==0.3.1 Flask-SQLAlchemy==2.4.1 folium==0.10.1 funcy==1.13 future==0.17.1 fuzzywuzzy==0.17.0 gast==0.2.2 GDAL==2.3.3 gensim==3.8.0 geographiclib==1.50 geojson==2.5.0 geopandas==0.6.1 geopy==1.20.0 gitdb2==2.0.5 GitPython==2.1.11 google-api-python-client==1.4.2 google-auth==1.6.3 google-auth-httplib2==0.0.3 google-pasta==0.1.8 gorilla==0.3.0 graphviz==0.8.4 great-expectations==0.9.3 greenlet==0.4.15 gremlinpython==3.4.2 grpcio==1.19.0 gunicorn==19.9.0 h2o==3.18.0.11 h2o-pysparkling-2.1==2.1.41 h5py==2.9.0 halo==0.0.23 hdbscan==0.8.22 hdfs==2.2.2 HeapDict==1.0.0 holidays==0.9.11 holoviews==1.12.7 homebase==1.0.1 htmlmin==0.1.12 httplib2==0.12.3 hyperlink==18.0.0 ibm-db==3.0.1 icc-rt==2019.0 idna==2.8 ijson==2.5.1 ImageHash==4.1.0 imagesize==1.1.0 IMAPClient==2.1.0 imbalanced-learn==0.4.3 imblearn==0.0 importlib-metadata==0.18 importlib-resources==1.5.0 impyla==0.14.2.2 imutils==0.5.2 incremental==17.5.0 inflection==0.3.1 inspect-it==0.3.2 intel-openmp==2019.0 intervaltree==3.0.2 invoke==1.2.0 ipydatawidgets==4.0.1 ipykernel==4.10.1 ipython==6.4.0 ipython-genutils==0.2.0 ipyvolume==0.5.2 ipywebrtc==0.5.0 ipywidgets==7.5.1 isodate==0.6.0 isort==4.3.21 isoweek==1.3.3 itsdangerous==1.1.0 JayDeBeApi==1.1.1 jdcal==1.4 jedi==0.13.3 jellyfish==0.5.6 Jinja2==2.11.2 jmespath==0.9.4 joblib==0.13.2 JPype1==0.6.3 jsonlines==1.2.0 jsonschema==3.0.2 jupyter-client==5.2.4 jupyter-core==4.4.0 jupyterhub==0.9.4 jupyterlab==0.35.4 jupyterlab-code-formatter==0.2.1 jupyterlab-server==0.2.0 Keras==2.2.4 Keras-Applications==1.0.7 Keras-Preprocessing==1.0.9 keyring==5.3 kiwisolver==1.0.1 kmodes==0.10.1 koalas==0.6.0 lazy-object-proxy==1.4.2 
ldap3==2.6.1 lifelines==0.21.0 lightgbm==2.2.3 lime==0.1.1.36 llvmlite==0.29.0 locket==0.2.0 log-symbols==0.0.12 ludwig==0.2.1 lunardate==0.2.0 lxml==4.5.1 macholib==1.11 Mako==1.0.7 mapclassify==2.0.1 Markdown==3.0.1 MarkupSafe==1.1.1 marshmallow==2.21.0 matplotlib==3.3.0 matplotlib-venn==0.11.5 mccabe==0.6.1 metaflow==2.0.0 minorminer==0.1.9 missingno==0.4.2 mistune==0.8.4 mizani==0.6.0 mkl==2019.0 mkl-fft==1.1.0 mkl-random==1.1.1 mkl-service==2.3.0 mlflow==1.1.0 mlxtend==0.17.2 mock==2.0.0 monotonic==1.5 more-itertools==6.0.0 mpld3==0.3 mpmath==1.1.0 msgpack==0.6.1 multiprocess==0.70.7 munch==2.3.2 murmurhash==1.0.2 mysql==0.0.2 mysqlclient==1.4.6 natsort==6.0.0 nbconvert==5.4.1 nbdime==1.0.5 nbformat==4.4.0 networkx==2.4 nltk==3.4.4 nose==1.3.7 notebook==5.7.8 ntlm-auth==1.2.0 num2words==0.5.10 numba==0.43.0 numexpr==2.6.9 numpy==1.19.1 numpydoc==0.9.1 nvidia-ml-py3==7.352.0 oauth2client==1.5.2 olefile==0.46 opencv-contrib-python-headless==4.0.0.21 opencv-python-headless==4.0.0.21 openpyxl==2.6.1 ortools==7.4.7247 oscrypto==1.2.0 osqp==0.5.0 packaging==19.0 palettable==3.1.1 pamela==1.0.0 pandas==1.1.0 pandas-bokeh==0.4.2 pandas-profiling==2.8.0 pandas-summary==0.0.6 pandasql==0.7.3 pandocfilters==1.4.2 param==1.9.2 paramiko==2.4.2 parse==1.11.1 parsel==1.5.1 parso==0.3.4 partd==0.3.10 pathlib-mate==0.0.15 patsy==0.5.1 pbr==5.1.3 pdfkit==0.6.1 pdftotext==2.1.1 pefile==2019.4.14 penaltymodel==0.16.2 penaltymodel-cache==0.4.0 penaltymodel-lp==0.1.0 penaltymodel-mip==0.2.1 percy==2.0.2 pexpect==4.6.0 phik==0.10.0 pickleshare==0.7.5 Pillow==7.2.0 pixiedust==1.1.17 plac==0.9.6 plotly==4.2.1 plotnine==0.6.0 plucky==0.4.3 pluggy==0.12.0 ply==3.11 pm4py==1.2.10 postgres==3.0.0 ppscore==0.0.2 preshed==2.0.1 pretrainedmodels==0.7.4 prettytable==0.7.2 prince==0.6.3 prometheus-client==0.6.0 prompt-toolkit==1.0.15 prophet==0.1.1 protobuf==3.7.0 psutil==5.6.1 psycopg2==2.7.7 psycopg2-binary==2.8.2 psycopg2-pool==1.1 ptyprocess==0.6.0 PuLP==2.0 py==1.8.0 py4j==0.10.7 
pyarrow==0.12.1 pyasn1==0.4.5 pyasn1-modules==0.2.4 pycodestyle==2.5.0 pycosat==0.6.3 pycparser==2.19 pycrypto==2.6.1 pycryptodomex==3.9.4 PyDispatcher==2.0.5 pydoop==2.0.0 pydotplus==2.0.2 pydqc==0.1.0 pyee==5.0.0 pyflakes==2.1.1 Pygments==2.3.1 PyHamcrest==1.9.0 PyHive==0.6.2 PyInstaller==3.4 pyIsEmail==1.3.2 PyJWT==1.7.1 pyLDAvis==2.1.2 pylev==1.3.0 pylint==2.3.1 PyMeeus==0.3.6 pyminizip==0.2.4 PyMySQL==0.9.3 PyNaCl==1.3.0 pyodbc==4.0.26 PyOpenGL==3.1.0 pyOpenSSL==19.0.0 pyparsing==2.3.1 PyPDF2==1.26.0 pyppeteer==0.0.25 pyproj==2.4.0 PyQt5==5.9.2 pyqubo==0.4.0 pyquery==1.4.0 pyreadr==0.2.1 pyrsistent==0.14.11 pysal==2.0.0 pyshp==2.1.0 PySocks==1.6.8 pyspark==2.1.3 pystan==2.17.1.0 pytesseract==0.2.6 pytest==4.6.3 pytest-mock==1.10.4 pytest-sugar==0.9.2 python-dateutil==2.8.0 python-editor==1.0.4 python-engineio==3.4.3 python-geohash==0.8.5 python-ldap==3.2.0 python-oauth2==1.0.1 python-snappy==0.5.4 python-socketio==4.0.0 pythreejs==2.1.1 pytz==2018.9 pyvis==0.1.7.0 pyviz-comms==0.7.3 PyWavelets==1.0.2 PyYAML==3.13 pyzmq==18.0.1 QtAwesome==0.6.0 qtconsole==4.5.5 QtPy==1.9.0 Quandl==3.4.6 querystring-parser==1.2.4 queuelib==1.5.0 rasterio==1.1.0 rasterstats==0.13.1 rawdata==0.1.0 redis==3.2.0 regex==2020.1.8 requests==2.24.0 requests-html==0.10.0 requests-ntlm==1.1.0 requests-toolbelt==0.9.1 retrying==1.3.3 rope==0.14.0 rpy2==3.0.5 rsa==3.4.2 rsconnect-python==1.4.4.1 Rtree==0.8.3 ruamel.yaml==0.15.46 s3transfer==0.2.0 sas7bdat==2.2.3 sasl==0.2.1 saspy==2.4.3 scikit-image==0.14.2 scikit-learn==0.20.3 scikit-plot==0.3.7 scipy==1.5.2 scp==0.13.1 Scrapy==1.6.0 scs==2.1.0 seaborn==0.9.0 selenium==3.141.0 Send2Trash==1.5.0 service-identity==18.1.0 setuptools-git==1.2 shap==0.29.3 shape==1.0.0 Shapely==1.6.4.post2 SharePlum==0.2.0 sharepoint==0.4.2 sharepy==1.3.0 simple-http-server==0.1.7 simple-salesforce==0.74.2 simplegeneric==0.8.1 simplejson==3.16.0 simpy==3.0.11 sip==4.19.8 six==1.15.0 sklearn-contrib-py-earth==0.1.0 sklearn-gbmi==1.0.0 sklearn-pandas==1.8.0 
smart-open==1.8.4 smmap2==2.0.5 snakebite-py3==3.0.5 snakify==1.1.1 snowballstemmer==1.9.1 snowflake-connector-python==2.2.4 snuggs==1.4.7 sortedcontainers==2.1.0 soupsieve==1.8 spacy==2.1.4 spark-df-profiling==1.1.13 Sphinx==2.2.0 sphinxcontrib-applehelp==1.0.1 sphinxcontrib-devhelp==1.0.1 sphinxcontrib-htmlhelp==1.0.2 sphinxcontrib-jsmath==1.0.1 sphinxcontrib-qthelp==1.0.2 sphinxcontrib-serializinghtml==1.1.3 spinners==0.0.23 spyder==3.2.8 sql-metadata==1.7.1 SQLAlchemy==1.3.1 sqlparse==0.3.1 srsly==0.0.5 statsmodels==0.9.0 stopit==1.1.2 swifter==0.284 sympy==1.4 tables==3.5.1 tabula-py==1.4.1 tabulate==0.8.3 tangled-up-in-unicode==0.0.6 tbb==2019.0 tbb4py==2019.0 tblib==1.3.2 tensorboard==1.13.1 tensorflow==1.13.1 tensorflow-estimator==1.13.0 termcolor==1.1.0 terminado==0.8.2 testpath==0.4.2 text-unidecode==1.2 texttable==1.6.2 thinc==7.0.4 threadpoolctl==2.0.0 thrift==0.13.0 thrift-sasl==0.2.1 thriftpy==0.3.9 TM1py==1.3.1 toml==0.10.0 toolz==0.9.0 torch==1.0.0 torchvision==0.2.2.post3 tornado==5.1 TPOT==0.11.3 tqdm==4.48.0 traitlets==4.3.2 traittypes==0.2.1 treeinterpreter==0.2.2 tweedie==0.0.7 Twisted==18.9.0 typed-ast==1.4.0 typesentry==0.2.7 typing==3.6.6 tzlocal==1.5.1 unicode-slugify==0.1.3 Unidecode==1.1.0 update-checker==0.17 uritemplate==3.0.0 urllib3==1.24.1 us==1.0.0 uszipcode==0.2.2 virtualenv==20.0.19 visidata==1.5.2 visions==0.4.4 w3lib==1.20.0 waitress==1.3.0 wasabi==0.2.2 wcwidth==0.1.7 webencodings==0.5.1 websocket-client==0.56.0 websockets==7.0 Werkzeug==0.14.1 widgetsnbextension==3.5.1 wrapt==1.11.2 xgbfir==0.3.1 xgboost==0.90 xlrd==1.2.0 XlsxWriter==1.1.8 xlwt==1.3.0 yapf==0.26.0 zict==0.1.4 zipcodes==1.0.5 zipp==0.5.1 zope.interface==4.6.0

sbrugman commented 4 years ago

Great catch, that's indeed a mistake. Thanks for taking the time to write up this complete report. It should be fixed in the next release.