ydataai / ydata-profiling

1 Line of code data quality profiling & exploratory data analysis for Pandas and Spark DataFrames.
https://docs.profiling.ydata.ai
MIT License

Crashes with memory leak, seems to be deadlock related #1550

Open jeff-hykin opened 6 months ago

jeff-hykin commented 6 months ago

Current Behaviour

Reading a 1-column, 7-row file causes a total lockup. (It also happens with a bigger file; I shrank it down to a minimal case.)

I think this could be different from this issue and this issue

Here is the CLI output:

importing
reading data
loading as csv
generating report: './main/inputs.ignore.report.html'
Summarize dataset:   0%|                                                                                                        | 0/5 [00:00<?, ?it/s]
zsh: killed     ydata ./main/inputs.ignore.csv
/opt/homebrew/Cellar/python@3.11/3.11.3/Frameworks/Python.framework/Versions/3.11/lib/python3.11/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 3 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
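
Because the process ends with `zsh: killed`, no Python traceback is printed, so the output above does not show where it was stuck. One possible way to get that information is the standard-library `faulthandler` module, which can dump every thread's Python stack on a timer. This is only a sketch: the helper names `arm_hang_watchdog` and `cancel_hang_watchdog` and the log file name are made up for illustration, and it only covers threads in the main process, not any worker processes.

```python
import faulthandler


def arm_hang_watchdog(timeout_s: int = 60, log_path: str = "hang_traceback.log"):
    """Dump all thread stacks to log_path if the process is still running after timeout_s.

    Returns the open log file. It has to stay open until the dump happens
    or cancel_hang_watchdog() is called.
    """
    # Hypothetical log file name; any writable path works.
    log_file = open(log_path, "w")
    # repeat=True writes a fresh dump every timeout_s seconds while the process is stuck.
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=log_file)
    return log_file


def cancel_hang_watchdog(log_file) -> None:
    """Disarm the timer once the slow section has finished normally."""
    faulthandler.cancel_dump_traceback_later()
    log_file.close()
```

In the script from this issue, the watchdog would be armed right before `profile.to_file(report_path)` and cancelled right after it. If the call hangs, the log file should show which Python frames each thread is sitting in.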

Expected Behaviour

Generate the HTML report.

Data Description

Here's the contents of the CSV.

NOTE1: Removing even 1 row makes the freeze/hang go away.

NOTE2: Despite the "fragility", the behavior is consistent: it always works with 1 row removed, and always hangs when all rows are present.

data
1.8979166666666665
1.8770833333333332
1696285500.0
1.8
1.8010416666666667
1.8114583333333334

Code that reproduces the bug

#!/usr/bin/env python3
# pip install ydata-profiling
print("importing")
import numpy as np  # not used below, kept so the imports match the original run
import pandas as pd
from ydata_profiling import ProfileReport
from io import StringIO
import sys
import os

print("reading data")
filepath = sys.argv[1]
with open(filepath, 'r') as f:
    output = f.read()

# Guess the parsing options from the raw text
kwargs = dict(sep=",")
if output.startswith("#"):
    kwargs["comment"] = "#"

if output.count('\t') > output.count(','):
    kwargs["sep"] = "\t"

# Use StringIO to create a file-like object from the string
print("loading as csv")
# Pass the guessed options along (they were built but never used before)
df = pd.read_csv(StringIO(output), **kwargs)
profile = ProfileReport(df, title="Profiling Report")
new_path_base = os.path.dirname(filepath)
basename = os.path.basename(filepath)
if "." not in basename:
    new_path_base += f"/{basename}"
else:
    new_path_base += "/" + ".".join(basename.split(".")[0:-1])

report_path = f"{new_path_base}.report.html"
print(f'''generating report: {repr(report_path)}''')
profile.to_file(report_path)
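
For anyone who wants to try this without saving a CSV to disk, the script above can probably be condensed to the sketch below. It inlines the values from the Data Description section and uses the same `pd.read_csv(StringIO(...))` and `ProfileReport(...).to_file(...)` calls as the script. The names `CSV_TEXT`, `load_repro_frame`, and `write_report` are made up here, and it is an assumption that this shorter form triggers the same hang.

```python
from io import StringIO

import pandas as pd

# The CSV contents as listed under "Data Description".
CSV_TEXT = """data
1.8979166666666665
1.8770833333333332
1696285500.0
1.8
1.8010416666666667
1.8114583333333334
"""


def load_repro_frame() -> pd.DataFrame:
    """Parse the inline CSV the same way the script does (StringIO + read_csv)."""
    return pd.read_csv(StringIO(CSV_TEXT))


def write_report(df: pd.DataFrame, report_path: str) -> None:
    """Run the profiling step that hangs in this report. Never called automatically."""
    # Imported lazily so the data can be inspected without ydata-profiling installed.
    from ydata_profiling import ProfileReport

    ProfileReport(df, title="Profiling Report").to_file(report_path)


df = load_repro_frame()
# One value is many orders of magnitude larger than the rest; whether that
# matters for the hang is not established in this issue.
print(df.describe())
```

Calling `write_report(df, "repro.report.html")` (the output path is arbitrary) is the step that corresponds to `profile.to_file(report_path)` in the full script.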

pandas-profiling version

v4.6.4

Dependencies

# NOTE: Python 3.11.3
aiohttp==3.8.4
aiosignal==1.3.1
alabaster==0.7.13
annotated-types==0.6.0
ansi2html==1.8.0
anyio==4.1.0
appdirs==1.4.4
appnope==0.1.3
argon2-cffi==23.1.0
argon2-cffi-bindings==21.2.0
arrow==1.3.0
astor==0.8.1
asttokens==2.2.1
async-lru==2.0.4
async-timeout==4.0.2
attrdict==2.0.1
attrs==23.1.0
Babel==2.13.0
backcall==0.2.0
beautifulsoup4==4.12.2
bidict==0.22.1
bleach==6.1.0
blissful-basics==0.2.36
CacheControl==0.12.14
cachy==0.3.0
category-encoders==2.6.2
certifi==2023.7.22
cffi==1.15.1
charset-normalizer==3.2.0
cleo==1.0.0a5
click==8.1.7
cloudpickle==2.2.1
colorama==0.4.6
comm==0.1.4
contourpy==1.1.0
cool-cache==0.3.6
crashtest==0.3.1
cycler==0.11.0
Cython==3.0.0
dacite==1.8.1
dash==2.12.1
dash-bootstrap-components==1.5.0
dash-core-components==2.0.0
dash-html-components==2.0.0
dash-table==5.0.0
dask==2023.5.0
deap==1.4.1
debugpy==1.8.0
decorator==5.1.1
defusedxml==0.7.1
Deprecated==1.2.14
deprecation==2.1.0
distlib==0.3.7
distributed==2023.5.0
docstring-to-markdown==0.12
docutils==0.18.1
dulwich==0.20.50
engineering-notation==0.10.0
et-xmlfile==1.1.0
executing==1.2.0
-e git+ssh://git@github.com/jeff-hykin/ez_yaml.git@e08a2f7abfeac5ad8af1d23b68b99ae3525c93f4#egg=ez_yaml&subdirectory=main
fastjsonschema==2.18.0
file-system-py==0.0.11
filelock==3.12.2
Flask==2.2.5
Flask-Cors==4.0.0
fonttools==4.42.0
fqdn==1.5.1
frozenlist==1.3.3
fsspec==2023.10.0
gensim==4.3.2
gym==0.22.0
gym-notices==0.0.8
html5lib==1.1
htmlmin==0.1.12
idna==3.4
ImageHash==4.3.1
imagesize==1.4.1
imbalanced-learn==0.11.0
importlib-metadata==4.13.0
importlib-resources==6.0.1
informative-iterator==2.1.1
ipykernel==6.26.0
ipython==8.12.2
ipython-genutils==0.2.0
ipywidgets==7.6.5
isoduration==20.11.0
itsdangerous==2.1.2
jaraco.classes==3.3.0
jedi==0.19.0
Jinja2==3.1.2
joblib==1.3.2
-e git+ssh://git@github.com/jeff-hykin/json_fix.git@6303ca934b25bf72bec82b0f5ca1d282f4566543#egg=json_fix&subdirectory=main
json5==0.9.14
jsonpointer==2.4
jsonschema==4.19.0
jsonschema-specifications==2023.7.1
jupyter-events==0.9.0
jupyter-lsp==2.2.0
jupyter_client==8.6.0
jupyter_core==5.5.0
jupyter_server==2.10.1
jupyter_server_terminals==0.4.4
jupyterlab==4.0.9
jupyterlab-widgets==3.0.8
jupyterlab_pygments==0.3.0
jupyterlab_server==2.25.2
kaleido==0.2.1
keyring==24.2.0
kiwisolver==1.4.4
libsvm==3.23.0.4
llvmlite==0.40.1
locket==1.0.0
lockfile==0.12.2
MarkupSafe==2.1.3
matplotlib==3.7.2
matplotlib-inline==0.1.6
mistune==3.0.2
mne==1.5.1
more-itertools==10.1.0
mpmath==1.3.0
msgpack==1.0.5
multidict==6.0.4
multimethod==1.10
nbclient==0.9.0
nbconvert==7.11.0
nbformat==5.9.2
nest-asyncio==1.5.7
networkx==3.1
notebook==7.0.6
notebook_shim==0.2.3
numba==0.57.1
numpy==1.24.4
-e git+https://github.com/TAMU-Robomasters/cv_main@298ae48927c2d6644b4b3834039548c3c7694cc0#egg=opencv&subdirectory=repos/open_cv/modules/python/package
openpyxl==3.1.2
orjson==3.9.5
overrides==7.4.0
packaging==23.1
pandas==2.0.3
pandasgui==0.2.14
pandocfilters==1.5.0
parso==0.8.3
partd==1.4.1
patsy==0.5.3
pexpect==4.8.0
phik==0.12.3
pickleshare==0.7.5
Pillow==10.0.0
pkginfo==1.9.6
pkgutil_resolve_name==1.3.10
platformdirs==2.6.2
plotly==5.16.1
plotly-utils @ git+https://github.com/SengerM/plotly_utils@5f7e724d16d3ce7aa8282613220474bd2fcb90e5
pluggy==1.0.0
pmdarima==2.0.3
poetry==1.2.1
poetry-core==1.2.0
poetry-plugin-export==1.1.2
pooch==1.8.0
pretty-errors==1.2.25
prometheus-client==0.19.0
prompt-toolkit==3.0.39
psutil==5.9.5
ptyprocess==0.7.0
pure-eval==0.2.2
pyarrow==14.0.1
pycairo==1.23.0
pycparser==2.21
pydantic==2.5.3
pydantic_core==2.14.6
Pygments==2.16.1
PyGObject==3.44.1
pylev==1.4.0
pynput==1.7.6
pyobjc-core==10.0
pyobjc-framework-ApplicationServices==10.0
pyobjc-framework-Cocoa==10.0
pyobjc-framework-Quartz==10.0
pyod==1.1.0
pyparsing==3.0.9
Pypubsub==4.0.3
PyQt5==5.15.10
PyQt5-Qt5==5.15.11
PyQt5-sip==12.13.0
PyQtWebEngine==5.15.6
PyQtWebEngine-Qt5==5.15.11
python-dateutil==2.8.2
python-engineio==4.4.1
python-json-logger==2.0.7
python-lsp-jsonrpc==1.0.0
python-lsp-server==1.7.3
python-socketio==5.8.0
pytz==2023.3
pytz-deprecation-shim==0.1.0.post0
PyWavelets==1.5.0
PyYAML==6.0.1
pyzmq==25.1.1
qgrid==1.3.1
qtstylish==0.1.5
quik-config==1.7.7
referencing==0.30.2
requests==2.31.0
requests-toolbelt==0.9.1
retrying==1.3.4
rfc3339-validator==0.1.4
rfc3986-validator==0.1.1
rpds-py==0.9.2
rpy2==3.5.12
schemdraw==0.15
scikit-base==0.5.1
scikit-learn==1.1.3
scikit-plot==0.3.7
scipy==1.10.1
seaborn==0.11.2
Send2Trash==1.8.2
shellingham==1.5.3
silver-spectacle==0.8.0
simplejson==3.19.2
six==1.16.0
sktime==0.21.1
slick-siphon==0.1.2
smart-open==6.4.0
sniffio==1.3.0
snowballstemmer==2.2.0
sortedcontainers==2.4.0
soupsieve==2.5
Sphinx==7.2.6
sphinx-rtd-theme==1.3.0
sphinxcontrib-applehelp==1.0.7
sphinxcontrib-devhelp==1.0.5
sphinxcontrib-htmlhelp==2.0.4
sphinxcontrib-jquery==4.1
sphinxcontrib-jsmath==1.0.1
sphinxcontrib-qthelp==1.0.6
sphinxcontrib-serializinghtml==1.1.9
stack-data==0.6.2
statsmodels==0.14.0
stopit==1.1.2
super-hash==1.2.8
sympy==1.12
tabloo==0.1.0
tangled-up-in-unicode==0.2.0
tbats==1.1.3
tblib==2.0.0
telegram-notifier==0.3
telepy-notify==0.2.1
tenacity==8.2.3
terminado==0.18.0
threadpoolctl==3.2.0
tinycss2==1.2.1
toml==0.10.2
tomlkit==0.12.1
toolz==0.12.0
torch==2.1.1
tornado==6.3.3
TPOT==0.12.1
tqdm==4.66.1
trace-updater==0.0.9.1
traitlets==5.9.0
-e git+ssh://git@github.com/ioerger2/transit2.git@5bcebc1d742b61b2def33d9611a9179f3e71fd9a#egg=transit2
tsdownsample==0.1.2
typeguard==4.1.5
types-python-dateutil==2.8.19.14
typing_extensions==4.7.1
tzdata==2023.3
tzlocal==4.3
ujson==5.7.0
update-checker==0.18.0
uri-template==1.3.0
urllib3==1.26.16
virtualenv==20.21.1
visions==0.7.5
wcwidth==0.2.6
webcolors==1.13
webencodings==0.5.1
websocket-client==1.6.4
Werkzeug==2.2.3
widgetsnbextension==3.5.2
wordcloud==1.9.2
wrapt==1.15.0
wurlitzer==3.0.3
wxPython==4.2.1
xattr==0.9.9
xgboost==1.7.6
xxhash==3.3.0
yarl==1.9.2
ydata-profiling==4.6.4
yellowbrick==1.5
zict==3.0.0
zipp==3.16.2

OS

MacOS 12.6 (Monterey) Apple Silicon

jeff-hykin commented 6 months ago

(Also, I know it's dumb to read the whole file and then use StringIO; the code was simplified for the issue.)