ydataai / ydata-profiling

1 Line of code data quality profiling & exploratory data analysis for Pandas and Spark DataFrames.
https://docs.profiling.ydata.ai
MIT License
12.53k stars 1.68k forks

Dataset with categorical features causes memory error even on tiny dataset. #1384

Open boris-kogan opened 1 year ago

boris-kogan commented 1 year ago

Current Behaviour

Dataset with categorical features causes memory error even on tiny dataset.

  File "/usr/local/lib/python3.9/dist-packages/ydata_profiling/profile_report.py", line 439, in _render_json
    description = self.description_set
  File "/usr/local/lib/python3.9/dist-packages/typeguard/__init__.py", line 1033, in wrapper
    retval = func(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/ydata_profiling/profile_report.py", line 245, in description_set
    self._description_set = describe_df(
  File "/usr/local/lib/python3.9/dist-packages/ydata_profiling/model/describe.py", line 151, in describe
    metrics, duplicates = progress(get_duplicates, pbar, "Detecting duplicates")(
  File "/usr/local/lib/python3.9/dist-packages/ydata_profiling/utils/progress_bar.py", line 11, in inner
    ret = fn(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/multimethod/__init__.py", line 315, in __call__
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/ydata_profiling/model/pandas/duplicates_pandas.py", line 37, in pandas_get_duplicates
    df[duplicated_rows]
  File "/usr/local/lib/python3.9/dist-packages/pandas/core/groupby/groupby.py", line 2411, in size
    return self._reindex_output(result, fill_value=0)
  File "/usr/local/lib/python3.9/dist-packages/pandas/core/groupby/groupby.py", line 4143, in _reindex_output
    index, _ = MultiIndex.from_product(levels_list, names=names).sortlevel()
  File "/usr/local/lib/python3.9/dist-packages/pandas/core/indexes/multi.py", line 643, in from_product
    codes = cartesian_product(codes)
  File "/usr/local/lib/python3.9/dist-packages/pandas/core/reshape/util.py", line 60, in cartesian_product
    return [
  File "/usr/local/lib/python3.9/dist-packages/pandas/core/reshape/util.py", line 62, in <listcomp>
    np.repeat(x, b[i]),
  File "<__array_function__ internals>", line 180, in repeat
  File "/usr/local/lib/python3.9/dist-packages/numpy/core/fromnumeric.py", line 479, in repeat
    return _wrapfunc(a, 'repeat', repeats, axis=axis)
  File "/usr/local/lib/python3.9/dist-packages/numpy/core/fromnumeric.py", line 57, in _wrapfunc
    return bound(*args, **kwds)
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 389. GiB for an array with shape (418190868480,) and data type int8
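
As a quick sanity check on the figure in the error message, the sketch below converts the reported array shape and dtype into GiB. It uses only the numbers quoted in the traceback and makes no assumption about which columns produced them.

```python
import numpy as np

# Values copied from the error message above.
n_elements = 418_190_868_480
itemsize = np.dtype("int8").itemsize  # bytes per element (1 for int8)

# Convert bytes to GiB (1 GiB = 2**30 bytes).
gib = n_elements * itemsize / 2**30
print(f"{gib:.2f} GiB")  # 389.47 GiB
```

The result agrees with the "389. GiB" that NumPy reports for the single array it tried to allocate.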

Expected Behaviour

Current ydata-profiling code:

duplicated_rows = (
    df[duplicated_rows]
    .groupby(supported_columns, dropna=False)
    .size()
    .reset_index(name=duplicates_key)
)

Should be:

duplicated_rows = (
    df[duplicated_rows]
    .groupby(supported_columns, dropna=False, observed=True)
    .size()
    .reset_index(name=duplicates_key)
)

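With observed=False (the default in the pandas versions discussed here), a groupby on categorical columns produces a result row for every combination of categories, whether or not it occurs in the data. That Cartesian product is what exhausts memory. This pandas issue explains the solution: https://github.com/pandas-dev/pandas/issues/30552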
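
A minimal sketch of the effect of `observed`, using a hypothetical two-column frame (the column names `a` and `b` are illustrative and not taken from ydata-profiling). `observed` is passed explicitly in both calls so the result does not depend on the default of the installed pandas version.

```python
import pandas as pd

# Hypothetical frame: two categorical columns with three categories each.
df = pd.DataFrame(
    {
        "a": pd.Categorical(["x", "y", "z"]),
        "b": pd.Categorical(["p", "q", "r"]),
    }
)

# observed=False: one entry per combination of categories (3 * 3),
# including the combinations that never occur in the data.
full = df.groupby(["a", "b"], observed=False).size()

# observed=True: only the combinations that actually occur.
seen = df.groupby(["a", "b"], observed=True).size()

print(len(full))  # 9
print(len(seen))  # 3
```

With more categorical columns the first result grows as the product of the category counts, while the second never exceeds the number of rows.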

Data Description

A 10x10 pandas DataFrame of random strings, with every column's dtype set to category.

Code that reproduces the bug

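import pandas as pd
from IPython.display import display  # available by default in notebooks
from ydata_profiling import ProfileReport

# 10x10 frame of random 10-character strings, every column categorical.
# Note: pd.util.testing is deprecated and was removed in pandas 2.0.
df = pd.DataFrame(data=pd.util.testing.rands_array(10, size=(10, 10)), dtype="category")

report = ProfileReport(df, title="Profiling Report")

display(report)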
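
For pandas versions where `pd.util.testing` is unavailable, a similar frame can be built with the standard library. This is a sketch, not the reporter's code: the helper name `random_category_frame` and its parameters are hypothetical, and the strings differ from those `rands_array` would generate.

```python
import random
import string

import pandas as pd


def random_category_frame(rows=10, cols=10, length=10, seed=0):
    """Build a rows x cols frame of random strings with categorical columns."""
    rng = random.Random(seed)  # seeded so the frame is reproducible
    alphabet = string.ascii_letters + string.digits
    data = [
        ["".join(rng.choices(alphabet, k=length)) for _ in range(cols)]
        for _ in range(rows)
    ]
    # Convert every column to the category dtype.
    return pd.DataFrame(data).astype("category")


df = random_category_frame()
print(df.shape)
print(df.dtypes.unique())
```

The resulting `df` can be passed to `ProfileReport` in place of the frame built above.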

pandas-profiling version

v4.0.0

Dependencies

pandas==1.3.4

OS

ubuntu

alexbarros commented 1 year ago

I was able to replicate even for pandas >= 2. Since you not only found the issue but also the solution, would you like to submit the PR with the fix @boris-kogan?

boris-kogan commented 1 year ago

Sure. I will create a PR.