Open boris-kogan opened 1 year ago
I was able to replicate even for pandas >= 2. Since you not only found the issue but also the solution, would you like to submit the PR with the fix @boris-kogan?
Sure, I will create a PR.
On Tue, Aug 8, 2023 at 10:43 PM Alex Barros wrote:
> I was able to replicate even for pandas >= 2. Since you not only found the issue but also the solution, would you like to submit the PR with the fix @boris-kogan?
Current Behaviour
A dataset with categorical features causes a memory error, even when the dataset is tiny.
  File "/usr/local/lib/python3.9/dist-packages/ydata_profiling/profile_report.py", line 439, in _render_json
    description = self.description_set
  File "/usr/local/lib/python3.9/dist-packages/typeguard/__init__.py", line 1033, in wrapper
    retval = func(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/ydata_profiling/profile_report.py", line 245, in description_set
    self._description_set = describe_df(
  File "/usr/local/lib/python3.9/dist-packages/ydata_profiling/model/describe.py", line 151, in describe
    metrics, duplicates = progress(get_duplicates, pbar, "Detecting duplicates")(
  File "/usr/local/lib/python3.9/dist-packages/ydata_profiling/utils/progress_bar.py", line 11, in inner
    ret = fn(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/multimethod/__init__.py", line 315, in __call__
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/ydata_profiling/model/pandas/duplicates_pandas.py", line 37, in pandas_get_duplicates
    df[duplicated_rows]
  File "/usr/local/lib/python3.9/dist-packages/pandas/core/groupby/groupby.py", line 2411, in size
    return self._reindex_output(result, fill_value=0)
  File "/usr/local/lib/python3.9/dist-packages/pandas/core/groupby/groupby.py", line 4143, in _reindex_output
    index, _ = MultiIndex.from_product(levels_list, names=names).sortlevel()
  File "/usr/local/lib/python3.9/dist-packages/pandas/core/indexes/multi.py", line 643, in from_product
    codes = cartesian_product(codes)
  File "/usr/local/lib/python3.9/dist-packages/pandas/core/reshape/util.py", line 60, in cartesian_product
    return [
  File "/usr/local/lib/python3.9/dist-packages/pandas/core/reshape/util.py", line 62, in <listcomp>
    np.repeat(x, b[i]),
  File "<__array_function__ internals>", line 180, in repeat
  File "/usr/local/lib/python3.9/dist-packages/numpy/core/fromnumeric.py", line 479, in repeat
    return _wrapfunc(a, 'repeat', repeats, axis=axis)
  File "/usr/local/lib/python3.9/dist-packages/numpy/core/fromnumeric.py", line 57, in _wrapfunc
    return bound(*args, **kwds)
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 389. GiB for an array with shape (418190868480,) and data type int8
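As a rough sanity check on the figure in the error, the allocation corresponds to one byte per element (int8) for 418,190,868,480 elements. The sketch below reproduces that arithmetic and shows how quickly a Cartesian product of category counts grows. The helper `cartesian_rows` and the ten-columns-by-ten-categories input are hypothetical illustrations; the traceback does not state the reporter's actual category counts.

```python
import math


def cartesian_rows(levels_per_column):
    """Rows in the full Cartesian product of the given category counts (hypothetical helper)."""
    return math.prod(levels_per_column)


# Element count taken from the error message; int8 uses 1 byte per element.
reported_elements = 418_190_868_480
reported_gib = reported_elements / 2**30
print(f"{reported_gib:.1f} GiB")

# Assumed example: ten categorical columns with ten categories each.
hypothetical_rows = cartesian_rows([10] * 10)
print(hypothetical_rows)
```

The row count depends only on the number of categories per column, not on the number of rows in the dataset, which is why a tiny frame can still request an enormous array.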
Expected Behaviour
Current ydata-profiling code:

    duplicated_rows = (
        df[duplicated_rows]
        .groupby(supported_columns, dropna=False)
        .size()
        .reset_index(name=duplicates_key)
    )

Should be:

    duplicated_rows = (
        df[duplicated_rows]
        .groupby(supported_columns, dropna=False, observed=True)
        .size()
        .reset_index(name=duplicates_key)
    )
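See this pandas issue, which explains the solution: https://github.com/pandas-dev/pandas/issues/30552. With categorical grouping keys, observed=True limits the result to the category combinations that actually occur in the data, rather than the full Cartesian product of all categories.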
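A minimal sketch of the difference on a toy frame, assuming only pandas itself. The column names `a` and `b` and the `count` column are made up for illustration, and `dropna=False` is left out to keep the comparison focused on `observed`.

```python
import pandas as pd

# Toy frame: "a" has 3 declared categories, "b" has 2, but only 2 combinations occur.
df = pd.DataFrame({"a": ["x", "x", "y"], "b": ["p", "p", "q"]})
df["a"] = pd.Categorical(df["a"], categories=["x", "y", "z"])
df["b"] = pd.Categorical(df["b"], categories=["p", "q"])

# observed=False: one output row per combination of declared categories (3 * 2).
full = df.groupby(["a", "b"], observed=False).size().reset_index(name="count")

# observed=True: one output row per combination that actually appears in the data.
seen = df.groupby(["a", "b"], observed=True).size().reset_index(name="count")

print(len(full), len(seen))
```

Both results account for the same three input rows; the unobserved combinations in `full` simply carry a count of zero.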
Data Description
A 10x10 pandas DataFrame of random strings, with the dtype specified as category.
Code that reproduces the bug
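The original reproduction snippet is not included here. What follows is only a hedged sketch built from the data description above, not the reporter's code. The column names, the string length, the fixed seed, and the repeated first row are all assumptions. The `ProfileReport(df).to_json()` call is an assumed trigger inferred from the `_render_json` frame in the traceback, and it is left commented out so the sketch stays safe to run.

```python
import random
import string

import pandas as pd

random.seed(0)  # fixed seed so the sketch is repeatable


def random_string(length=8):
    """Random lowercase string (hypothetical helper)."""
    return "".join(random.choices(string.ascii_lowercase, k=length))


n_rows, n_cols = 10, 10
data = {
    f"col_{i}": [random_string() for _ in range(n_rows - 1)]
    for i in range(n_cols)
}
df = pd.DataFrame(data)
# Assumption: repeat the first row so there is one duplicate to report.
df = pd.concat([df, df.iloc[[0]]], ignore_index=True).astype("category")

# Assumed trigger for the memory error (not executed here):
# from ydata_profiling import ProfileReport
# ProfileReport(df).to_json()

# The grouping pattern from the proposed fix, applied to the duplicated rows.
columns = list(df.columns)
mask = df.duplicated(subset=columns, keep=False)
duplicates = (
    df[mask]
    .groupby(columns, dropna=False, observed=True)
    .size()
    .reset_index(name="count")
)
print(duplicates["count"].tolist())
```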
pandas-profiling version
v4.0.0
Dependencies
OS
ubuntu
Checklist