ydataai / ydata-profiling

1 Line of code data quality profiling & exploratory data analysis for Pandas and Spark DataFrames.
https://docs.profiling.ydata.ai
MIT License
12.38k stars 1.67k forks

MemoryError for particular input WITHOUT large outliers #1412

Open skunkyevil opened 1 year ago

skunkyevil commented 1 year ago

Current Behaviour

I came across a very weird MemoryError when trying to build a profile for a particular dataframe:

Error traceback (rather long):

```
MemoryError                               Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_2680/3177531553.py in <module>
      4 buggy_df = pd.read_pickle('buggy_df.pkl')
      5 original_report = ProfileReport(buggy_df)
----> 6 original_report.to_file("try6.html")

C:\Anaconda3\envs\work_pip2\lib\site-packages\typeguard\__init__.py in wrapper(*args, **kwargs)
   1031         memo = _CallMemo(python_func, _localns, args=args, kwargs=kwargs)
   1032         check_argument_types(memo)
-> 1033         retval = func(*args, **kwargs)
   1034         try:
   1035             check_return_type(retval, memo)

C:\Anaconda3\envs\work_pip2\lib\site-packages\pandas_profiling\profile_report.py in to_file(self, output_file, silent)
    307             create_html_assets(self.config, output_file)
    308
--> 309         data = self.to_html()
    310
    311         if output_file.suffix != ".html":

C:\Anaconda3\envs\work_pip2\lib\site-packages\typeguard\__init__.py in wrapper(*args, **kwargs)
   1031         memo = _CallMemo(python_func, _localns, args=args, kwargs=kwargs)
   1032         check_argument_types(memo)
-> 1033         retval = func(*args, **kwargs)
   1034         try:
   1035             check_return_type(retval, memo)

C:\Anaconda3\envs\work_pip2\lib\site-packages\pandas_profiling\profile_report.py in to_html(self)
    418
    419         """
--> 420         return self.html
    421
    422     def to_json(self) -> str:

C:\Anaconda3\envs\work_pip2\lib\site-packages\typeguard\__init__.py in wrapper(*args, **kwargs)
   1031         memo = _CallMemo(python_func, _localns, args=args, kwargs=kwargs)
   1032         check_argument_types(memo)
-> 1033         retval = func(*args, **kwargs)
   1034         try:
   1035             check_return_type(retval, memo)

C:\Anaconda3\envs\work_pip2\lib\site-packages\pandas_profiling\profile_report.py in html(self)
    229     def html(self) -> str:
    230         if self._html is None:
--> 231             self._html = self._render_html()
    232         return self._html
    233

C:\Anaconda3\envs\work_pip2\lib\site-packages\typeguard\__init__.py in wrapper(*args, **kwargs)
   1031         memo = _CallMemo(python_func, _localns, args=args, kwargs=kwargs)
   1032         check_argument_types(memo)
-> 1033         retval = func(*args, **kwargs)
   1034         try:
   1035             check_return_type(retval, memo)

C:\Anaconda3\envs\work_pip2\lib\site-packages\pandas_profiling\profile_report.py in _render_html(self)
    337         from pandas_profiling.report.presentation.flavours import HTMLReport
    338
--> 339         report = self.report
    340
    341         with tqdm(

C:\Anaconda3\envs\work_pip2\lib\site-packages\typeguard\__init__.py in wrapper(*args, **kwargs)
   1031         memo = _CallMemo(python_func, _localns, args=args, kwargs=kwargs)
   1032         check_argument_types(memo)
-> 1033         retval = func(*args, **kwargs)
   1034         try:
   1035             check_return_type(retval, memo)

C:\Anaconda3\envs\work_pip2\lib\site-packages\pandas_profiling\profile_report.py in report(self)
    223     def report(self) -> Root:
    224         if self._report is None:
--> 225             self._report = get_report_structure(self.config, self.description_set)
    226         return self._report
    227

C:\Anaconda3\envs\work_pip2\lib\site-packages\typeguard\__init__.py in wrapper(*args, **kwargs)
   1031         memo = _CallMemo(python_func, _localns, args=args, kwargs=kwargs)
   1032         check_argument_types(memo)
-> 1033         retval = func(*args, **kwargs)
   1034         try:
   1035             check_return_type(retval, memo)

C:\Anaconda3\envs\work_pip2\lib\site-packages\pandas_profiling\profile_report.py in description_set(self)
    205     def description_set(self) -> Dict[str, Any]:
    206         if self._description_set is None:
--> 207             self._description_set = describe_df(
    208                 self.config,
    209                 self.df,

C:\Anaconda3\envs\work_pip2\lib\site-packages\pandas_profiling\model\describe.py in describe(config, df, summarizer, typeset, sample)
     69         # Variable-specific
     70         pbar.total += len(df.columns)
---> 71         series_description = get_series_descriptions(
     72             config, df, summarizer, typeset, pbar
     73         )

C:\Anaconda3\envs\work_pip2\lib\site-packages\multimethod\__init__.py in __call__(self, *args, **kwargs)
    313         func = self[tuple(func(arg) for func, arg in zip(self.type_checkers, args))]
    314         try:
--> 315             return func(*args, **kwargs)
    316         except TypeError as ex:
    317             raise DispatchError(f"Function {func.__code__}") from ex

C:\Anaconda3\envs\work_pip2\lib\site-packages\pandas_profiling\model\pandas\summary_pandas.py in pandas_get_series_descriptions(config, df, summarizer, typeset, pbar)
     90         # TODO: use `Pool` for Linux-based systems
     91         with multiprocessing.pool.ThreadPool(pool_size) as executor:
---> 92             for i, (column, description) in enumerate(
     93                 executor.imap_unordered(multiprocess_1d, args)
     94             ):

C:\Anaconda3\envs\work_pip2\lib\multiprocessing\pool.py in next(self, timeout)
    868         if success:
    869             return value
--> 870         raise value
    871
    872     __next__ = next  # XXX

C:\Anaconda3\envs\work_pip2\lib\multiprocessing\pool.py in worker(inqueue, outqueue, initializer, initargs, maxtasks, wrap_exception)
    123         job, i, func, args, kwds = task
    124         try:
--> 125             result = (True, func(*args, **kwds))
    126         except Exception as e:
    127             if wrap_exception and func is not _helper_reraises_exception:

C:\Anaconda3\envs\work_pip2\lib\site-packages\pandas_profiling\model\pandas\summary_pandas.py in multiprocess_1d(args)
     70         """
     71         column, series = args
---> 72         return column, describe_1d(config, series, summarizer, typeset)
     73
     74     pool_size = config.pool_size

C:\Anaconda3\envs\work_pip2\lib\site-packages\multimethod\__init__.py in __call__(self, *args, **kwargs)
    313         func = self[tuple(func(arg) for func, arg in zip(self.type_checkers, args))]
    314         try:
--> 315             return func(*args, **kwargs)
    316         except TypeError as ex:
    317             raise DispatchError(f"Function {func.__code__}") from ex

C:\Anaconda3\envs\work_pip2\lib\site-packages\pandas_profiling\model\pandas\summary_pandas.py in pandas_describe_1d(config, series, summarizer, typeset)
     48     vtype = typeset.detect_type(series)
     49
---> 50     return summarizer.summarize(config, series, dtype=vtype)
     51
     52

C:\Anaconda3\envs\work_pip2\lib\site-packages\pandas_profiling\model\summarizer.py in summarize(self, config, series, dtype)
     37             object:
     38         """
---> 39         _, _, summary = self.handle(str(dtype), config, series, {"type": str(dtype)})
     40         return summary
     41

C:\Anaconda3\envs\work_pip2\lib\site-packages\pandas_profiling\model\handler.py in handle(self, dtype, *args, **kwargs)
     60         funcs = self.mapping.get(dtype, [])
     61         op = compose(funcs)
---> 62         return op(*args)
     63
     64

C:\Anaconda3\envs\work_pip2\lib\site-packages\pandas_profiling\model\handler.py in func2(*x)
     19                 return f(*x)
     20             else:
---> 21                 return f(*res)
     22
     23         return func2

C:\Anaconda3\envs\work_pip2\lib\site-packages\pandas_profiling\model\handler.py in func2(*x)
     19                 return f(*x)
     20             else:
---> 21                 return f(*res)
     22
     23         return func2

C:\Anaconda3\envs\work_pip2\lib\site-packages\pandas_profiling\model\handler.py in func2(*x)
     19                 return f(*x)
     20             else:
---> 21                 return f(*res)
     22
     23         return func2

C:\Anaconda3\envs\work_pip2\lib\site-packages\pandas_profiling\model\handler.py in func2(*x)
     15     def func(f: Callable, g: Callable) -> Callable:
     16         def func2(*x) -> Any:
---> 17             res = g(*x)
     18             if type(res) == bool:
     19                 return f(*x)

C:\Anaconda3\envs\work_pip2\lib\site-packages\multimethod\__init__.py in __call__(self, *args, **kwargs)
    313         func = self[tuple(func(arg) for func, arg in zip(self.type_checkers, args))]
    314         try:
--> 315             return func(*args, **kwargs)
    316         except TypeError as ex:
    317             raise DispatchError(f"Function {func.__code__}") from ex

C:\Anaconda3\envs\work_pip2\lib\site-packages\pandas_profiling\model\summary_algorithms.py in inner(config, series, summary)
     63         if not summary["hashable"]:
     64             return config, series, summary
---> 65         return fn(config, series, summary)
     66
     67     return inner

C:\Anaconda3\envs\work_pip2\lib\site-packages\pandas_profiling\model\summary_algorithms.py in inner(config, series, summary)
     80             series = series.dropna()
     81
---> 82         return fn(config, series, summary)
     83
     84     return inner

C:\Anaconda3\envs\work_pip2\lib\site-packages\pandas_profiling\model\pandas\describe_numeric_pandas.py in pandas_describe_numeric_1d(config, series, summary)
    118
    119     if chi_squared_threshold > 0.0:
--> 120         stats["chi_squared"] = chi_square(finite_values)
    121
    122     stats["range"] = stats["max"] - stats["min"]

C:\Anaconda3\envs\work_pip2\lib\site-packages\pandas_profiling\model\summary_algorithms.py in chi_square(values, histogram)
     50 ) -> dict:
     51     if histogram is None:
---> 52         histogram, _ = np.histogram(values, bins="auto")
     53     return dict(chisquare(histogram)._asdict())
     54

C:\Anaconda3\envs\work_pip2\lib\site-packages\numpy\core\overrides.py in histogram(*args, **kwargs)

C:\Anaconda3\envs\work_pip2\lib\site-packages\numpy\lib\histograms.py in histogram(a, bins, range, normed, weights, density)
    791     a, weights = _ravel_and_check_weights(a, weights)
    792
--> 793     bin_edges, uniform_bins = _get_bin_edges(a, bins, range, weights)
    794
    795     # Histogram is an integer or a float array depending on the weights.

C:\Anaconda3\envs\work_pip2\lib\site-packages\numpy\lib\histograms.py in _get_bin_edges(a, bins, range, weights)
    444
    445         # bin edges must be computed
--> 446         bin_edges = np.linspace(
    447             first_edge, last_edge, n_equal_bins + 1,
    448             endpoint=True, dtype=bin_type)

C:\Anaconda3\envs\work_pip2\lib\site-packages\numpy\core\overrides.py in linspace(*args, **kwargs)

C:\Anaconda3\envs\work_pip2\lib\site-packages\numpy\core\function_base.py in linspace(start, stop, num, endpoint, retstep, dtype, axis)
    133
    134     delta = stop - start
--> 135     y = _nx.arange(0, num, dtype=dt).reshape((-1,) + (1,) * ndim(delta))
    136     # In-place multiplication y *= delta/div is faster, but prevents the multiplicant
    137     # from overriding what class is produced, and thus prevents, e.g. use of Quantities,

MemoryError: Unable to allocate 9.14 PiB for an array with shape (1286991680162215,) and data type float64
```
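For scale, the figure in the last line of the traceback can be checked by hand. The snippet below is only arithmetic on the numbers printed in the error message (nothing is taken from the dataset itself); the variable names are made up for illustration.

```python
# Size of the array from the error message: one float64 (8 bytes) per element.
n_elements = 1_286_991_680_162_215   # shape reported in the MemoryError
bytes_per_float64 = 8

total_bytes = n_elements * bytes_per_float64
pib = total_bytes / 2**50            # 1 PiB = 2**50 bytes

# In the traceback, np.linspace is called with num = n_equal_bins + 1,
# so the number of bins requested is one less than the array length.
n_bins_requested = n_elements - 1

print(f"{pib:.2f} PiB for {n_bins_requested} bins")
```

In other words, `bins="auto"` asked for roughly 1.3e15 equal-width bins for a single column.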

Expected Behaviour

It should generate a report

Data Description

Here is the dataframe that caused the bug, exported to CSV: buggy_df.csv

I couldn't make it smaller: even splitting it into 2 parts results in normal processing of each part, without errors.

Code that reproduces the bug

import pandas as pd
from pandas_profiling import ProfileReport
#from ydata_profiling import ProfileReport - the same behavior

buggy_df = pd.read_csv('buggy_df.csv')
original_report = ProfileReport(buggy_df)
original_report.to_file("try6.html")

pandas-profiling version

v3.6.6

Dependencies

pandas==1.3.4
numpy==1.23.5
ydata_profiling==1.4.4

OS

Windows 10

cytostatika commented 1 year ago

I have the same issue with pandas 1.4.3, ydata-profiling 1.4.4 and numpy 1.23.5. Funnily enough, I only get the error when I filter some columns from my original dataset.

cytostatika commented 1 year ago

Adding a column with unique values for each row solved my problem. This will obviously not allow the profiling to find duplicate rows, but it's better than not being able to get the report at all.

It probably has something to do with the numpy.histogram bug and floats: https://stackoverflow.com/questions/67342168/memoryerror-when-using-pandas-profiling-profile-report
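A minimal sketch of how floats alone could trigger this, assuming NumPy's documented rule that `bins="auto"` uses the narrower of the Sturges and Freedman-Diaconis bin widths. The helper `estimated_auto_bins` and the synthetic data are hypothetical and are not taken from buggy_df.csv. When the interquartile range is tiny but nonzero (here, two values that differ only by float rounding), the Freedman-Diaconis width collapses and the bin count explodes, even though the data contains no large outliers.

```python
import numpy as np

def estimated_auto_bins(values):
    """Estimate the bin count that bins="auto" would ask for, without allocating bins.

    Hypothetical helper: it re-implements the documented Sturges and
    Freedman-Diaconis widths by hand and returns the count as a float.
    """
    a = np.asarray(values, dtype=float)
    a = a[np.isfinite(a)]
    n = a.size
    span = a.max() - a.min()
    q25, q75 = np.percentile(a, [25, 75])
    fd_width = 2.0 * (q75 - q25) / n ** (1.0 / 3.0)    # Freedman-Diaconis
    sturges_width = span / (np.log2(n) + 1.0)          # Sturges
    # Assumed rule: fall back to Sturges only when the FD width is exactly zero.
    width = min(fd_width, sturges_width) if fd_width > 0 else sturges_width
    if width <= 0:
        return 1.0
    return float(np.ceil(span / width))

# Synthetic column: values between 0 and 10, but the middle 60% of rows are
# either 0.3 or 0.1 + 0.2, which differ only in the last floating-point bit.
data = np.concatenate([
    np.linspace(0.0, 0.29, 200),
    np.full(300, 0.3),
    np.full(300, 0.1 + 0.2),
    np.linspace(0.31, 10.0, 200),
])

print("pathological column:", estimated_auto_bins(data))
print("well-behaved column:", estimated_auto_bins(np.arange(1000.0)))
```

This would also be consistent with the error appearing or disappearing when rows are split or filtered, since that shifts the quartiles.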

skunkyevil commented 1 year ago

Actually, this is indeed an np.histogram bug. The following code gives the same error with the same dataframe:

import numpy as np
import pandas as pd

buggy_df = pd.read_csv('buggy_df.csv')

a = buggy_df['str3d_net_profit_long'].dropna().to_numpy()
np.histogram(a, bins='auto')

fabclmnt commented 1 year ago

Hi @skunkyevil ,

From what you have reported, you are using an older version of the package. We are currently on version 4.5.0 of ydata-profiling. Can you please let me know if the error remains?

Also, can you please provide more details about your dataset, so we can have a better understanding?

Cheers.

skunkyevil commented 1 year ago

Hi @fabclmnt ,

Just checked with ydata-profiling version 4.5.0: the error still persists. I've included the dataset in my initial post: buggy_df.csv

I don't think this issue deserves separate effort; it would be better to solve the more general np.histogram bug. The only difference from the general case is that my dataset does not contain any huge outliers.
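Until the general np.histogram behaviour is addressed, one possible guard is to work out the bin count in plain floating point and cap it before NumPy allocates anything. This is a rough sketch, assuming the Sturges / Freedman-Diaconis rule that NumPy documents for `bins="auto"`. `capped_histogram` and `max_bins` are hypothetical names, not part of NumPy or ydata-profiling, and the test data is synthetic.

```python
import numpy as np

def capped_histogram(values, max_bins=256):
    """Histogram with an 'auto'-style bin count, never exceeding max_bins.

    Hypothetical workaround: the bin count is estimated by hand and clamped,
    then passed to np.histogram as an explicit integer.
    """
    a = np.asarray(values, dtype=float)
    a = a[np.isfinite(a)]
    if a.size == 0 or a.max() == a.min():
        return np.histogram(a, bins=1)

    span = a.max() - a.min()
    sturges_bins = np.log2(a.size) + 1.0
    q25, q75 = np.percentile(a, [25, 75])
    iqr = q75 - q25
    # Freedman-Diaconis bin count; may overflow to inf for a tiny IQR.
    with np.errstate(over="ignore"):
        fd_bins = span * np.cbrt(a.size) / (2.0 * iqr) if iqr > 0 else 0.0

    wanted = np.ceil(max(sturges_bins, fd_bins))
    n_bins = int(min(wanted, max_bins))   # the cap makes int() safe even for inf
    return np.histogram(a, bins=n_bins)

# Synthetic column with a near-zero (but nonzero) interquartile range.
tricky = np.concatenate([
    np.linspace(0.0, 0.29, 200),
    np.full(300, 0.3),
    np.full(300, 0.1 + 0.2),
    np.linspace(0.31, 10.0, 200),
])
counts, edges = capped_histogram(tricky)
print(len(counts), "bins,", counts.sum(), "values counted")
```

The cap of 256 is arbitrary; any downstream statistic computed from the histogram (such as the chi-squared test in the traceback) would then be based on the clamped binning rather than on what `bins="auto"` would have produced.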