ydataai / ydata-profiling

1 Line of code data quality profiling & exploratory data analysis for Pandas and Spark DataFrames.
https://docs.profiling.ydata.ai
MIT License

memory leak in histogram with 'auto' bins #1330

Open eromoe opened 1 year ago

eromoe commented 1 year ago

Current Behaviour

I saw that this PR had been merged: https://github.com/ydataai/ydata-profiling/pull/1308

But I still get a memory error:

---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
~\AppData\Local\Temp\ipykernel_75412\2199735439.py in <module>
      1 r = ProfileReport(df, title='fina_price')
----> 2 r.to_file('fina_price.html')

c:\Users\ufo\anaconda3\lib\site-packages\typeguard\__init__.py in wrapper(*args, **kwargs)
   1031         memo = _CallMemo(python_func, _localns, args=args, kwargs=kwargs)
   1032         check_argument_types(memo)
-> 1033         retval = func(*args, **kwargs)
   1034         try:
   1035             check_return_type(retval, memo)

E:\Workspace\github_me\ydata-profiling\src\ydata_profiling\profile_report.py in to_file(self, output_file, silent)
    350                 create_html_assets(self.config, output_file)
    351 
--> 352             data = self.to_html()
    353 
    354             if output_file.suffix != ".html":

c:\Users\ufo\anaconda3\lib\site-packages\typeguard\__init__.py in wrapper(*args, **kwargs)
   1031         memo = _CallMemo(python_func, _localns, args=args, kwargs=kwargs)
   1032         check_argument_types(memo)
-> 1033         retval = func(*args, **kwargs)
   1034         try:
   1035             check_return_type(retval, memo)

E:\Workspace\github_me\ydata-profiling\src\ydata_profiling\profile_report.py in to_html(self)
    463 
    464         """
--> 465         return self.html
    466 
    467     def to_json(self) -> str:

c:\Users\ufo\anaconda3\lib\site-packages\typeguard\__init__.py in wrapper(*args, **kwargs)
   1031         memo = _CallMemo(python_func, _localns, args=args, kwargs=kwargs)
   1032         check_argument_types(memo)
-> 1033         retval = func(*args, **kwargs)
   1034         try:
   1035             check_return_type(retval, memo)

E:\Workspace\github_me\ydata-profiling\src\ydata_profiling\profile_report.py in html(self)
    272     def html(self) -> str:
    273         if self._html is None:
--> 274             self._html = self._render_html()
    275         return self._html
    276 

c:\Users\ufo\anaconda3\lib\site-packages\typeguard\__init__.py in wrapper(*args, **kwargs)
   1031         memo = _CallMemo(python_func, _localns, args=args, kwargs=kwargs)
   1032         check_argument_types(memo)
-> 1033         retval = func(*args, **kwargs)
   1034         try:
   1035             check_return_type(retval, memo)

E:\Workspace\github_me\ydata-profiling\src\ydata_profiling\profile_report.py in _render_html(self)
    380         from ydata_profiling.report.presentation.flavours import HTMLReport
    381 
--> 382         report = self.report
    383 
    384         with tqdm(

c:\Users\ufo\anaconda3\lib\site-packages\typeguard\__init__.py in wrapper(*args, **kwargs)
   1031         memo = _CallMemo(python_func, _localns, args=args, kwargs=kwargs)
   1032         check_argument_types(memo)
-> 1033         retval = func(*args, **kwargs)
   1034         try:
   1035             check_return_type(retval, memo)

E:\Workspace\github_me\ydata-profiling\src\ydata_profiling\profile_report.py in report(self)
    266     def report(self) -> Root:
    267         if self._report is None:
--> 268             self._report = get_report_structure(self.config, self.description_set)
    269         return self._report
    270 

c:\Users\ufo\anaconda3\lib\site-packages\typeguard\__init__.py in wrapper(*args, **kwargs)
   1031         memo = _CallMemo(python_func, _localns, args=args, kwargs=kwargs)
   1032         check_argument_types(memo)
-> 1033         retval = func(*args, **kwargs)
   1034         try:
   1035             check_return_type(retval, memo)

E:\Workspace\github_me\ydata-profiling\src\ydata_profiling\profile_report.py in description_set(self)
    248     def description_set(self) -> BaseDescription:
    249         if self._description_set is None:
--> 250             self._description_set = describe_df(
    251                 self.config,
    252                 self.df,

E:\Workspace\github_me\ydata-profiling\src\ydata_profiling\model\describe.py in describe(config, df, summarizer, typeset, sample)
     70         # Variable-specific
     71         pbar.total += len(df.columns)
---> 72         series_description = get_series_descriptions(
     73             config, df, summarizer, typeset, pbar
     74         )

c:\Users\ufo\anaconda3\lib\site-packages\multimethod\__init__.py in __call__(self, *args, **kwargs)
    182     def __call__(self, *args, **kwargs):
    183         """Resolve and dispatch to best method."""
--> 184         return self[tuple(map(self.get_type, args))](*args, **kwargs)
    185 
    186     def evaluate(self):

E:\Workspace\github_me\ydata-profiling\src\ydata_profiling\model\pandas\summary_pandas.py in pandas_get_series_descriptions(config, df, summarizer, typeset, pbar)
     97         # TODO: use `Pool` for Linux-based systems
     98         with multiprocessing.pool.ThreadPool(pool_size) as executor:
---> 99             for i, (column, description) in enumerate(
    100                 executor.imap_unordered(multiprocess_1d, args)
    101             ):

c:\Users\ufo\anaconda3\lib\multiprocessing\pool.py in next(self, timeout)
    868         if success:
    869             return value
--> 870         raise value
    871 
    872     __next__ = next                    # XXX

c:\Users\ufo\anaconda3\lib\multiprocessing\pool.py in worker(inqueue, outqueue, initializer, initargs, maxtasks, wrap_exception)
    123         job, i, func, args, kwds = task
    124         try:
--> 125             result = (True, func(*args, **kwds))
    126         except Exception as e:
    127             if wrap_exception and func is not _helper_reraises_exception:

E:\Workspace\github_me\ydata-profiling\src\ydata_profiling\model\pandas\summary_pandas.py in multiprocess_1d(args)
     77         """
     78         column, series = args
---> 79         return column, describe_1d(config, series, summarizer, typeset)
     80 
     81     pool_size = config.pool_size

c:\Users\ufo\anaconda3\lib\site-packages\multimethod\__init__.py in __call__(self, *args, **kwargs)
    182     def __call__(self, *args, **kwargs):
    183         """Resolve and dispatch to best method."""
--> 184         return self[tuple(map(self.get_type, args))](*args, **kwargs)
    185 
    186     def evaluate(self):

E:\Workspace\github_me\ydata-profiling\src\ydata_profiling\model\pandas\summary_pandas.py in pandas_describe_1d(config, series, summarizer, typeset)
     55 
     56     typeset.type_schema[series.name] = vtype
---> 57     return summarizer.summarize(config, series, dtype=vtype)
     58 
     59 

E:\Workspace\github_me\ydata-profiling\src\ydata_profiling\model\summarizer.py in summarize(self, config, series, dtype)
     40             object:
     41         """
---> 42         _, _, summary = self.handle(str(dtype), config, series, {"type": str(dtype)})
     43         return summary
     44 

E:\Workspace\github_me\ydata-profiling\src\ydata_profiling\model\handler.py in handle(self, dtype, *args, **kwargs)
     60         funcs = self.mapping.get(dtype, [])
     61         op = compose(funcs)
---> 62         return op(*args)
     63 
     64 

E:\Workspace\github_me\ydata-profiling\src\ydata_profiling\model\handler.py in func2(*x)
     19                 return f(*x)
     20             else:
---> 21                 return f(*res)
     22 
     23         return func2

E:\Workspace\github_me\ydata-profiling\src\ydata_profiling\model\handler.py in func2(*x)
     19                 return f(*x)
     20             else:
---> 21                 return f(*res)
     22 
     23         return func2

E:\Workspace\github_me\ydata-profiling\src\ydata_profiling\model\handler.py in func2(*x)
     19                 return f(*x)
     20             else:
---> 21                 return f(*res)
     22 
     23         return func2

E:\Workspace\github_me\ydata-profiling\src\ydata_profiling\model\handler.py in func2(*x)
     15     def func(f: Callable, g: Callable) -> Callable:
     16         def func2(*x) -> Any:
---> 17             res = g(*x)
     18             if type(res) == bool:
     19                 return f(*x)

c:\Users\ufo\anaconda3\lib\site-packages\multimethod\__init__.py in __call__(self, *args, **kwargs)
    182     def __call__(self, *args, **kwargs):
    183         """Resolve and dispatch to best method."""
--> 184         return self[tuple(map(self.get_type, args))](*args, **kwargs)
    185 
    186     def evaluate(self):

E:\Workspace\github_me\ydata-profiling\src\ydata_profiling\model\summary_algorithms.py in inner(config, series, summary)
     66         if not summary["hashable"]:
     67             return config, series, summary
---> 68         return fn(config, series, summary)
     69 
     70     return inner

E:\Workspace\github_me\ydata-profiling\src\ydata_profiling\model\summary_algorithms.py in inner(config, series, summary)
     83             series = series.dropna()
     84 
---> 85         return fn(config, series, summary)
     86 
     87     return inner

E:\Workspace\github_me\ydata-profiling\src\ydata_profiling\model\pandas\describe_numeric_pandas.py in pandas_describe_numeric_1d(config, series, summary)
    118 
    119     if chi_squared_threshold > 0.0:
--> 120         stats["chi_squared"] = chi_square(finite_values)
    121 
    122     stats["range"] = stats["max"] - stats["min"]

E:\Workspace\github_me\ydata-profiling\src\ydata_profiling\model\summary_algorithms.py in chi_square(values, histogram)
     52 ) -> dict:
     53     if histogram is None:
---> 54         bins = np.histogram_bin_edges(values, bins="auto")
     55         histogram, _ = np.histogram(values, bins=bins)
     56     return dict(chisquare(histogram)._asdict())

c:\Users\ufo\anaconda3\lib\site-packages\numpy\core\overrides.py in histogram_bin_edges(*args, **kwargs)

c:\Users\ufo\anaconda3\lib\site-packages\numpy\lib\histograms.py in histogram_bin_edges(a, bins, range, weights)
    667     """
    668     a, weights = _ravel_and_check_weights(a, weights)
--> 669     bin_edges, _ = _get_bin_edges(a, bins, range, weights)
    670     return bin_edges
    671 

c:\Users\ufo\anaconda3\lib\site-packages\numpy\lib\histograms.py in _get_bin_edges(a, bins, range, weights)
    444 
    445         # bin edges must be computed
--> 446         bin_edges = np.linspace(
    447             first_edge, last_edge, n_equal_bins + 1,
    448             endpoint=True, dtype=bin_type)

c:\Users\ufo\anaconda3\lib\site-packages\numpy\core\overrides.py in linspace(*args, **kwargs)

c:\Users\ufo\anaconda3\lib\site-packages\numpy\core\function_base.py in linspace(start, stop, num, endpoint, retstep, dtype, axis)
    133 
    134     delta = stop - start
--> 135     y = _nx.arange(0, num, dtype=dt).reshape((-1,) + (1,) * ndim(delta))
    136     # In-place multiplication y *= delta/div is faster, but prevents the multiplicant
    137     # from overriding what class is produced, and thus prevents, e.g. use of Quantities,

MemoryError: Unable to allocate 1.57 TiB for an array with shape (215683337916,) and data type float64
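
As a quick sanity check on the numbers in the error message, the requested size follows directly from the array shape and the 8-byte float64 element size. This is only arithmetic on the values shown in the traceback; the variable names are illustrative.

```python
# Shape reported in the MemoryError above: one float64 per bin edge.
n_elements = 215_683_337_916
bytes_per_float64 = 8

bytes_needed = n_elements * bytes_per_float64
tib_needed = bytes_needed / 2**40  # 1 TiB = 2**40 bytes

# np.linspace is called with n_equal_bins + 1 points (see the traceback),
# so the number of bins requested is one less than the number of edges.
n_bins = n_elements - 1

print(f"{tib_needed:.2f} TiB for {n_bins:,} bins")
```

So the allocation is for the bin edges of a histogram with roughly 2 x 10^11 bins, not for the data itself.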

Expected Behaviour

no error

Data Description

Not available

Code that reproduces the bug

No response

pandas-profiling version

latest develop

Dependencies

pandas  1.5.3
numpy  1.23.5

OS

win10

fabclmnt commented 1 year ago

Hi @eromoe,

Can you please provide more details on your environment (Python version, ydata-profiling version, etc.)?

Based on the error you have provided, it does seem to be a different issue from #1308.

eromoe commented 1 year ago

Python 3.9.16, pandas 1.5.3, numpy 1.23.5, ydata-profiling develop (2023-05-18)

fabclmnt commented 1 year ago

Hi @eromoe, may I suggest that you use a version that is not under development? Development versions are not guaranteed to be 100% functional. We are currently working on a new release that updates major packages, which can affect the experience of using the package.

Based on the details provided, everything points to a lack of memory on your machine.

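Have you tried one of the following strategies:

  • Convert your float variables to float 32 or 16 depending on the precision that you require? Similar for integers
  • If you have timestamps double check the precision that you require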

If you can share the size of your data that would be appreciated as well.

eromoe commented 1 year ago

@fabclmnt

  1. I ran into a memory leak, so I found this PR: https://github.com/ydataai/ydata-profiling/pull/1308
  2. I switched to develop to check that the above PR works
  3. But the memory leak still exists

So I opened this issue: that PR doesn't fix the memory leak.


Have you tried one of the following strategies:

  • Convert your float variables to float 32 or 16 depending on the precision that you require? Similar for integers
  • If you have timestamps double check the precision that you require

I will give it a try next week.
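
The dtype-conversion strategy quoted above could look roughly like the sketch below. It assumes a pandas DataFrame with float64 and int64 columns; `downcast_numeric` and the column names are hypothetical helpers for illustration, not part of ydata-profiling.

```python
import numpy as np
import pandas as pd


def downcast_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with float64 columns as float32 and int64 columns downcast."""
    float_cols = df.select_dtypes(include=["float64"]).columns
    # astype with a mapping converts only the listed columns and returns a new frame.
    out = df.astype({col: "float32" for col in float_cols})
    for col in out.select_dtypes(include=["int64"]).columns:
        # Picks the smallest signed integer dtype that can hold the values.
        out[col] = pd.to_numeric(out[col], downcast="integer")
    return out


# Illustrative data only; not the dataset from this issue.
df = pd.DataFrame(
    {
        "price": np.linspace(0.0, 1.0, 1000),
        "volume": np.arange(1000, dtype="int64"),
    }
)
small = downcast_numeric(df)
before = df.memory_usage(deep=True).sum()
after = small.memory_usage(deep=True).sum()
print(small.dtypes, before, after)
```

This reduces the footprint of the DataFrame itself. It does not by itself limit how many histogram bins are requested later, since that depends on the values rather than on their dtype.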

fabclmnt commented 1 year ago

The issue seems to be fixed by the PR you linked; we have validated it.

That is why I am also asking for the size of your data.

eromoe commented 1 year ago

I tried converting the data to np.float16, but I still get the same error. Here is the test dataset: https://mega.nz/file/xZ1xyIBL#0BM_WghcbQTJO6E1N4wpEeoESwxTr696UBlc85SnmlA

fabclmnt commented 1 year ago

@eromoe please confirm your system's computational power.

And please provide the code you are using to compute the profiling.

eromoe commented 1 year ago

Nothing complicated, just load the file and use ProfileReport:

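from ydata_profiling import ProfileReport

# df is the DataFrame loaded from the dataset linked above
r = ProfileReport(df, title='fina_price')
r.to_file('fina_price.html')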

As you can see from the error message, it requires over 1 TiB of memory, regardless of the system's computational power. I have a 12-core AMD CPU and 64 GB of RAM.
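
One plausible way a purely numeric column can trigger such a request is sketched below. It is not verified against the dataset in this thread. With `bins="auto"`, NumPy takes the narrower of the Sturges and Freedman-Diaconis bin widths, and the Freedman-Diaconis width is proportional to the interquartile range. A column whose values are tightly clustered but which contains a far outlier therefore gets a tiny bin width over a huge range. `estimate_auto_bins` re-implements that rule without allocating any edges, and `capped_histogram` with its `max_bins` parameter is a hypothetical guard, not the library's actual fix.

```python
import math

import numpy as np


def estimate_auto_bins(values) -> int:
    """Estimate the bin count that bins='auto' would request, without allocating edges."""
    values = np.asarray(values, dtype=np.float64)
    n = values.size
    if n == 0:
        return 1
    data_range = float(values.max() - values.min())
    if data_range == 0.0:
        return 1
    q75, q25 = np.percentile(values, [75, 25])
    fd_width = 2.0 * float(q75 - q25) * n ** (-1.0 / 3.0)  # Freedman-Diaconis
    sturges_width = data_range / (math.log2(n) + 1.0)      # Sturges
    # 'auto' uses the smaller width; if the IQR is zero it falls back to Sturges.
    width = min(fd_width, sturges_width) if fd_width > 0 else sturges_width
    return int(math.ceil(data_range / width))


def capped_histogram(values, max_bins: int = 250):
    """Histogram with the 'auto' bin count limited to max_bins (hypothetical guard)."""
    n_bins = min(estimate_auto_bins(values), max_bins)
    return np.histogram(values, bins=n_bins)


# Synthetic data: 100,000 tightly clustered values plus one far outlier.
rng = np.random.default_rng(0)
values = np.concatenate([rng.normal(0.0, 1e-3, 100_000), [1e9]])

n_bins = estimate_auto_bins(values)
counts, edges = capped_histogram(values)
print(f"'auto' would request about {n_bins:.3e} bins; capped to {len(counts)}")
```

Running the estimator on each numeric column of a DataFrame is a cheap way to find which column is responsible before calling the profiler.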

aquemy commented 1 year ago

Hi,

It is not a memory leak. It fails on a request for about 1 TB of memory, and the most likely explanation is that you have categorical features with extremely high cardinality. Chi-square will compute a contingency table of dimension n x n, where n is the number of distinct categories.

Try removing the first column, which is a unique index and adds no value. With this column, the chi-square computation will request a matrix of size 200k x 200k, which might indeed require around 1 TB of memory.
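
As a rough size check on that estimate, the snippet below computes the footprint of a dense 200,000 x 200,000 table. The dense layout and the 8-byte cell size are assumptions made for illustration, not taken from the library.

```python
n_categories = 200_000
bytes_per_cell = 8  # assumed: float64 or int64 cells

table_bytes = n_categories * n_categories * bytes_per_cell
table_gb = table_bytes / 10**9   # decimal gigabytes
table_tib = table_bytes / 2**40  # binary tebibytes

print(f"{table_gb:.0f} GB ({table_tib:.2f} TiB)")
```

Under these assumptions the table is in the hundreds of gigabytes, which is within an order of magnitude of the 1.57 TiB reported in the traceback but not the same figure.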

eromoe commented 1 year ago

Sorry, but I want to make sure: did you test the file I sent? There are no categorical features in my dataset. The first column is the index of the original dataframe, so you can drop it, and no column has high cardinality (all columns are numerical, so cardinality does not apply). I used datatile to reconfirm this:

image