ydataai / ydata-profiling

1 Line of code data quality profiling & exploratory data analysis for Pandas and Spark DataFrames.
https://docs.profiling.ydata.ai
MIT License
12.52k stars 1.68k forks

BUGs with pandas-profiling utils package opening and preparing files for ProfileReport() #952

Closed richlysakowski closed 2 years ago

richlysakowski commented 2 years ago

Describe the bug

I get the following error in a fresh conda environment, set up as suggested in the pandas-profiling documentation:

IndexError: only integers, slices (:), ellipsis (...), numpy.newaxis (None) and integer or boolean arrays are valid indices

The simple website example works...

import numpy as np
import pandas as pd
from pandas_profiling import ProfileReport

df = pd.DataFrame(np.random.rand(100, 5), columns=["a", "b", "c", "d", "e"])
profile = ProfileReport(df, title="Pandas Profiling Report")
profile

but as soon as a real dataset is loaded pandas_profiling throws an error related to numpy.

To Reproduce

Download and save the dataset locally:

https://data.cityofchicago.org/api/views/xzkq-xp2w/rows.csv?accessType=DOWNLOAD

file_name = r'Current_Employee_Names__Salaries__and_Position_Titles.csv'

df = pd.read_csv(file_name)

profile = ProfileReport(df)
profile

Gives the following output:

It starts to analyze the dataset and then throws the following errors: Summarize dataset: 31% 4/13 [00:00<00:00, 9.22it/s, Describe variable:Job Titles]


IndexError Traceback (most recent call last) File C:\Programdata\Anaconda3\envs\pandas-profiling\lib\site-packages\IPython\core\formatters.py:343, in BaseFormatter.__call__(self, obj) 341 method = get_real_method(obj, self.print_method) 342 if method is not None: --> 343 return method() 344 return None 345 else:

File C:\Programdata\Anaconda3\envs\pandas-profiling\lib\site-packages\pandas_profiling\profile_report.py:418, in ProfileReport._repr_html_(self) 416 def _repr_html_(self) -> None: 417 """The ipython notebook widgets user interface gets called by the jupyter notebook.""" --> 418 self.to_notebook_iframe()

File C:\Programdata\Anaconda3\envs\pandas-profiling\lib\site-packages\pandas_profiling\profile_report.py:398, in ProfileReport.to_notebook_iframe(self) 396 with warnings.catch_warnings(): 397 warnings.simplefilter("ignore") --> 398 display(get_notebook_iframe(self.config, self))

File C:\Programdata\Anaconda3\envs\pandas-profiling\lib\site-packages\pandas_profiling\report\presentation\flavours\widget\notebook.py:75, in get_notebook_iframe(config, profile) 73 output = get_notebook_iframe_src(config, profile) 74 elif attribute == IframeAttribute.srcdoc: ---> 75 output = get_notebook_iframe_srcdoc(config, profile) 76 else: 77 raise ValueError( 78 f'Iframe Attribute can be "src" or "srcdoc" (current: {attribute}).' 79 )

File C:\Programdata\Anaconda3\envs\pandas-profiling\lib\site-packages\pandas_profiling\report\presentation\flavours\widget\notebook.py:29, in get_notebook_iframe_srcdoc(config, profile) 27 width = config.notebook.iframe.width 28 height = config.notebook.iframe.height ---> 29 src = html.escape(profile.to_html()) 31 iframe = f'' 33 return HTML(iframe)

File C:\Programdata\Anaconda3\envs\pandas-profiling\lib\site-packages\pandas_profiling\profile_report.py:368, in ProfileReport.to_html(self) 360 def to_html(self) -> str: 361 """Generate and return complete template as lengthy string 362 for using with frameworks. 363 (...) 366 367 """ --> 368 return self.html

File C:\Programdata\Anaconda3\envs\pandas-profiling\lib\site-packages\pandas_profiling\profile_report.py:185, in ProfileReport.html(self) 182 @property 183 def html(self) -> str: 184 if self._html is None: --> 185 self._html = self._render_html() 186 return self._html

File C:\Programdata\Anaconda3\envs\pandas-profiling\lib\site-packages\pandas_profiling\profile_report.py:287, in ProfileReport._render_html(self) 284 def _render_html(self) -> str: 285 from pandas_profiling.report.presentation.flavours import HTMLReport --> 287 report = self.report 289 with tqdm( 290 total=1, desc="Render HTML", disable=not self.config.progress_bar 291 ) as pbar: 292 html = HTMLReport(copy.deepcopy(report)).render( 293 nav=self.config.html.navbar_show, 294 offline=self.config.html.use_local_assets, (...) 302 version=self.description_set["package"]["pandas_profiling_version"], 303 )

File C:\Programdata\Anaconda3\envs\pandas-profiling\lib\site-packages\pandas_profiling\profile_report.py:179, in ProfileReport.report(self) 176 @property 177 def report(self) -> Root: 178 if self._report is None: --> 179 self._report = get_report_structure(self.config, self.description_set) 180 return self._report

File C:\Programdata\Anaconda3\envs\pandas-profiling\lib\site-packages\pandas_profiling\profile_report.py:161, in ProfileReport.description_set(self) 158 @property 159 def description_set(self) -> Dict[str, Any]: 160 if self._description_set is None: --> 161 self._description_set = describe_df( 162 self.config, 163 self.df, 164 self.summarizer, 165 self.typeset, 166 self._sample, 167 ) 168 return self._description_set

File C:\Programdata\Anaconda3\envs\pandas-profiling\lib\site-packages\pandas_profiling\model\describe.py:71, in describe(config, df, summarizer, typeset, sample) 69 # Variable-specific 70 pbar.total += len(df.columns) ---> 71 series_description = get_series_descriptions( 72 config, df, summarizer, typeset, pbar 73 ) 75 pbar.set_postfix_str("Get variable types") 76 pbar.total += 1

File C:\Programdata\Anaconda3\envs\pandas-profiling\lib\site-packages\multimethod\__init__.py:184, in multimethod.__call__(self, *args, **kwargs) 182 def __call__(self, *args, **kwargs): 183 """Resolve and dispatch to best method.""" --> 184 return self[tuple(map(self.get_type, args))](*args, **kwargs)

File C:\Programdata\Anaconda3\envs\pandas-profiling\lib\site-packages\pandas_profiling\model\pandas\summary_pandas.py:92, in pandas_get_series_descriptions(config, df, summarizer, typeset, pbar) 89 else: 90 # TODO: use Pool for Linux-based systems 91 with multiprocessing.pool.ThreadPool(pool_size) as executor: ---> 92 for i, (column, description) in enumerate( 93 executor.imap_unordered(multiprocess_1d, args) 94 ): 95 pbar.set_postfix_str(f"Describe variable:{column}") 96 series_description[column] = description

File C:\Programdata\Anaconda3\envs\pandas-profiling\lib\multiprocessing\pool.py:868, in IMapIterator.next(self, timeout) 866 if success: 867 return value --> 868 raise value

File C:\Programdata\Anaconda3\envs\pandas-profiling\lib\multiprocessing\pool.py:125, in worker(inqueue, outqueue, initializer, initargs, maxtasks, wrap_exception) 123 job, i, func, args, kwds = task 124 try: --> 125 result = (True, func(*args, **kwds)) 126 except Exception as e: 127 if wrap_exception and func is not _helper_reraises_exception:

File C:\Programdata\Anaconda3\envs\pandas-profiling\lib\site-packages\pandas_profiling\model\pandas\summary_pandas.py:72, in pandas_get_series_descriptions.<locals>.multiprocess_1d(args) 62 """Wrapper to process series in parallel. 63 64 Args: (...) 69 A tuple with column and the series description. 70 """ 71 column, series = args ---> 72 return column, describe_1d(config, series, summarizer, typeset)

File C:\Programdata\Anaconda3\envs\pandas-profiling\lib\site-packages\multimethod\__init__.py:184, in multimethod.__call__(self, *args, **kwargs) 182 def __call__(self, *args, **kwargs): 183 """Resolve and dispatch to best method.""" --> 184 return self[tuple(map(self.get_type, args))](*args, **kwargs)

File C:\Programdata\Anaconda3\envs\pandas-profiling\lib\site-packages\pandas_profiling\model\pandas\summary_pandas.py:50, in pandas_describe_1d(config, series, summarizer, typeset) 45 else: 46 # Detect variable types from pandas dataframe (df.dtypes). 47 # [new dtypes, changed using astype function are now considered] 48 vtype = typeset.detect_type(series) ---> 50 return summarizer.summarize(config, series, dtype=vtype)

File C:\Programdata\Anaconda3\envs\pandas-profiling\lib\site-packages\pandas_profiling\model\summarizer.py:37, in BaseSummarizer.summarize(self, config, series, dtype) 29 def summarize( 30 self, config: Settings, series: pd.Series, dtype: Type[VisionsBaseType] 31 ) -> dict: 32 """ 33 34 Returns: 35 object: 36 """ ---> 37 _, _, summary = self.handle(str(dtype), config, series, {"type": str(dtype)}) 38 return summary

File C:\Programdata\Anaconda3\envs\pandas-profiling\lib\site-packages\pandas_profiling\model\handler.py:62, in Handler.handle(self, dtype, *args, **kwargs) 60 funcs = self.mapping.get(dtype, []) 61 op = compose(funcs) ---> 62 return op(*args)

File C:\Programdata\Anaconda3\envs\pandas-profiling\lib\site-packages\pandas_profiling\model\handler.py:21, in compose.<locals>.func.<locals>.func2(x) 19 return f(x) 20 else: ---> 21 return f(*res)

File C:\Programdata\Anaconda3\envs\pandas-profiling\lib\site-packages\pandas_profiling\model\handler.py:21, in compose.<locals>.func.<locals>.func2(x) 19 return f(x) 20 else: ---> 21 return f(*res)

File C:\Programdata\Anaconda3\envs\pandas-profiling\lib\site-packages\pandas_profiling\model\handler.py:21, in compose.<locals>.func.<locals>.func2(x) 19 return f(x) 20 else: ---> 21 return f(*res)

File C:\Programdata\Anaconda3\envs\pandas-profiling\lib\site-packages\pandas_profiling\model\handler.py:17, in compose.<locals>.func.<locals>.func2(x) 16 def func2(x) -> Any: ---> 17 res = g(x) 18 if type(res) == bool: 19 return f(x)

File C:\Programdata\Anaconda3\envs\pandas-profiling\lib\site-packages\multimethod\__init__.py:184, in multimethod.__call__(self, *args, **kwargs) 182 def __call__(self, *args, **kwargs): 183 """Resolve and dispatch to best method.""" --> 184 return self[tuple(map(self.get_type, args))](*args, **kwargs)

File C:\Programdata\Anaconda3\envs\pandas-profiling\lib\site-packages\pandas_profiling\model\summary_algorithms.py:65, in series_hashable.<locals>.inner(config, series, summary) 63 if not summary["hashable"]: 64 return config, series, summary ---> 65 return fn(config, series, summary)

File C:\Programdata\Anaconda3\envs\pandas-profiling\lib\site-packages\pandas_profiling\model\summary_algorithms.py:82, in series_handle_nulls.<locals>.inner(config, series, summary) 79 if series.hasnans: 80 series = series.dropna() ---> 82 return fn(config, series, summary)

File C:\Programdata\Anaconda3\envs\pandas-profiling\lib\site-packages\pandas_profiling\model\pandas\describe_categorical_pandas.py:205, in pandas_describe_categorical_1d(config, series, summary) 202 summary["chi_squared"] = chi_square(histogram=value_counts.values) 204 if config.vars.cat.length: --> 205 summary.update(length_summary_vc(value_counts)) 206 summary.update( 207 histogram_compute( 208 config, (...) 213 ) 214 ) 216 if config.vars.cat.characters:

File C:\Programdata\Anaconda3\envs\pandas-profiling\lib\site-packages\pandas_profiling\model\pandas\describe_categorical_pandas.py:162, in length_summary_vc(vc) 156 length_counts = length_counts.groupby(level=0, sort=False).sum() 157 length_counts = length_counts.sort_values(ascending=False) 159 summary = { 160 "max_length": np.max(length_counts.index), 161 "mean_length": np.average(length_counts.index, weights=length_counts.values), --> 162 "median_length": weighted_median( 163 length_counts.index.values, weights=length_counts.values 164 ), 165 "min_length": np.min(length_counts.index), 166 "length_histogram": length_counts, 167 } 169 return summary

File C:\Programdata\Anaconda3\envs\pandas-profiling\lib\site-packages\pandas_profiling\model\pandas\utils_pandas.py:13, in weighted_median(data, weights) 11 midpoint = 0.5 * sum(s_weights) 12 if any(weights > midpoint): ---> 13 w_median = (data[weights == np.max(weights)])[0] 14 else: 15 cs_weights = np.cumsum(s_weights)

IndexError: only integers, slices (:), ellipsis (...), numpy.newaxis (None) and integer or boolean arrays are valid indices
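
The last frame shows the failure at a boolean-mask index inside `weighted_median`. The thread does not establish the root cause, so what follows is only a minimal sketch of a weighted median that first coerces both inputs to plain NumPy arrays. `weighted_median_sketch` is a hypothetical name, not the library's function, and it assumes numeric data with non-negative weights; it returns the lower weighted median.

```python
import numpy as np


def weighted_median_sketch(data, weights):
    """Hypothetical helper (not pandas-profiling's weighted_median).

    Returns the smallest value whose cumulative weight reaches half of
    the total weight (the lower weighted median).
    """
    # Coerce to plain float ndarrays so indexing and comparisons behave
    # like ordinary NumPy operations regardless of the input container.
    data = np.asarray(data, dtype=float)
    weights = np.asarray(weights, dtype=float)

    # Sort values and carry their weights along.
    order = np.argsort(data)
    data, weights = data[order], weights[order]

    # First position where the running weight reaches the midpoint.
    cumulative = np.cumsum(weights)
    midpoint = 0.5 * cumulative[-1]
    idx = int(np.searchsorted(cumulative, midpoint))
    return data[idx]


# Example: string lengths 3, 1, 2 occurring 1, 5 and 1 times respectively.
median_length = weighted_median_sketch([3, 1, 2], [1, 5, 1])
```

Here the value 1 carries more than half of the total weight, so it is the weighted median.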


hgoldman5959 commented 2 years ago

When running a ProfileReport on a simple DataFrame of shape (165, 26), I get this error consistently, but on random fields.

aronvandepol commented 2 years ago

Seems to be the same issue as #911.

hgoldman5959 commented 2 years ago

Thank you, I will give `minimal=False` a try.


sbrugman commented 2 years ago

Duplicate of #911 and #954