ydataai / ydata-profiling

1 Line of code data quality profiling & exploratory data analysis for Pandas and Spark DataFrames.
https://docs.profiling.ydata.ai
MIT License

ProfileReport produces error output when using arg minimal=False #923

Closed rat-nick closed 2 years ago

rat-nick commented 2 years ago

Describe the bug

Calling ProfileReport with the argument minimal=False (or leaving it at its default value) produces an error output, while using minimal=True produces normal output.

To Reproduce

DataFrame structure:

```
RangeIndex: 211231 entries, 0 to 211230
Data columns (total 3 columns):
 #   Column  Non-Null Count   Dtype
---  ------  --------------   -----
 0   user    211231 non-null  int64
 1   film    211231 non-null  int64
 2   rating  211231 non-null  int64
dtypes: int64(3)
memory usage: 4.8 MB
```

Code is run inside a Jupyter notebook inside of VS Code:

```python
import pandas as pd
import pandas_profiling as pp

rating5_df = pd.read_csv("./data/trainRatings5.txt", sep="\t", header=None)
rating5_df = rating5_df.rename(columns={0: "user", 1: "film", 2: "rating"})
pp.ProfileReport(rating5_df)
```

And the output it produces is:

```python
Summarize dataset: 0%| | 0/8 [00:00
    343     return method()
    344 return None
    345 else:

File ~/projects/master-rad/env/lib/python3.8/site-packages/pandas_profiling/profile_report.py:418, in ProfileReport._repr_html_(self)
    416 def _repr_html_(self) -> None:
    417     """The ipython notebook widgets user interface gets called by the jupyter notebook."""
--> 418     self.to_notebook_iframe()

File ~/projects/master-rad/env/lib/python3.8/site-packages/pandas_profiling/profile_report.py:398, in ProfileReport.to_notebook_iframe(self)
    396 with warnings.catch_warnings():
    397     warnings.simplefilter("ignore")
--> 398     display(get_notebook_iframe(self.config, self))

File ~/projects/master-rad/env/lib/python3.8/site-packages/pandas_profiling/report/presentation/flavours/widget/notebook.py:75, in get_notebook_iframe(config, profile)
     73     output = get_notebook_iframe_src(config, profile)
     74 elif attribute == IframeAttribute.srcdoc:
---> 75     output = get_notebook_iframe_srcdoc(config, profile)
     76 else:
     77     raise ValueError(
     78         f'Iframe Attribute can be "src" or "srcdoc" (current: {attribute}).'
     79     )

File ~/projects/master-rad/env/lib/python3.8/site-packages/pandas_profiling/report/presentation/flavours/widget/notebook.py:29, in get_notebook_iframe_srcdoc(config, profile)
     27 width = config.notebook.iframe.width
     28 height = config.notebook.iframe.height
---> 29 src = html.escape(profile.to_html())
     31 iframe = f''
     33 return HTML(iframe)

File ~/projects/master-rad/env/lib/python3.8/site-packages/pandas_profiling/profile_report.py:368, in ProfileReport.to_html(self)
    360 def to_html(self) -> str:
    361     """Generate and return complete template as lengthy string
    362         for using with frameworks.
    363
   (...)
    366
    367     """
--> 368     return self.html

File ~/projects/master-rad/env/lib/python3.8/site-packages/pandas_profiling/profile_report.py:185, in ProfileReport.html(self)
    182 @property
    183 def html(self) -> str:
    184     if self._html is None:
--> 185         self._html = self._render_html()
    186     return self._html

File ~/projects/master-rad/env/lib/python3.8/site-packages/pandas_profiling/profile_report.py:287, in ProfileReport._render_html(self)
    284 def _render_html(self) -> str:
    285     from pandas_profiling.report.presentation.flavours import HTMLReport
--> 287     report = self.report
    289     with tqdm(
    290         total=1, desc="Render HTML", disable=not self.config.progress_bar
    291     ) as pbar:
    292         html = HTMLReport(copy.deepcopy(report)).render(
    293             nav=self.config.html.navbar_show,
    294             offline=self.config.html.use_local_assets,
   (...)
    302             version=self.description_set["package"]["pandas_profiling_version"],
    303         )

File ~/projects/master-rad/env/lib/python3.8/site-packages/pandas_profiling/profile_report.py:179, in ProfileReport.report(self)
    176 @property
    177 def report(self) -> Root:
    178     if self._report is None:
--> 179         self._report = get_report_structure(self.config, self.description_set)
    180     return self._report

File ~/projects/master-rad/env/lib/python3.8/site-packages/pandas_profiling/profile_report.py:161, in ProfileReport.description_set(self)
    158 @property
    159 def description_set(self) -> Dict[str, Any]:
    160     if self._description_set is None:
--> 161         self._description_set = describe_df(
    162             self.config,
    163             self.df,
    164             self.summarizer,
    165             self.typeset,
    166             self._sample,
    167         )
    168     return self._description_set

File ~/projects/master-rad/env/lib/python3.8/site-packages/pandas_profiling/model/describe.py:71, in describe(config, df, summarizer, typeset, sample)
     69 # Variable-specific
     70 pbar.total += len(df.columns)
---> 71 series_description = get_series_descriptions(
     72     config, df, summarizer, typeset, pbar
     73 )
     75 pbar.set_postfix_str("Get variable types")
     76 pbar.total += 1

File ~/projects/master-rad/env/lib/python3.8/site-packages/multimethod/__init__.py:300, in multimethod.__call__(self, *args, **kwargs)
    298 func = self[tuple(func(arg) for func, arg in zip(self.type_checkers, args))]
    299 try:
--> 300     return func(*args, **kwargs)
    301 except TypeError as ex:
    302     raise DispatchError(f"Function {func.__code__}") from ex

File ~/projects/master-rad/env/lib/python3.8/site-packages/pandas_profiling/model/pandas/summary_pandas.py:92, in pandas_get_series_descriptions(config, df, summarizer, typeset, pbar)
     89 else:
     90     # TODO: use `Pool` for Linux-based systems
     91     with multiprocessing.pool.ThreadPool(pool_size) as executor:
---> 92         for i, (column, description) in enumerate(
     93             executor.imap_unordered(multiprocess_1d, args)
     94         ):
     95             pbar.set_postfix_str(f"Describe variable:{column}")
     96             series_description[column] = description

File /usr/lib/python3.8/multiprocessing/pool.py:868, in IMapIterator.next(self, timeout)
    866 if success:
    867     return value
--> 868 raise value

File /usr/lib/python3.8/multiprocessing/pool.py:125, in worker(inqueue, outqueue, initializer, initargs, maxtasks, wrap_exception)
    123 job, i, func, args, kwds = task
    124 try:
--> 125     result = (True, func(*args, **kwds))
    126 except Exception as e:
    127     if wrap_exception and func is not _helper_reraises_exception:

File ~/projects/master-rad/env/lib/python3.8/site-packages/pandas_profiling/model/pandas/summary_pandas.py:72, in pandas_get_series_descriptions..multiprocess_1d(args)
     62 """Wrapper to process series in parallel.
     63
     64 Args:
   (...)
     69     A tuple with column and the series description.
     70 """
     71 column, series = args
---> 72 return column, describe_1d(config, series, summarizer, typeset)

File ~/projects/master-rad/env/lib/python3.8/site-packages/multimethod/__init__.py:300, in multimethod.__call__(self, *args, **kwargs)
    298 func = self[tuple(func(arg) for func, arg in zip(self.type_checkers, args))]
    299 try:
--> 300     return func(*args, **kwargs)
    301 except TypeError as ex:
    302     raise DispatchError(f"Function {func.__code__}") from ex

File ~/projects/master-rad/env/lib/python3.8/site-packages/pandas_profiling/model/pandas/summary_pandas.py:50, in pandas_describe_1d(config, series, summarizer, typeset)
     45 else:
     46     # Detect variable types from pandas dataframe (df.dtypes).
     47     # [new dtypes, changed using `astype` function are now considered]
     48     vtype = typeset.detect_type(series)
---> 50 return summarizer.summarize(config, series, dtype=vtype)

File ~/projects/master-rad/env/lib/python3.8/site-packages/pandas_profiling/model/summarizer.py:37, in BaseSummarizer.summarize(self, config, series, dtype)
     29 def summarize(
     30     self, config: Settings, series: pd.Series, dtype: Type[VisionsBaseType]
     31 ) -> dict:
     32     """
     33
     34     Returns:
     35         object:
     36     """
---> 37     _, _, summary = self.handle(str(dtype), config, series, {"type": str(dtype)})
     38     return summary

File ~/projects/master-rad/env/lib/python3.8/site-packages/pandas_profiling/model/handler.py:62, in Handler.handle(self, dtype, *args, **kwargs)
     60 funcs = self.mapping.get(dtype, [])
     61 op = compose(funcs)
---> 62 return op(*args)

File ~/projects/master-rad/env/lib/python3.8/site-packages/pandas_profiling/model/handler.py:21, in compose..func..func2(*x)
     19     return f(*x)
     20 else:
---> 21     return f(*res)

File ~/projects/master-rad/env/lib/python3.8/site-packages/pandas_profiling/model/handler.py:21, in compose..func..func2(*x)
     19     return f(*x)
     20 else:
---> 21     return f(*res)

File ~/projects/master-rad/env/lib/python3.8/site-packages/pandas_profiling/model/handler.py:21, in compose..func..func2(*x)
     19     return f(*x)
     20 else:
---> 21     return f(*res)

File ~/projects/master-rad/env/lib/python3.8/site-packages/pandas_profiling/model/handler.py:17, in compose..func..func2(*x)
     16 def func2(*x) -> Any:
---> 17     res = g(*x)
     18     if type(res) == bool:
     19         return f(*x)

File ~/projects/master-rad/env/lib/python3.8/site-packages/multimethod/__init__.py:300, in multimethod.__call__(self, *args, **kwargs)
    298 func = self[tuple(func(arg) for func, arg in zip(self.type_checkers, args))]
    299 try:
--> 300     return func(*args, **kwargs)
    301 except TypeError as ex:
    302     raise DispatchError(f"Function {func.__code__}") from ex

File ~/projects/master-rad/env/lib/python3.8/site-packages/pandas_profiling/model/summary_algorithms.py:65, in series_hashable..inner(config, series, summary)
     63 if not summary["hashable"]:
     64     return config, series, summary
---> 65 return fn(config, series, summary)

File ~/projects/master-rad/env/lib/python3.8/site-packages/pandas_profiling/model/summary_algorithms.py:82, in series_handle_nulls..inner(config, series, summary)
     79 if series.hasnans:
     80     series = series.dropna()
---> 82 return fn(config, series, summary)

File ~/projects/master-rad/env/lib/python3.8/site-packages/pandas_profiling/model/pandas/describe_categorical_pandas.py:205, in pandas_describe_categorical_1d(config, series, summary)
    202 summary["chi_squared"] = chi_square(histogram=value_counts.values)
    204 if config.vars.cat.length:
--> 205     summary.update(length_summary_vc(value_counts))
    206     summary.update(
    207         histogram_compute(
    208             config,
   (...)
    213         )
    214     )
    216 if config.vars.cat.characters:

File ~/projects/master-rad/env/lib/python3.8/site-packages/pandas_profiling/model/pandas/describe_categorical_pandas.py:162, in length_summary_vc(vc)
    156 length_counts = length_counts.groupby(level=0, sort=False).sum()
    157 length_counts = length_counts.sort_values(ascending=False)
    159 summary = {
    160     "max_length": np.max(length_counts.index),
    161     "mean_length": np.average(length_counts.index, weights=length_counts.values),
--> 162     "median_length": weighted_median(
    163         length_counts.index.values, weights=length_counts.values
    164     ),
    165     "min_length": np.min(length_counts.index),
    166     "length_histogram": length_counts,
    167 }
    169 return summary

File ~/projects/master-rad/env/lib/python3.8/site-packages/pandas_profiling/model/pandas/utils_pandas.py:13, in weighted_median(data, weights)
     11 midpoint = 0.5 * sum(s_weights)
     12 if any(weights > midpoint):
---> 13     w_median = (data[weights == np.max(weights)])[0]
     14 else:
     15     cs_weights = np.cumsum(s_weights)

IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices
```

**Version information:**

```
Python 3.8.10

IPython          : 8.0.1
ipykernel        : 6.9.0
ipywidgets       : 7.6.5
jupyter_client   : 7.1.2
jupyter_core     : 4.9.1
jupyter_server   : 1.13.5
jupyterlab       : 3.2.9
nbclient         : 0.5.10
nbconvert        : 6.4.2
nbformat         : 5.1.3
notebook         : 6.4.8
qtconsole        : not installed
traitlets        : 5.1.1

anyio==3.5.0
argon2-cffi==21.3.0
argon2-cffi-bindings==21.2.0
asttokens==2.0.5
attrs==21.4.0
Babel==2.9.1
backcall==0.2.0
black==22.1.0
bleach==4.1.0
certifi==2021.10.8
cffi==1.15.0
charset-normalizer==2.0.11
click==8.0.3
cycler==0.11.0
debugpy==1.5.1
decorator==5.1.1
defusedxml==0.7.1
entrypoints==0.4
executing==0.8.2
fonttools==4.29.1
htmlmin==0.1.12
idna==3.3
ImageHash==4.2.1
importlib-resources==5.4.0
ipykernel==6.9.0
ipython==8.0.1
ipython-genutils==0.2.0
ipywidgets==7.6.5
jedi==0.18.1
Jinja2==3.0.3
joblib==1.0.1
json5==0.9.6
jsonschema==4.4.0
jupyter-client==7.1.2
jupyter-core==4.9.1
jupyter-server==1.13.5
jupyterlab==3.2.9
jupyterlab-pygments==0.1.2
jupyterlab-server==2.10.3
jupyterlab-widgets==1.0.2
kiwisolver==1.3.2
MarkupSafe==2.0.1
matplotlib==3.5.1
matplotlib-inline==0.1.3
missingno==0.5.0
mistune==0.8.4
multimethod==1.7
mypy-extensions==0.4.3
nbclassic==0.3.5
nbclient==0.5.10
nbconvert==6.4.2
nbformat==5.1.3
nest-asyncio==1.5.4
networkx==2.6.3
notebook==6.4.8
numpy==1.22.2
packaging==21.3
pandas==1.4.0
pandas-profiling==3.1.0
pandocfilters==1.5.0
parso==0.8.3
pathspec==0.9.0
pexpect==4.8.0
phik==0.12.0
pickleshare==0.7.5
Pillow==9.0.1
platformdirs==2.5.0
prometheus-client==0.13.1
prompt-toolkit==3.0.27
ptyprocess==0.7.0
pure-eval==0.2.2
pycparser==2.21
pydantic==1.9.0
Pygments==2.11.2
pyparsing==3.0.7
pyrsistent==0.18.1
python-dateutil==2.8.2
pytz==2021.3
PyWavelets==1.2.0
PyYAML==6.0
pyzmq==22.3.0
requests==2.27.1
scipy==1.8.0
seaborn==0.11.2
Send2Trash==1.8.0
six==1.16.0
sniffio==1.2.0
stack-data==0.1.4
tangled-up-in-unicode==0.1.0
terminado==0.13.1
testpath==0.5.0
tomli==2.0.1
tornado==6.1
tqdm==4.62.3
traitlets==5.1.1
typing_extensions==4.0.1
urllib3==1.26.8
visions==0.7.4
wcwidth==0.2.5
webencodings==0.5.1
websocket-client==1.2.3
widgetsnbextension==3.5.2
zipp==3.7.0
```

Additional context

I have provided the file that contains the dataframe in question.

trainRatings5.txt

qquppsala commented 2 years ago

Hi! A fix for this issue is described here:

https://github.com/ydataai/pandas-profiling/issues/911

At least it helped me.

jfsantos-ds commented 2 years ago

Given the merge of #945, I consider this issue fixed, so it can be closed.
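
For readers following the traceback: the frame that fails, `weighted_median(data, weights)`, is called to compute the median string length from a table of length counts. Below is a minimal standalone sketch of that kind of computation in plain Python. It is not the library's implementation and not the change merged in #945; the function name `weighted_median_sketch` is hypothetical, and the sketch assumes positive weights and returns the lower weighted median (the first value whose cumulative weight reaches half of the total).

```python
from itertools import accumulate


def weighted_median_sketch(values, weights):
    """Return the lower weighted median of `values`.

    Hypothetical helper for illustration only; it does not reproduce
    the pandas-profiling code shown in the traceback.
    """
    if len(values) != len(weights) or len(values) == 0:
        raise ValueError("values and weights must be non-empty and of equal length")

    # Sort the (value, weight) pairs together so each weight stays
    # attached to its own value.
    pairs = sorted(zip(values, weights))
    sorted_values = [value for value, _ in pairs]
    cumulative = list(accumulate(weight for _, weight in pairs))

    # The weighted median is the first value whose running weight
    # reaches half of the total weight.
    midpoint = 0.5 * cumulative[-1]
    for value, running_total in zip(sorted_values, cumulative):
        if running_total >= midpoint:
            return value


# Example: string lengths 1, 2 and 3 occurring 5, 1 and 1 times.
print(weighted_median_sketch([1, 2, 3], [5, 1, 1]))
```

Sorting the values and weights as pairs, rather than independently, keeps each count aligned with the length it belongs to; that is the main design choice in this sketch.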