ydataai / ydata-profiling

1 Line of code data quality profiling & exploratory data analysis for Pandas and Spark DataFrames.
https://docs.profiling.ydata.ai
MIT License

TerminatedWorkerError #476

Closed kraxli closed 1 year ago

kraxli commented 4 years ago

following up on #456

I am running into a TerminatedWorkerError.

Minimal example:

import pandas as pd
import pandas_profiling

plannet_data = pd.read_csv('https://github.com/mwaskom/seaborn-data/blob/master/raw/planets.csv')
display(plannet_data) # ok
plannet_data.profile_report()

Returns the error:

TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker. The exit codes of the workers are {EXIT(1)}

Environment:

Linux Mint 19.3 Tricia, base: Ubuntu 18.04 bionic
RAM 16 GB

python 3.6.9

astropy==4.0.1.post1
async-generator==1.10
attrs==19.3.0
autopep8==1.5.2
backcall==0.1.0
bleach==3.1.5
certifi==2020.4.5.1
chardet==3.0.4
confuse==1.1.0
cycler==0.10.0
decorator==4.4.2
defusedxml==0.6.0
descartes==1.1.0
entrypoints==0.3
htmlmin==0.1.12
idna==2.9
ImageHash==4.1.0
importlib-metadata==1.6.0
ipykernel==5.2.1
ipython==7.14.0
ipython-genutils==0.2.0
ipywidgets==7.5.1
jedi==0.17.0
Jinja2==2.11.2
joblib==0.15.1
json5==0.9.4
jsonschema==3.2.0
jupyter-client==6.1.3
jupyter-core==4.6.3
jupyter-server==0.1.1
jupyterlab==2.1.2
jupyterlab-pygments==0.1.1
jupyterlab-server==1.1.4
kiwisolver==1.2.0
llvmlite==0.32.1
MarkupSafe==1.1.1
matplotlib==3.2.1
missingno==0.4.2
mistune==0.8.4
mizani==0.6.0
nbconvert==5.6.1
nbformat==5.0.6
networkx==2.4
notebook==6.0.3
numba==0.49.1
numpy==1.18.4
packaging==20.3
palettable==3.3.0
pandas==1.0.3
pandas-profiling==2.8.0
pandocfilters==1.4.2
parso==0.7.0
patsy==0.5.1
pexpect==4.8.0
phik==0.9.12
pickleshare==0.7.5
Pillow==7.1.2
plotnine==0.6.0
prometheus-client==0.7.1
prompt-toolkit==3.0.5
ptyprocess==0.6.0
pycodestyle==2.6.0
Pygments==2.6.1
pyparsing==2.4.7
pyrsistent==0.16.0
python-dateutil==2.8.1
pytz==2020.1
PyWavelets==1.1.1
PyYAML==5.3.1
pyzmq==19.0.1
requests==2.23.0
scipy==1.4.1
seaborn==0.10.1
Send2Trash==1.5.0
six==1.14.0
statsmodels==0.11.1
tangled-up-in-unicode==0.0.6
terminado==0.8.3
testpath==0.4.4
tornado==6.0.4
tqdm==4.46.0
traitlets==4.3.3
urllib3==1.25.9
visions==0.4.4
voila==0.1.21
wcwidth==0.1.9
webencodings==0.5.1
widgetsnbextension==3.5.1
zipp==3.1.0

thanks David

github-actions[bot] commented 4 years ago

Stale issue

varshithvvs commented 3 years ago

I'm facing the same issue when running this in my Jupyter notebook. Is there a fix?

My Code (Sample Version):

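import numpy as np
import pandas as pd
from pandas_profiling import ProfileReport

# 1,000,000 rows x 5 columns of uniform random floats
df = pd.DataFrame(
    np.random.rand(1000000, 5),
    columns=['a', 'b', 'c', 'd', 'e']
)

profile = ProfileReport(df, title='Pandas Profiling Report')
profile.to_notebook_iframe()

Error after executing the code: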

Summarize dataset: 50% 9/18 [00:18<00:14, 1.63s/it, Calculate phi_k correlation]
exception calling callback for <Future at 0x7fe3084a3310 state=finished raised TerminatedWorkerError>
Traceback (most recent call last):
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/joblib/externals/loky/_base.py", line 625, in _invoke_callbacks
    callback(self)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/joblib/parallel.py", line 359, in __call__
    self.parallel.dispatch_next()
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/joblib/parallel.py", line 792, in dispatch_next
    if not self.dispatch_one_batch(self._original_iterator):
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/joblib/parallel.py", line 859, in dispatch_one_batch
    self._dispatch(tasks)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/joblib/parallel.py", line 777, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/joblib/_parallel_backends.py", line 531, in apply_async
    future = self._workers.submit(SafeFunction(func))
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/joblib/externals/loky/reusable_executor.py", line 178, in submit
    fn, *args, **kwargs)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/joblib/externals/loky/process_executor.py", line 1102, in submit
    raise self._flags.broken
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/joblib/parallel.py", line 1044, in __call__
    while self.dispatch_one_batch(iterator):
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/joblib/parallel.py", line 859, in dispatch_one_batch
    self._dispatch(tasks)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/joblib/parallel.py", line 777, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/joblib/_parallel_backends.py", line 531, in apply_async
    future = self._workers.submit(SafeFunction(func))
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/joblib/externals/loky/reusable_executor.py", line 178, in submit
    fn, *args, **kwargs)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/joblib/externals/loky/process_executor.py", line 1102, in submit
    raise self._flags.broken
joblib.externals.loky.process_executor.TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker.

The exit codes of the workers are {EXIT(1), EXIT(1), EXIT(1)}
exception calling callback for <Future at 0x7fe3084a6950 state=finished raised TerminatedWorkerError>
Traceback (most recent call last):
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/joblib/externals/loky/_base.py", line 625, in _invoke_callbacks
    callback(self)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/joblib/parallel.py", line 359, in __call__
    self.parallel.dispatch_next()
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/joblib/parallel.py", line 792, in dispatch_next
    if not self.dispatch_one_batch(self._original_iterator):
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/joblib/parallel.py", line 859, in dispatch_one_batch
    self._dispatch(tasks)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/joblib/parallel.py", line 777, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/joblib/_parallel_backends.py", line 531, in apply_async
    future = self._workers.submit(SafeFunction(func))
AttributeError: 'NoneType' object has no attribute 'submit'
exception calling callback for <Future at 0x7fe3084a6850 state=finished raised TerminatedWorkerError>
Traceback (most recent call last):
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/joblib/externals/loky/_base.py", line 625, in _invoke_callbacks
    callback(self)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/joblib/parallel.py", line 359, in __call__
    self.parallel.dispatch_next()
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/joblib/parallel.py", line 792, in dispatch_next
    if not self.dispatch_one_batch(self._original_iterator):
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/joblib/parallel.py", line 859, in dispatch_one_batch
    self._dispatch(tasks)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/joblib/parallel.py", line 777, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/joblib/_parallel_backends.py", line 531, in apply_async
    future = self._workers.submit(SafeFunction(func))
AttributeError: 'NoneType' object has no attribute 'submit'
exception calling callback for <Future at 0x7fe3084a6750 state=finished raised TerminatedWorkerError>
Traceback (most recent call last):
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/joblib/externals/loky/_base.py", line 625, in _invoke_callbacks
    callback(self)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/joblib/parallel.py", line 359, in __call__
    self.parallel.dispatch_next()
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/joblib/parallel.py", line 792, in dispatch_next
    if not self.dispatch_one_batch(self._original_iterator):
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/joblib/parallel.py", line 859, in dispatch_one_batch
    self._dispatch(tasks)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/joblib/parallel.py", line 777, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/joblib/_parallel_backends.py", line 531, in apply_async
    future = self._workers.submit(SafeFunction(func))
AttributeError: 'NoneType' object has no attribute 'submit'
exception calling callback for <Future at 0x7fe3084a6fd0 state=finished raised TerminatedWorkerError>
Traceback (most recent call last):
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/joblib/externals/loky/_base.py", line 625, in _invoke_callbacks
    callback(self)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/joblib/parallel.py", line 359, in __call__
    self.parallel.dispatch_next()
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/joblib/parallel.py", line 792, in dispatch_next
    if not self.dispatch_one_batch(self._original_iterator):
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/joblib/parallel.py", line 859, in dispatch_one_batch
    self._dispatch(tasks)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/joblib/parallel.py", line 777, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/joblib/_parallel_backends.py", line 531, in apply_async
    future = self._workers.submit(SafeFunction(func))
AttributeError: 'NoneType' object has no attribute 'submit'
exception calling callback for <Future at 0x7fe3084a6610 state=finished raised TerminatedWorkerError>
Traceback (most recent call last):
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/joblib/externals/loky/_base.py", line 625, in _invoke_callbacks
    callback(self)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/joblib/parallel.py", line 359, in __call__
    self.parallel.dispatch_next()
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/joblib/parallel.py", line 792, in dispatch_next
    if not self.dispatch_one_batch(self._original_iterator):
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/joblib/parallel.py", line 859, in dispatch_one_batch
    self._dispatch(tasks)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/joblib/parallel.py", line 777, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/joblib/_parallel_backends.py", line 531, in apply_async
    future = self._workers.submit(SafeFunction(func))
AttributeError: 'NoneType' object has no attribute 'submit'
---------------------------------------------------------------------------
TerminatedWorkerError                     Traceback (most recent call last)
<ipython-input-5-3827eec15fb0> in <module>
      1 #profile = ProfileReport(df._to_pandas(), title='Pandas Profiling Report')
      2 profile = ProfileReport(df, title='Pandas Profiling Report')
----> 3 profile.to_notebook_iframe()

/opt/conda/envs/rapids/lib/python3.7/site-packages/pandas_profiling/profile_report.py in to_notebook_iframe(self)
    400         with warnings.catch_warnings():
    401             warnings.simplefilter("ignore")
--> 402             display(get_notebook_iframe(self.config, self))
    403 
    404     def to_widgets(self) -> None:

/opt/conda/envs/rapids/lib/python3.7/site-packages/pandas_profiling/report/presentation/flavours/widget/notebook.py in get_notebook_iframe(config, profile)
     73         output = get_notebook_iframe_src(config, profile)
     74     elif attribute == IframeAttribute.srcdoc:
---> 75         output = get_notebook_iframe_srcdoc(config, profile)
     76     else:
     77         raise ValueError(

/opt/conda/envs/rapids/lib/python3.7/site-packages/pandas_profiling/report/presentation/flavours/widget/notebook.py in get_notebook_iframe_srcdoc(config, profile)
     27     width = config.notebook.iframe.width
     28     height = config.notebook.iframe.height
---> 29     src = html.escape(profile.to_html())
     30 
     31     iframe = f'<iframe width="{width}" height="{height}" srcdoc="{src}" frameborder="0" allowfullscreen></iframe>'

/opt/conda/envs/rapids/lib/python3.7/site-packages/pandas_profiling/profile_report.py in to_html(self)
    370 
    371         """
--> 372         return self.html
    373 
    374     def to_json(self) -> str:

/opt/conda/envs/rapids/lib/python3.7/site-packages/pandas_profiling/profile_report.py in html(self)
    187     def html(self) -> str:
    188         if self._html is None:
--> 189             self._html = self._render_html()
    190         return self._html
    191 

/opt/conda/envs/rapids/lib/python3.7/site-packages/pandas_profiling/profile_report.py in _render_html(self)
    289         from pandas_profiling.report.presentation.flavours import HTMLReport
    290 
--> 291         report = self.report
    292 
    293         with tqdm(

/opt/conda/envs/rapids/lib/python3.7/site-packages/pandas_profiling/profile_report.py in report(self)
    181     def report(self) -> Root:
    182         if self._report is None:
--> 183             self._report = get_report_structure(self.config, self.description_set)
    184         return self._report
    185 

/opt/conda/envs/rapids/lib/python3.7/site-packages/pandas_profiling/profile_report.py in description_set(self)
    168                 self.summarizer,
    169                 self.typeset,
--> 170                 self._sample,
    171             )
    172         return self._description_set

/opt/conda/envs/rapids/lib/python3.7/site-packages/pandas_profiling/model/describe.py in describe(config, df, summarizer, typeset, sample)
     98             pbar.set_postfix_str(f"Calculate {correlation_name} correlation")
     99             correlations[correlation_name] = calculate_correlation(
--> 100                 config, df, correlation_name, series_description
    101             )
    102             pbar.update()

/opt/conda/envs/rapids/lib/python3.7/site-packages/pandas_profiling/model/correlations.py in calculate_correlation(config, df, correlation_name, summary)
    183     try:
    184         correlation = correlation_measures[correlation_name].compute(
--> 185             config, df, summary
    186         )
    187     except (ValueError, AssertionError, TypeError, DataError, IndexError) as e:

/opt/conda/envs/rapids/lib/python3.7/site-packages/pandas_profiling/model/correlations.py in compute(config, df, summary)
    138             from phik import phik_matrix
    139 
--> 140             correlation = phik_matrix(df[selcols], interval_cols=list(intcols))
    141 
    142         return correlation

/opt/conda/envs/rapids/lib/python3.7/site-packages/phik/phik.py in phik_matrix(df, interval_cols, bins, quantile, noise_correction, dropna, drop_underflow, drop_overflow, verbose)
    217 
    218     return phik_from_rebinned_df(
--> 219         data_binned, noise_correction, dropna=dropna, drop_underflow=drop_underflow, drop_overflow=drop_overflow
    220     )
    221 

/opt/conda/envs/rapids/lib/python3.7/site-packages/phik/phik.py in phik_from_rebinned_df(data_binned, noise_correction, dropna, drop_underflow, drop_overflow)
    143         phik_list = Parallel(n_jobs=NCORES)(
    144             delayed(_calc_phik)(co, data_binned[list(co)], noise_correction)
--> 145             for co in itertools.combinations_with_replacement(data_binned.columns.values, 2)
    146         )
    147 

/opt/conda/envs/rapids/lib/python3.7/site-packages/joblib/parallel.py in __call__(self, iterable)
   1042                 self._iterating = self._original_iterator is not None
   1043 
-> 1044             while self.dispatch_one_batch(iterator):
   1045                 pass
   1046 

/opt/conda/envs/rapids/lib/python3.7/site-packages/joblib/parallel.py in dispatch_one_batch(self, iterator)
    857                 return False
    858             else:
--> 859                 self._dispatch(tasks)
    860                 return True
    861 

/opt/conda/envs/rapids/lib/python3.7/site-packages/joblib/parallel.py in _dispatch(self, batch)
    775         with self._lock:
    776             job_idx = len(self._jobs)
--> 777             job = self._backend.apply_async(batch, callback=cb)
    778             # A job can complete so quickly than its callback is
    779             # called before we get here, causing self._jobs to

/opt/conda/envs/rapids/lib/python3.7/site-packages/joblib/_parallel_backends.py in apply_async(self, func, callback)
    529     def apply_async(self, func, callback=None):
    530         """Schedule a func to be run"""
--> 531         future = self._workers.submit(SafeFunction(func))
    532         future.get = functools.partial(self.wrap_future_result, future)
    533         if callback is not None:

/opt/conda/envs/rapids/lib/python3.7/site-packages/joblib/externals/loky/reusable_executor.py in submit(self, fn, *args, **kwargs)
    176         with self._submit_resize_lock:
    177             return super(_ReusablePoolExecutor, self).submit(
--> 178                 fn, *args, **kwargs)
    179 
    180     def _resize(self, max_workers):

/opt/conda/envs/rapids/lib/python3.7/site-packages/joblib/externals/loky/process_executor.py in submit(self, fn, *args, **kwargs)
   1100         with self._flags.shutdown_lock:
   1101             if self._flags.broken is not None:
-> 1102                 raise self._flags.broken
   1103             if self._flags.shutdown:
   1104                 raise ShutdownExecutorError(

TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker.

The exit codes of the workers are {EXIT(1), EXIT(1), EXIT(1)}

Environment:

Jupyter Notebook

pip 21.1.1 from /opt/conda/envs/rapids/lib/python3.7/site-packages/pip (python 3.7)
Package                           Version
--------------------------------- ------------------------
absl-py                           0.12.0
aiobotocore                       1.3.0
aiohttp                           3.7.4
aioitertools                      0.7.1
altgraph                          0.17
anyio                             2.2.0
appdirs                           1.4.4
argon2-cffi                       20.1.0
astunparse                        1.6.3
async-generator                   1.10
async-timeout                     3.0.1
attrs                             20.3.0
backcall                          0.2.0
backports.functools-lru-cache     1.6.4
blazingsql                        0.19.0a0
bleach                            3.3.0
bokeh                             2.2.3
botocore                          1.20.49
Bottleneck                        1.3.2
brotlipy                          0.7.0
bsql-engine                       0.6
cached-property                   1.5.2
cachetools                        4.2.2
certifi                           2020.12.5
cffi                              1.14.5
chardet                           4.0.0
click                             7.1.2
click-plugins                     1.1.1
cligj                             0.7.1
cloudpickle                       1.6.0
colorcet                          2.0.6
confluent-kafka                   1.5.0
cryptography                      3.4.7
cudf                              0.19.2
cudf-kafka                        0.19.2
cugraph                           0.19.0+0.gd72b90b0.dirty
cuml                              0.19.0
cupy                              8.6.0
cusignal                          0.19.0
cuspatial                         0.19.0
custreamz                         0.19.2
cuxfilter                         0.19.1
cycler                            0.10.0
Cython                            0.29.23
cytoolz                           0.11.0
dask                              2021.4.0
dask-cuda                         0.19.0
dask-cudf                         0.19.2
dask-glm                          0.2.0
dask-labextension                 4.0.1
dask-ml                           1.8.0
datashader                        0.11.1
datashape                         0.5.4
decorator                         4.4.2
defusedxml                        0.7.1
deprecation                       2.1.0
distributed                       2021.4.0
entrypoints                       0.3
fa2                               0.3.5
fastavro                          1.4.0
fastrlock                         0.6
filterpy                          1.4.5
Fiona                             1.8.19
flatbuffers                       1.12
fsspec                            2021.4.0
future                            0.18.2
gast                              0.4.0
GDAL                              3.2.2
geopandas                         0.8.1
google-auth                       1.30.0
google-auth-oauthlib              0.4.4
google-pasta                      0.2.0
greenlet                          1.0.0
grpcio                            1.34.1
h5py                              3.1.0
HeapDict                          1.0.1
holoviews                         1.14.3
htmlmin                           0.1.12
idna                              2.10
imagecodecs                       2021.3.31
ImageHash                         4.2.0
imageio                           2.9.0
importlib-metadata                3.10.1
iniconfig                         1.1.1
ipykernel                         5.5.3
ipython                           7.15.0
ipython-genutils                  0.2.0
ipywidgets                        7.6.3
jedi                              0.17.2
Jinja2                            2.11.3
jmespath                          0.10.0
joblib                            1.0.1
JPype1                            1.2.1
json5                             0.9.5
jsonschema                        3.2.0
jupyter-client                    6.1.12
jupyter-contrib-core              0.3.3
jupyter-contrib-nbextensions      0.5.1
jupyter-core                      4.7.1
jupyter-highlight-selected-word   0.2.0
jupyter-latex-envs                1.4.6
jupyter-nbextensions-configurator 0.4.1
jupyter-packaging                 0.9.2
jupyter-server                    1.6.4
jupyter-server-proxy              3.0.2
jupyterlab                        2.1.5
jupyterlab-nvdashboard            0.5.0
jupyterlab-pygments               0.1.2
jupyterlab-server                 1.2.0
jupyterlab-widgets                1.0.0
keras-nightly                     2.5.0.dev2021032900
Keras-Preprocessing               1.1.2
kiwisolver                        1.3.1
llvmlite                          0.36.0
locket                            0.2.0
lxml                              4.6.3
Markdown                          3.3.4
MarkupSafe                        1.1.1
matplotlib                        3.4.2
missingno                         0.4.2
mistune                           0.8.4
modin                             0.9.1
more-itertools                    8.7.0
msgpack                           1.0.2
multidict                         5.1.0
multimethod                       1.4
multipledispatch                  0.6.0
munch                             2.5.0
nbclient                          0.5.3
nbconvert                         6.0.7
nbformat                          5.1.3
nest-asyncio                      1.5.1
netifaces                         0.10.9
networkx                          2.5.1
notebook                          6.4.0
numba                             0.53.1
numpy                             1.19.5
nvtx                              0.2.3
oauthlib                          3.1.0
olefile                           0.46
opt-einsum                        3.3.0
packaging                         20.9
pandas                            1.2.3
pandas-profiling                  3.0.0
pandocfilters                     1.4.2
panel                             0.10.3
param                             1.10.1
parso                             0.7.1
partd                             1.2.0
patsy                             0.5.1
pexpect                           4.8.0
phik                              0.11.2
pickle5                           0.0.11
pickleshare                       0.7.5
Pillow                            8.1.2
pip                               21.1.1
pluggy                            0.13.1
pooch                             1.3.0
prometheus-client                 0.10.1
prompt-toolkit                    3.0.18
protobuf                          3.15.8
psutil                            5.8.0
ptyprocess                        0.7.0
py                                1.10.0
pyarrow                           1.0.1
pyasn1                            0.4.8
pyasn1-modules                    0.2.8
pycparser                         2.20
pyct                              0.4.6
pydantic                          1.8.2
pydeck                            0.5.0
pyee                              7.0.4
Pygments                          2.8.1
PyHive                            0.6.3
pyinstaller                       4.3
pyinstaller-hooks-contrib         2021.1
pynndescent                       0.5.2
pynvml                            8.0.4
pyOpenSSL                         20.0.1
pyparsing                         2.4.7
pypi                              2.1
pyppeteer                         0.2.2
pyproj                            3.0.1
pyrsistent                        0.17.3
PySocks                           1.7.1
pytest                            6.2.3
python-dateutil                   2.8.1
pytz                              2021.1
pyviz-comms                       2.0.1
PyWavelets                        1.1.1
PyYAML                            5.4.1
pyzmq                             22.0.3
requests                          2.25.1
requests-oauthlib                 1.3.0
rmm                               0.19.0
rsa                               4.7.2
Rtree                             0.9.7
s3fs                              2021.4.0
sasl                              0.2.1
scikeras                          0.3.3
scikit-image                      0.18.1
scikit-learn                      0.23.1
scipy                             1.6.0
seaborn                           0.11.1
Send2Trash                        1.5.0
setuptools                        56.2.0
Shapely                           1.7.1
simpervisor                       0.4
six                               1.15.0
sniffio                           1.2.0
sortedcontainers                  2.3.0
SQLAlchemy                        1.4.11
statsmodels                       0.12.2
streamz                           0.6.2
tangled-up-in-unicode             0.1.0
tblib                             1.7.0
tensorboard                       2.5.0
tensorboard-data-server           0.6.1
tensorboard-plugin-wit            1.8.0
tensorflow                        2.5.0
tensorflow-estimator              2.5.0
termcolor                         1.1.0
terminado                         0.9.4
testpath                          0.4.4
threadpoolctl                     2.1.0
thrift                            0.13.0
thrift-sasl                       0.4.2
tifffile                          2021.4.8
toml                              0.10.2
tomlkit                           0.7.0
toolz                             0.11.1
tornado                           6.1
tqdm                              4.60.0
traitlets                         5.0.5
treelite                          1.1.0
treelite-runtime                  1.1.0
typing-extensions                 3.7.4.3
ucx-py                            0.19.0
umap-learn                        0.5.1
urllib3                           1.26.4
visions                           0.7.1
wcwidth                           0.2.5
webencodings                      0.5.1
websockets                        8.1
Werkzeug                          2.0.1
wheel                             0.36.2
widgetsnbextension                3.5.1
wrapt                             1.12.1
xarray                            0.17.0
xgboost                           1.4.0
yapf                              0.31.0
yarl                              1.6.3
zict                              2.0.0
zipp                              3.4.1

Note:

  1. I also hit a PEP 517 issue while installing the library; installing build-essential solved it:
!{sys.executable} -m apt-get update && apt-get install -y build-essential
  2. Please let me know if I'm missing anything, and apologies if it is something basic. I'm a beginner with the internet as my guide.

vishalsrao commented 2 years ago

The issue occurs while parallelizing phik computation.

As a workaround, parallelization can be disabled by replacing phik.phik_matrix with a near-identical function whose default value of njobs is 1 instead of -1.

import numpy as np
import pandas as pd  # needed for the pd.DataFrame annotations below
import phik
from typing import Optional, Union
from phik.binning import auto_bin_data
from phik.phik import phik_from_rebinned_df

# Same as phik.phik_matrix except for the default value of njobs
def phik_matrix_nJobsDefVal(
    df: pd.DataFrame,
    interval_cols: Optional[list] = None,
    bins: Union[int, list, np.ndarray, dict] = 10,
    quantile: bool = False,
    noise_correction: bool = True,
    dropna: bool = True,
    drop_underflow: bool = True,
    drop_overflow: bool = True,
    verbose: bool = True,
    njobs: int = 1,
) -> pd.DataFrame:
    """
    Correlation matrix of bivariate gaussian derived from chi2-value
    Chi2-value gets converted into correlation coefficient of bivariate gauss
    with correlation value rho, assuming given binning and number of records.
    Correlation coefficient value is between 0 and 1.
    Bivariate gaussian's range is set to [-5,5] by construction.
    :param pd.DataFrame df: input data
    :param list interval_cols: column names of columns with interval variables.
    :param bins: number of bins, or a list of bin edges (same for all columns), or a dictionary where per column the bins are specified. (default=10)\
    E.g.: bins = {'mileage':5, 'driver_age':[18,25,35,45,55,65,125]}
    :param quantile: when bins is an integer, uniform bins (False) or bins based on quantiles (True)
    :param bool noise_correction: apply noise correction in phik calculation
    :param bool dropna: remove NaN values with True
    :param bool drop_underflow: do not take into account records in underflow bin when True (relevant when binning\
    a numeric variable)
    :param bool drop_overflow: do not take into account records in overflow bin when True (relevant when binning\
    a numeric variable)
    :param bool verbose: if False, do not print all interval columns that are guessed
    :param int njobs: number of parallel jobs used for calculation of phik. default is 1 here (-1 in phik itself). 1 uses no parallel jobs.
    :return: phik correlation matrix
    """

    data_binned, binning_dict = auto_bin_data(
        df=df,
        interval_cols=interval_cols,
        bins=bins,
        quantile=quantile,
        dropna=dropna,
        verbose=verbose,
    )
    return phik_from_rebinned_df(
        data_binned,
        noise_correction,
        dropna=dropna,
        drop_underflow=drop_underflow,
        drop_overflow=drop_overflow,
        njobs=njobs,
    )

# Monkey-patch phik so that later lookups of phik.phik_matrix get the serial default
phik.phik_matrix = phik_matrix_nJobsDefVal

kretes commented 2 years ago

Just bumped into this issue and the fix really works, thanks @vishalsrao. Shouldn't we be able to pass an argument, or configure via the environment, the number of jobs passed to phik_matrix, rather than replacing the method entirely (which will get out of sync at some point)?
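
For illustration only, here is a minimal sketch of what an environment-driven default could look like without copying the whole function. The helper name with_default_njobs and the variable PHIK_NJOBS are hypothetical: neither phik nor pandas-profiling reads such a variable. It also assumes the installed phik version accepts an njobs keyword, as the workaround above implies. A stand-in function is used so the sketch runs without phik installed.

```python
import functools
import os


def with_default_njobs(func, env_var="PHIK_NJOBS", fallback=1):
    """Wrap func so that njobs defaults to int(os.environ[env_var]).

    env_var is a hypothetical name chosen for this sketch. The variable is
    read on every call, so it can be changed without re-wrapping.
    """

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        # Only fill in njobs when the caller did not pass it as a keyword.
        # A caller passing njobs positionally would clash with this default.
        kwargs.setdefault("njobs", int(os.environ.get(env_var, fallback)))
        return func(*args, **kwargs)

    return wrapper


# Stand-in that mimics the njobs keyword of phik.phik_matrix (assumption).
def fake_phik_matrix(df, interval_cols=None, njobs=-1):
    return {"rows": len(df), "njobs": njobs}


patched = with_default_njobs(fake_phik_matrix)
result = patched([1, 2, 3])  # njobs falls back to 1 unless PHIK_NJOBS is set

# With phik installed, the same wrapper could be applied like this:
# import phik
# phik.phik_matrix = with_default_njobs(phik.phik_matrix)
```

Because the wrapper delegates to the original function, it would not drift out of sync with upstream signature changes the way a copied function body would, as long as the njobs keyword itself remains.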

sbrugman commented 2 years ago

@kretes Thanks for the bump Tomasz. If anyone is interested, feel free to contribute a PR!

aquemy commented 1 year ago

Hi,

We were not able to reproduce with the current version. My guess is that it is environment related.

The workaround proposed above consists of disabling the call to joblib.Parallel in the phik library, but it does not solve the underlying issue. You might want to report it to PhiK directly: https://github.com/KaveIO/PhiK

Feel free to re-open if you have a way to reproduce consistently.