ydataai / ydata-profiling

1 Line of code data quality profiling & exploratory data analysis for Pandas and Spark DataFrames.
https://docs.profiling.ydata.ai
MIT License

Does pandas-profiling work in Jupyter Notebooks on AWS? #1197

Open JohnTravolski opened 1 year ago

JohnTravolski commented 1 year ago

Does pandas-profiling work in Jupyter Notebooks on AWS? I understand there are a lot of configuration differences that can lead to issues, but whenever I run the following to produce a profiling report, I get the errors below:

from pandas_profiling import ProfileReport

profile = ProfileReport(df, title='myreport')
profile.to_file('s3://myfolder/myreport.html')
Summarize dataset:  97%|█████████▋| 427/438 [01:14<00:01,  8.03it/s, Calculate auto correlation]
/home/ec2-user/SageMaker/.envs/mykernel/lib/python3.9/site-packages/multimethod/__init__.py:315: FutureWarning: In a future version, `df.iloc[:, i] = newvals` will attempt to set the values inplace instead of always setting a new array. To retain the old behavior, use either `df[df.columns[i]] = newvals` or, if columns are non-unique, `df.isetitem(i, newvals)`
  return func(*args, **kwargs)
/home/ec2-user/SageMaker/.envs/mykernel/lib/python3.9/site-packages/scipy/stats/_stats_py.py:112: RuntimeWarning: The input array could not be properly checked for nan values. nan values will be ignored.
  warnings.warn("The input array could not be properly "
/home/ec2-user/SageMaker/.envs/mykernel/lib/python3.9/site-packages/scipy/stats/_stats_py.py:4881: ConstantInputWarning: An input array is constant; the correlation coefficient is not defined.
  warnings.warn(stats.ConstantInputWarning(warn_msg))
/home/ec2-user/SageMaker/.envs/mykernel/lib/python3.9/site-packages/pandas_profiling/model/correlations.py:67: UserWarning: There was an attempt to calculate the auto correlation, but this failed.
To hide this warning, disable the calculation
(using `df.profile_report(correlations={"auto": {"calculate": False}})`
If this is problematic for your use case, please report this as an issue:
https://github.com/ydataai/pandas-profiling/issues
(include the error message: 'No data; `observed` has size 0.')
  warnings.warn(
Summarize dataset:  98%|█████████▊| 428/438 [28:20<32:48, 196.80s/it, Calculate spearman correlation]
/home/ec2-user/SageMaker/.envs/mykernel/lib/python3.9/site-packages/multimethod/__init__.py:315: FutureWarning: The default value of numeric_only in DataFrame.corr is deprecated. In a future version, it will default to False. Select only valid columns or specify the value of numeric_only to silence this warning.
  return func(*args, **kwargs)
Summarize dataset:  98%|█████████▊| 430/438 [30:55<21:07, 158.47s/it, Calculate kendall correlation]
/home/ec2-user/SageMaker/.envs/mykernel/lib/python3.9/site-packages/scipy/stats/_stats_py.py:5218: RuntimeWarning: overflow encountered in long_scalars
  (2 * xtie * ytie) / m + x0 * y0 / (9 * m * (size - 2)))
/home/ec2-user/SageMaker/.envs/mykernel/lib/python3.9/site-packages/scipy/stats/_stats_py.py:5219: RuntimeWarning: invalid value encountered in sqrt
  z = con_minus_dis / np.sqrt(var)
Summarize dataset:  99%|█████████▊| 432/438 [45:40<00:38,  6.34s/it, Calculate phi_k correlation]   
---------------------------------------------------------------------------
_RemoteTraceback                          Traceback (most recent call last)
_RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/home/ec2-user/SageMaker/.envs/mykernel/lib/python3.9/site-packages/joblib/externals/loky/backend/queues.py", line 125, in _feed
    obj_ = dumps(obj, reducers=reducers)
  File "/home/ec2-user/SageMaker/.envs/mykernel/lib/python3.9/site-packages/joblib/externals/loky/backend/reduction.py", line 211, in dumps
    dump(obj, buf, reducers=reducers, protocol=protocol)
  File "/home/ec2-user/SageMaker/.envs/mykernel/lib/python3.9/site-packages/joblib/externals/loky/backend/reduction.py", line 204, in dump
    _LokyPickler(file, reducers=reducers, protocol=protocol).dump(obj)
  File "/home/ec2-user/SageMaker/.envs/mykernel/lib/python3.9/site-packages/joblib/externals/cloudpickle/cloudpickle_fast.py", line 632, in dump
    return Pickler.dump(self, obj)
  File "/home/ec2-user/SageMaker/.envs/mykernel/lib/python3.9/site-packages/joblib/_memmapping_reducer.py", line 446, in __call__
    for dumped_filename in dump(a, filename):
  File "/home/ec2-user/SageMaker/.envs/mykernel/lib/python3.9/site-packages/joblib/numpy_pickle.py", line 553, in dump
    NumpyPickler(f, protocol=protocol).dump(value)
  File "/home/ec2-user/SageMaker/.envs/mykernel/lib/python3.9/pickle.py", line 487, in dump
    self.save(obj)
  File "/home/ec2-user/SageMaker/.envs/mykernel/lib/python3.9/site-packages/joblib/numpy_pickle.py", line 352, in save
    wrapper.write_array(obj, self)
  File "/home/ec2-user/SageMaker/.envs/mykernel/lib/python3.9/site-packages/joblib/numpy_pickle.py", line 134, in write_array
    pickler.file_handle.write(chunk.tobytes('C'))
OSError: [Errno 28] No space left on device
"""

The above exception was the direct cause of the following exception:

PicklingError                             Traceback (most recent call last)
<ipython-input-9-34649000e9e9> in <module>
      1 profile = ProfileReport(df_perf_18, title="MyReport")
----> 2 profile.to_file(f"s3://sf-puas-prod-use1-pc/fire/research/home_telematics/adt/analysis/MyReport.html")

~/SageMaker/.envs/mykernel/lib/python3.9/site-packages/typeguard/__init__.py in wrapper(*args, **kwargs)
   1031         memo = _CallMemo(python_func, _localns, args=args, kwargs=kwargs)
   1032         check_argument_types(memo)
-> 1033         retval = func(*args, **kwargs)
   1034         try:
   1035             check_return_type(retval, memo)

~/SageMaker/.envs/mykernel/lib/python3.9/site-packages/pandas_profiling/profile_report.py in to_file(self, output_file, silent)
    307                 create_html_assets(self.config, output_file)
    308 
--> 309             data = self.to_html()
    310 
    311             if output_file.suffix != ".html":

~/SageMaker/.envs/mykernel/lib/python3.9/site-packages/typeguard/__init__.py in wrapper(*args, **kwargs)
   1031         memo = _CallMemo(python_func, _localns, args=args, kwargs=kwargs)
   1032         check_argument_types(memo)
-> 1033         retval = func(*args, **kwargs)
   1034         try:
   1035             check_return_type(retval, memo)

~/SageMaker/.envs/mykernel/lib/python3.9/site-packages/pandas_profiling/profile_report.py in to_html(self)
    418 
    419         """
--> 420         return self.html
    421 
    422     def to_json(self) -> str:

~/SageMaker/.envs/mykernel/lib/python3.9/site-packages/typeguard/__init__.py in wrapper(*args, **kwargs)
   1031         memo = _CallMemo(python_func, _localns, args=args, kwargs=kwargs)
   1032         check_argument_types(memo)
-> 1033         retval = func(*args, **kwargs)
   1034         try:
   1035             check_return_type(retval, memo)

~/SageMaker/.envs/mykernel/lib/python3.9/site-packages/pandas_profiling/profile_report.py in html(self)
    229     def html(self) -> str:
    230         if self._html is None:
--> 231             self._html = self._render_html()
    232         return self._html
    233 

~/SageMaker/.envs/mykernel/lib/python3.9/site-packages/typeguard/__init__.py in wrapper(*args, **kwargs)
   1031         memo = _CallMemo(python_func, _localns, args=args, kwargs=kwargs)
   1032         check_argument_types(memo)
-> 1033         retval = func(*args, **kwargs)
   1034         try:
   1035             check_return_type(retval, memo)

~/SageMaker/.envs/mykernel/lib/python3.9/site-packages/pandas_profiling/profile_report.py in _render_html(self)
    337         from pandas_profiling.report.presentation.flavours import HTMLReport
    338 
--> 339         report = self.report
    340 
    341         with tqdm(

~/SageMaker/.envs/mykernel/lib/python3.9/site-packages/typeguard/__init__.py in wrapper(*args, **kwargs)
   1031         memo = _CallMemo(python_func, _localns, args=args, kwargs=kwargs)
   1032         check_argument_types(memo)
-> 1033         retval = func(*args, **kwargs)
   1034         try:
   1035             check_return_type(retval, memo)

~/SageMaker/.envs/mykernel/lib/python3.9/site-packages/pandas_profiling/profile_report.py in report(self)
    223     def report(self) -> Root:
    224         if self._report is None:
--> 225             self._report = get_report_structure(self.config, self.description_set)
    226         return self._report
    227 

~/SageMaker/.envs/mykernel/lib/python3.9/site-packages/typeguard/__init__.py in wrapper(*args, **kwargs)
   1031         memo = _CallMemo(python_func, _localns, args=args, kwargs=kwargs)
   1032         check_argument_types(memo)
-> 1033         retval = func(*args, **kwargs)
   1034         try:
   1035             check_return_type(retval, memo)

~/SageMaker/.envs/mykernel/lib/python3.9/site-packages/pandas_profiling/profile_report.py in description_set(self)
    205     def description_set(self) -> Dict[str, Any]:
    206         if self._description_set is None:
--> 207             self._description_set = describe_df(
    208                 self.config,
    209                 self.df,

~/SageMaker/.envs/mykernel/lib/python3.9/site-packages/pandas_profiling/model/describe.py in describe(config, df, summarizer, typeset, sample)
     93         pbar.total += len(correlation_names)
     94 
---> 95         correlations = {
     96             correlation_name: progress(
     97                 calculate_correlation, pbar, f"Calculate {correlation_name} correlation"

~/SageMaker/.envs/mykernel/lib/python3.9/site-packages/pandas_profiling/model/describe.py in <dictcomp>(.0)
     94 
     95         correlations = {
---> 96             correlation_name: progress(
     97                 calculate_correlation, pbar, f"Calculate {correlation_name} correlation"
     98             )(config, df, correlation_name, series_description)

~/SageMaker/.envs/mykernel/lib/python3.9/site-packages/pandas_profiling/utils/progress_bar.py in inner(*args, **kwargs)
      9     def inner(*args, **kwargs) -> Any:
     10         bar.set_postfix_str(message)
---> 11         ret = fn(*args, **kwargs)
     12         bar.update()
     13         return ret

~/SageMaker/.envs/mykernel/lib/python3.9/site-packages/pandas_profiling/model/correlations.py in calculate_correlation(config, df, correlation_name, summary)
    105     correlation = None
    106     try:
--> 107         correlation = correlation_measures[correlation_name].compute(
    108             config, df, summary
    109         )

~/SageMaker/.envs/mykernel/lib/python3.9/site-packages/multimethod/__init__.py in __call__(self, *args, **kwargs)
    313         func = self[tuple(func(arg) for func, arg in zip(self.type_checkers, args))]
    314         try:
--> 315             return func(*args, **kwargs)
    316         except TypeError as ex:
    317             raise DispatchError(f"Function {func.__code__}") from ex

~/SageMaker/.envs/mykernel/lib/python3.9/site-packages/pandas_profiling/model/pandas/correlations_pandas.py in pandas_phik_compute(config, df, summary)
    152         from phik import phik_matrix
    153 
--> 154         correlation = phik_matrix(df[selected_cols], interval_cols=list(intcols))
    155 
    156     return correlation

~/SageMaker/.envs/mykernel/lib/python3.9/site-packages/phik/phik.py in phik_matrix(df, interval_cols, bins, quantile, noise_correction, dropna, drop_underflow, drop_overflow, verbose, njobs)
    254         verbose=verbose,
    255     )
--> 256     return phik_from_rebinned_df(
    257         data_binned,
    258         noise_correction,

~/SageMaker/.envs/mykernel/lib/python3.9/site-packages/phik/phik.py in phik_from_rebinned_df(data_binned, noise_correction, dropna, drop_underflow, drop_overflow, njobs)
    164         ]
    165     else:
--> 166         phik_list = Parallel(n_jobs=njobs)(
    167             delayed(_calc_phik)(co, data_binned[list(co)], noise_correction)
    168             for co in itertools.combinations_with_replacement(

~/SageMaker/.envs/mykernel/lib/python3.9/site-packages/joblib/parallel.py in __call__(self, iterable)
   1096 
   1097             with self._backend.retrieval_context():
-> 1098                 self.retrieve()
   1099             # Make sure that we get a last message telling us we are done
   1100             elapsed_time = time.time() - self._start_time

~/SageMaker/.envs/mykernel/lib/python3.9/site-packages/joblib/parallel.py in retrieve(self)
    973             try:
    974                 if getattr(self._backend, 'supports_timeout', False):
--> 975                     self._output.extend(job.get(timeout=self.timeout))
    976                 else:
    977                     self._output.extend(job.get())

~/SageMaker/.envs/mykernel/lib/python3.9/site-packages/joblib/_parallel_backends.py in wrap_future_result(future, timeout)
    565         AsyncResults.get from multiprocessing."""
    566         try:
--> 567             return future.result(timeout=timeout)
    568         except CfTimeoutError as e:
    569             raise TimeoutError from e

~/SageMaker/.envs/mykernel/lib/python3.9/concurrent/futures/_base.py in result(self, timeout)
    436                     raise CancelledError()
    437                 elif self._state == FINISHED:
--> 438                     return self.__get_result()
    439 
    440                 self._condition.wait(timeout)

~/SageMaker/.envs/mykernel/lib/python3.9/concurrent/futures/_base.py in __get_result(self)
    388         if self._exception:
    389             try:
--> 390                 raise self._exception
    391             finally:
    392                 # Break a reference cycle with the exception in self._exception

PicklingError: Could not pickle the task to send it to the workers.

I'm on the latest version of pandas-profiling (just installed it today).

fabclmnt commented 1 year ago

@JohnTravolski Pandas profiling does not provide a picklable object.

After generating your report, I suggest you save it as HTML and use the HTML file itself instead of the report object, which is not picklable.
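For anyone who wants the HTML to end up in S3: the traceback in this thread shows `to_file` reading `output_file.suffix`, which suggests the target is handled as a local path rather than an S3 URI. A minimal sketch of one way around that, assuming boto3 is available in the kernel and the notebook role can write to the bucket, is to render the report with `profile.to_html()` and upload the string yourself. The helper names `split_s3_uri` and `upload_html` are made up for illustration and are not part of pandas-profiling.

```python
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> tuple[str, str]:
    """Split 's3://bucket/some/key.html' into ('bucket', 'some/key.html')."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc or not parsed.path.strip("/"):
        raise ValueError(f"not a valid S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def upload_html(html: str, uri: str, s3_client) -> None:
    """Upload an HTML string to S3 using a boto3-style client (hypothetical helper)."""
    bucket, key = split_s3_uri(uri)
    s3_client.put_object(
        Bucket=bucket,
        Key=key,
        Body=html.encode("utf-8"),
        ContentType="text/html",
    )


# Usage in the notebook (assumption: `profile` is an existing ProfileReport):
#   import boto3
#   upload_html(profile.to_html(), "s3://myfolder/myreport.html", boto3.client("s3"))
```

This only changes where the finished HTML goes. It does not affect the disk-space error, which is raised earlier, while the correlations are being computed.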

You might also want to check whether the machine you're working on has enough free disk space for the process, given the error being raised: OSError: [Errno 28] No space left on device
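A quick way to check this from the notebook is sketched below. It prints the free space in a few directories. The list of candidates is an assumption: the traceback shows joblib dumping an array to a file before handing it to the worker processes, and `/dev/shm` and the system temp directory are the usual places for such scratch files, but the folder actually used on a given instance may differ.

```python
import os
import shutil
import tempfile


def free_space_report(paths):
    """Return {path: free space in GiB} for each path that exists as a directory."""
    report = {}
    for path in paths:
        if os.path.isdir(path):
            usage = shutil.disk_usage(path)
            report[path] = round(usage.free / 2**30, 2)
    return report


# Candidate scratch locations (assumed, adjust for your instance).
candidates = ["/dev/shm", tempfile.gettempdir(), os.path.expanduser("~")]
for path, free_gib in free_space_report(candidates).items():
    print(f"{path}: {free_gib} GiB free")
```

If one of these is close to zero while the report is running, that is a likely source of the `Errno 28`.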

In order to provide better support I might need to have more details on what you aim to achieve. Join us at the data-centric AI community (https://datacentricai.community/) and I'll be happy to further discuss the topic.

JohnTravolski commented 1 year ago

I apologize, I have updated my original post. I did attempt to save the report as HTML, and this is the error that was returned. I am simply trying to generate the typical HTML report. I am not very familiar with the AWS environment, but I will try to figure out the "no space left on device" issue. I was hoping somebody else who has used this on AWS had run into this before.
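One thing that may be worth trying for the "no space left on device" part, offered as a sketch rather than a confirmed fix: joblib reads the `JOBLIB_TEMP_FOLDER` environment variable to decide where to put its scratch files, so it can be pointed at a directory on a volume with more room. The helper name `use_joblib_temp_folder` and the example path are hypothetical, and the variable should be set before the report is computed.

```python
import os
from pathlib import Path


def use_joblib_temp_folder(folder: str) -> Path:
    """Create `folder` if needed and tell joblib to use it for scratch files."""
    path = Path(folder).expanduser()
    path.mkdir(parents=True, exist_ok=True)
    # Documented joblib environment variable for the memmapping temp folder.
    os.environ["JOBLIB_TEMP_FOLDER"] = str(path)
    return path


# Usage (assumption: ~/SageMaker sits on the larger notebook volume):
#   use_joblib_temp_folder("~/SageMaker/joblib_tmp")
#   profile = ProfileReport(df, title="MyReport")
#   profile.to_file("MyReport.html")
```

If that does not help, the phi_k step that fails here can probably be skipped in the same way the warning in the log suggests for the auto correlation, for example `ProfileReport(df, title="MyReport", correlations={"phi_k": {"calculate": False}})`, at the cost of losing that correlation matrix in the report.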

fabclmnt commented 1 year ago

@JohnTravolski can you please provide more detail on the flow you are building, so I can better support you?