ydataai / ydata-profiling

1 Line of code data quality profiling & exploratory data analysis for Pandas and Spark DataFrames.
https://docs.profiling.ydata.ai
MIT License

Profiling in Spark cluster erroring out #1350

Open alexandreczg opened 1 year ago

alexandreczg commented 1 year ago

Current Behaviour

Just trying a basic CSV file profiling in a Spark cluster and it's erroring out with the following:

An error was encountered:
Function <code object spark_get_series_descriptions at 0x7f47ec720870, file "/usr/local/lib/python3.10/site-packages/ydata_profiling/model/spark/summary_spark.py", line 67>
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/typeguard/__init__.py", line 1033, in wrapper
    retval = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/ydata_profiling/profile_report.py", line 353, in to_file
    data = self.to_html()
  File "/usr/local/lib/python3.10/site-packages/typeguard/__init__.py", line 1033, in wrapper
    retval = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/ydata_profiling/profile_report.py", line 468, in to_html
    return self.html
  File "/usr/local/lib/python3.10/site-packages/typeguard/__init__.py", line 1033, in wrapper
    retval = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/ydata_profiling/profile_report.py", line 275, in html
    self._html = self._render_html()
  File "/usr/local/lib/python3.10/site-packages/typeguard/__init__.py", line 1033, in wrapper
    retval = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/ydata_profiling/profile_report.py", line 383, in _render_html
    report = self.report
  File "/usr/local/lib/python3.10/site-packages/typeguard/__init__.py", line 1033, in wrapper
    retval = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/ydata_profiling/profile_report.py", line 269, in report
    self._report = get_report_structure(self.config, self.description_set)
  File "/usr/local/lib/python3.10/site-packages/typeguard/__init__.py", line 1033, in wrapper
    retval = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/ydata_profiling/profile_report.py", line 251, in description_set
    self._description_set = describe_df(
  File "/usr/local/lib/python3.10/site-packages/ydata_profiling/model/describe.py", line 72, in describe
    series_description = get_series_descriptions(
  File "/usr/local/lib/python3.10/site-packages/multimethod/__init__.py", line 317, in __call__
    raise DispatchError(f"Function {func.__code__}") from ex
multimethod.DispatchError: Function <code object spark_get_series_descriptions at 0x7f47ec720870, file "/usr/local/lib/python3.10/site-packages/ydata_profiling/model/spark/summary_spark.py", line 67>

Additionally, installing the extra dependency also fails:

pip install ydata-profiling[notebook,unicode,pyspark]==4.2.0

with:

WARNING: ydata-profiling 4.2.0 does not provide the extra 'pyspark'

Not sure if this is a documentation issue, but it's on the readme.
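
One way to tell a packaging problem from a documentation problem is to ask the installed distribution which extras it actually declares. This is a minimal sketch using only the standard library; `provided_extras` is a hypothetical helper name, not part of ydata-profiling:

```python
from importlib.metadata import PackageNotFoundError, metadata


def provided_extras(dist_name: str) -> list[str]:
    """Return the extras declared by an installed distribution.

    Returns an empty list if the distribution declares none or is not installed.
    """
    try:
        meta = metadata(dist_name)
    except PackageNotFoundError:
        return []
    # "Provides-Extra" appears once per declared extra in the package metadata.
    return sorted(meta.get_all("Provides-Extra") or [])


if __name__ == "__main__":
    extras = provided_extras("ydata-profiling")
    print("declared extras:", extras)
    print("'pyspark' extra declared:", "pyspark" in extras)
```

If `pyspark` is missing from the printed list, pip has nothing to install for that extra, which would match the warning above.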

Expected Behaviour

It should just generate the profile. It works if the dataframe is from Pandas.

Data Description

The titanic dataset. https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv

Code that reproduces the bug

from ydata_profiling import ProfileReport

df = spark.read.csv(path_to_csv, sep=r',', header=True, inferSchema=True)
profile = ProfileReport(df, title="Profiling Report")
profile.to_file("report.html")  # Fails
profile.to_file("report.json")  # Fails

pandas-profiling version

v4.2.0

Dependencies

spark-3.3.2_py-3.10.10_sc-2.12.15_jvm-11.0.16

numpy                 1.23.3
pandas                1.4.2

OS

Spark Cluster

alexandreczg commented 1 year ago

Update after reading the Databricks post on this. Changing the report config to:

profile = ProfileReport(df, title="Profiling Report", infer_dtypes=False,
                interactions=None,
                missing_diagrams=None,
                correlations={"auto": {"calculate": False},
                              "pearson": {"calculate": True},
                              "spearman": {"calculate": True}}
                       )

now renders the following error:

An error was encountered:
unsupported format string passed to NoneType.__format__
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/typeguard/__init__.py", line 1033, in wrapper
    retval = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/ydata_profiling/profile_report.py", line 353, in to_file
    data = self.to_html()
  File "/usr/local/lib/python3.10/site-packages/typeguard/__init__.py", line 1033, in wrapper
    retval = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/ydata_profiling/profile_report.py", line 468, in to_html
    return self.html
  File "/usr/local/lib/python3.10/site-packages/typeguard/__init__.py", line 1033, in wrapper
    retval = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/ydata_profiling/profile_report.py", line 275, in html
    self._html = self._render_html()
  File "/usr/local/lib/python3.10/site-packages/typeguard/__init__.py", line 1033, in wrapper
    retval = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/ydata_profiling/profile_report.py", line 383, in _render_html
    report = self.report
  File "/usr/local/lib/python3.10/site-packages/typeguard/__init__.py", line 1033, in wrapper
    retval = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/ydata_profiling/profile_report.py", line 269, in report
    self._report = get_report_structure(self.config, self.description_set)
  File "/usr/local/lib/python3.10/site-packages/ydata_profiling/report/structure/report.py", line 386, in get_report_structure
    render_variables_section(config, summary),
  File "/usr/local/lib/python3.10/site-packages/ydata_profiling/report/structure/report.py", line 161, in render_variables_section
    template_variables.update(render_map_type(config, template_variables))
  File "/usr/local/lib/python3.10/site-packages/ydata_profiling/report/structure/variables/render_real.py", line 199, in render_real
    "value": fmt_numeric(
  File "/usr/local/lib/python3.10/site-packages/ydata_profiling/report/formatters.py", line 26, in inner
    return func(arg, *args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/ydata_profiling/report/formatters.py", line 232, in fmt_numeric
    fmtted = f"{{:.{precision}g}}".format(value)
TypeError: unsupported format string passed to NoneType.__format__

alexandreczg commented 1 year ago

Another error, this time with a different CSV file, since I'm testing things out. It seems that if the files are not exactly pristine, the profiling goes out the window with errors that shouldn't happen, such as ZeroDivisionError:

An error was encountered:
division by zero
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/typeguard/__init__.py", line 1033, in wrapper
    retval = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/ydata_profiling/profile_report.py", line 353, in to_file
    data = self.to_html()
  File "/usr/local/lib/python3.10/site-packages/typeguard/__init__.py", line 1033, in wrapper
    retval = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/ydata_profiling/profile_report.py", line 468, in to_html
    return self.html
  File "/usr/local/lib/python3.10/site-packages/typeguard/__init__.py", line 1033, in wrapper
    retval = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/ydata_profiling/profile_report.py", line 275, in html
    self._html = self._render_html()
  File "/usr/local/lib/python3.10/site-packages/typeguard/__init__.py", line 1033, in wrapper
    retval = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/ydata_profiling/profile_report.py", line 383, in _render_html
    report = self.report
  File "/usr/local/lib/python3.10/site-packages/typeguard/__init__.py", line 1033, in wrapper
    retval = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/ydata_profiling/profile_report.py", line 269, in report
    self._report = get_report_structure(self.config, self.description_set)
  File "/usr/local/lib/python3.10/site-packages/typeguard/__init__.py", line 1033, in wrapper
    retval = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/ydata_profiling/profile_report.py", line 251, in description_set
    self._description_set = describe_df(
  File "/usr/local/lib/python3.10/site-packages/ydata_profiling/model/describe.py", line 72, in describe
    series_description = get_series_descriptions(
  File "/usr/local/lib/python3.10/site-packages/multimethod/__init__.py", line 315, in __call__
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/ydata_profiling/model/spark/summary_spark.py", line 92, in spark_get_series_descriptions
    for i, (column, description) in enumerate(
  File "/usr/local/lib/python3.10/multiprocessing/pool.py", line 873, in next
    raise value
  File "/usr/local/lib/python3.10/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/usr/local/lib/python3.10/site-packages/ydata_profiling/model/spark/summary_spark.py", line 88, in multiprocess_1d
    return column, describe_1d(config, df.select(column), summarizer, typeset)
  File "/usr/local/lib/python3.10/site-packages/multimethod/__init__.py", line 315, in __call__
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/ydata_profiling/model/spark/summary_spark.py", line 64, in spark_describe_1d
    return summarizer.summarize(config, series, dtype=vtype)
  File "/usr/local/lib/python3.10/site-packages/ydata_profiling/model/summarizer.py", line 42, in summarize
    _, _, summary = self.handle(str(dtype), config, series, {"type": str(dtype)})
  File "/usr/local/lib/python3.10/site-packages/ydata_profiling/model/handler.py", line 62, in handle
    return op(*args)
  File "/usr/local/lib/python3.10/site-packages/ydata_profiling/model/handler.py", line 21, in func2
    return f(*res)
  File "/usr/local/lib/python3.10/site-packages/ydata_profiling/model/handler.py", line 21, in func2
    return f(*res)
  File "/usr/local/lib/python3.10/site-packages/ydata_profiling/model/handler.py", line 17, in func2
    res = g(*x)
  File "/usr/local/lib/python3.10/site-packages/multimethod/__init__.py", line 315, in __call__
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/ydata_profiling/model/spark/describe_supported_spark.py", line 31, in describe_supported_spark
    summary["p_unique"] = n_unique / count
ZeroDivisionError: division by zero

I believe this can be fixed by adding a value check here, just like the one on L26.
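
A minimal sketch of such a guard, assuming `n_unique` and `count` are the plain integers used on the failing line of the traceback. `safe_p_unique` is a hypothetical helper, not the library's actual code, and falling back to 0 for an empty column is an assumed choice:

```python
def safe_p_unique(n_unique: int, count: int) -> float:
    """Share of unique values; 0.0 when there are no values to divide by."""
    # count can be 0 (for example, a column without any non-null values),
    # which is what makes `n_unique / count` raise ZeroDivisionError.
    return n_unique / count if count > 0 else 0.0


# Mirrors the failing assignment, with the guard applied.
summary = {"n_unique": 0, "count": 0}
summary["p_unique"] = safe_p_unique(summary["n_unique"], summary["count"])
```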

fabclmnt commented 1 year ago

Hi @alexandreczg ,

Support for ydata-profiling with Spark is included in version 4.2.0, so I'm a bit surprised by this. Do you mean that installing ydata-profiling[pyspark] is not working?

Do you already have a Spark cluster in use?

Also, can you please make sure that you use this configuration when using Spark dataframes: https://www.databricks.com/blog/2023/04/03/pandas-profiling-now-supports-apache-spark.html

Some of the ydata-profiling features available for pandas DataFrames are not yet supported for Spark and might lead to errors.

alexandreczg commented 1 year ago

Hi @fabclmnt , thank you for getting back to me.

I just ran it again with a different file, and the outcome is the same. This is my df schema and row count. As you can see, nothing abnormal.

[image: DataFrame schema and row count]

The profiling code:

from ydata_profiling import ProfileReport
report = ProfileReport(
    df,
    title='Testing ydata-profiling',
    infer_dtypes=False,
    interactions=None,
    missing_diagrams=None,
    correlations={
        "auto": {
            "calculate": False
        },
        "pearson": {
            "calculate": True
        },
        "spearman": {
            "calculate": True
        }
    }
)
report.to_file(f"{filename}.html")
report.to_file(f"{filename}.json")

The error:

An error was encountered:
unsupported format string passed to NoneType.__format__
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/typeguard/__init__.py", line 1033, in wrapper
    retval = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/ydata_profiling/profile_report.py", line 353, in to_file
    data = self.to_html()
  File "/usr/local/lib/python3.10/site-packages/typeguard/__init__.py", line 1033, in wrapper
    retval = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/ydata_profiling/profile_report.py", line 468, in to_html
    return self.html
  File "/usr/local/lib/python3.10/site-packages/typeguard/__init__.py", line 1033, in wrapper
    retval = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/ydata_profiling/profile_report.py", line 275, in html
    self._html = self._render_html()
  File "/usr/local/lib/python3.10/site-packages/typeguard/__init__.py", line 1033, in wrapper
    retval = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/ydata_profiling/profile_report.py", line 383, in _render_html
    report = self.report
  File "/usr/local/lib/python3.10/site-packages/typeguard/__init__.py", line 1033, in wrapper
    retval = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/ydata_profiling/profile_report.py", line 269, in report
    self._report = get_report_structure(self.config, self.description_set)
  File "/usr/local/lib/python3.10/site-packages/ydata_profiling/report/structure/report.py", line 386, in get_report_structure
    render_variables_section(config, summary),
  File "/usr/local/lib/python3.10/site-packages/ydata_profiling/report/structure/report.py", line 161, in render_variables_section
    template_variables.update(render_map_type(config, template_variables))
  File "/usr/local/lib/python3.10/site-packages/ydata_profiling/report/structure/variables/render_real.py", line 199, in render_real
    "value": fmt_numeric(
  File "/usr/local/lib/python3.10/site-packages/ydata_profiling/report/formatters.py", line 26, in inner
    return func(arg, *args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/ydata_profiling/report/formatters.py", line 232, in fmt_numeric
    fmtted = f"{{:.{precision}g}}".format(value)
TypeError: unsupported format string passed to NoneType.__format__
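
The traceback above ends with `None` being passed to the numeric formatter. As a rough illustration of what a guard could look like, here is a minimal sketch. `fmt_numeric_safe`, its default precision, and the `"N/A"` placeholder are assumptions made for this example, not ydata-profiling's API:

```python
from typing import Optional


def fmt_numeric_safe(
    value: Optional[float], precision: int = 10, missing: str = "N/A"
) -> str:
    """Format a number with `precision` significant digits, tolerating None."""
    if value is None:
        # A statistic that was never computed would otherwise crash str.format.
        return missing
    # Same format expression as the failing line in the traceback.
    return f"{{:.{precision}g}}".format(value)
```

With this guard, `fmt_numeric_safe(None)` returns the placeholder instead of raising, while regular numbers are formatted as before.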

I am running this on a Spark Cluster, yes. Company wide. These are my Spark specs:

spark-3.3.2_py-3.10.11_sc-2.12.15_jvm-1.8.0

Installing ydata-profiling[pyspark] emits the warning:

WARNING: ydata-profiling 4.2.0 does not provide the extra 'pyspark'

alexandreczg commented 1 year ago

Additionally, in your docs you point to this Spark Example but what is funny is that you convert the spark DF to a pandas one... leads me to think that this Spark integration is really not ready for production use. Thoughts?

fabclmnt commented 1 year ago

Additionally, in your docs you point to this Spark Example but what is funny is that you convert the spark DF to a pandas one... leads me to think that this Spark integration is really not ready for production use. Thoughts?

That example is unfortunately outdated and predates the release with Spark support.

The example I sent you in the previous comment is the most up to date. Spark support was developed by the community and is still at version 1, so there are a lot of bugs and limitations, which are being grouped in the roadmap for Spark (https://github.com/orgs/ydataai/projects/16/views/2)

We are always looking for contributions to make it better. If this sounds interesting to you, we would be more than happy to have your contribution :)

alexandreczg commented 1 year ago

Thanks @fabclmnt, I hear you. I'm going to open an MR for the 2 bugs I found, which seem to have nothing to do with the Spark implementation itself and more to do with how the results computed by Spark are used afterwards. I won't venture into the Spark implementation, since I don't know enough about this library and how it uses the Spark API.

Can we confirm they are bugs and not the intended implementation? I am referring to these 2:

TypeError: unsupported format string passed to NoneType.__format__

and

ZeroDivisionError: division by zero

But this raises a question: the same dataset should return the same outcome whether it is analyzed with Pandas or with Spark. That does not seem to be the case, which makes me wary of using it.

fabclmnt commented 1 year ago

@alexandreczg I do confirm that those are bugs indeed and need to go into the list of things to have fixed. The zero division seems to be a quick win.

Regarding returning the same output: given the distributed nature of Spark and the differences in supported data types, it is hard to ensure exactly the same output. Nevertheless, we should always try to guarantee close results.
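
To make "close results" checkable, one option is to compare the numeric statistics from both backends within a relative tolerance. This is only a sketch: `differing_stats`, the flat dict layout, the 1% default tolerance, and the sample numbers are assumptions for illustration, not something the library provides:

```python
import math


def differing_stats(pandas_stats: dict, spark_stats: dict, rel_tol: float = 1e-2) -> dict:
    """Return {name: (pandas_value, spark_value)} for stats that are missing or not close."""
    diffs = {}
    for name in sorted(set(pandas_stats) | set(spark_stats)):
        a, b = pandas_stats.get(name), spark_stats.get(name)
        if a is None or b is None:
            # Present in only one backend (or not computed).
            diffs[name] = (a, b)
        elif not math.isclose(a, b, rel_tol=rel_tol):
            diffs[name] = (a, b)
    return diffs


# Made-up numbers, only to show the shape of the result.
example = differing_stats({"mean": 1.0, "p50": 28.0}, {"mean": 1.001, "p50": 27.0})
```

Here `example` contains only `p50`, because the two means agree within the tolerance.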

dgcaron commented 1 year ago

I am running into this issue (division by 0) too. I think it has something to do with the dataframe containing columns where all values are undefined. I have attached a file that causes this issue: rdw-open-data.csv

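# `spark` is an existing SparkSession; header=True is assumed so that columns can be selected by name
raw = spark.read.format("csv").option("header", True).option("path", path).load()

# Profiles without error
ok = raw.select("kenteken", "voertuigsoort", "merk")

# Adding "gem_lading_wrde" (presumably all values undefined) triggers the division by zero
bug = raw.select("kenteken", "voertuigsoort", "merk", "gem_lading_wrde")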

If I pass the bug dataframe to the profiler, I get the division by 0 error. If you can't wait for the fix in the source, applying fillna to the dataframe circumvents the issue.

fixed = raw.fillna("")
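
An alternative to filling values is to find the columns that have no non-null values and leave them out of the profile. This is a minimal sketch, assuming the all-null hypothesis above is right. `all_null_columns` and `count_non_nulls` are hypothetical helper names, and the PySpark part needs a running Spark session:

```python
def all_null_columns(non_null_counts: dict) -> list:
    """Names of columns whose non-null count is zero."""
    return [name for name, n in non_null_counts.items() if n == 0]


def count_non_nulls(df) -> dict:
    """Non-null count per column of a PySpark DataFrame, computed in one aggregation."""
    from pyspark.sql import functions as F  # imported lazily; requires pyspark

    # F.count(column) ignores nulls, so an all-null column yields 0.
    # Backticks keep column names containing dots from being parsed as struct access.
    row = df.agg(*[F.count(F.col(f"`{c}`")).alias(c) for c in df.columns]).first()
    return row.asDict()


# Usage with the `raw` DataFrame from above (not executed here):
#   empty = all_null_columns(count_non_nulls(raw))
#   profile_input = raw.drop(*empty)
```

Unlike fillna(""), this keeps the remaining columns untouched, at the cost of dropping the empty ones from the report.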

mkarami-ms commented 5 months ago

I have the same problem with the "TypeError: unsupported format string passed to NoneType.__format__" error. Has anyone solved this?