Open alexandreczg opened 1 year ago
Update after reading the Databricks post on this. Changing the report config to:
profile = ProfileReport(
    df,
    title="Profiling Report",
    infer_dtypes=False,
    interactions=None,
    missing_diagrams=None,
    correlations={
        "auto": {"calculate": False},
        "pearson": {"calculate": True},
        "spearman": {"calculate": True},
    },
)
now renders the following error:
An error was encountered:
unsupported format string passed to NoneType.__format__
Traceback (most recent call last):
File "/usr/local/lib/python3.10/site-packages/typeguard/__init__.py", line 1033, in wrapper
retval = func(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/ydata_profiling/profile_report.py", line 353, in to_file
data = self.to_html()
File "/usr/local/lib/python3.10/site-packages/typeguard/__init__.py", line 1033, in wrapper
retval = func(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/ydata_profiling/profile_report.py", line 468, in to_html
return self.html
File "/usr/local/lib/python3.10/site-packages/typeguard/__init__.py", line 1033, in wrapper
retval = func(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/ydata_profiling/profile_report.py", line 275, in html
self._html = self._render_html()
File "/usr/local/lib/python3.10/site-packages/typeguard/__init__.py", line 1033, in wrapper
retval = func(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/ydata_profiling/profile_report.py", line 383, in _render_html
report = self.report
File "/usr/local/lib/python3.10/site-packages/typeguard/__init__.py", line 1033, in wrapper
retval = func(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/ydata_profiling/profile_report.py", line 269, in report
self._report = get_report_structure(self.config, self.description_set)
File "/usr/local/lib/python3.10/site-packages/ydata_profiling/report/structure/report.py", line 386, in get_report_structure
render_variables_section(config, summary),
File "/usr/local/lib/python3.10/site-packages/ydata_profiling/report/structure/report.py", line 161, in render_variables_section
template_variables.update(render_map_type(config, template_variables))
File "/usr/local/lib/python3.10/site-packages/ydata_profiling/report/structure/variables/render_real.py", line 199, in render_real
"value": fmt_numeric(
File "/usr/local/lib/python3.10/site-packages/ydata_profiling/report/formatters.py", line 26, in inner
return func(arg, *args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/ydata_profiling/report/formatters.py", line 232, in fmt_numeric
fmtted = f"{{:.{precision}g}}".format(value)
TypeError: unsupported format string passed to NoneType.__format__
Another error, this time with a different CSV file, since I'm testing things out. It seems that if the files are not exactly pristine, the profiling fails with errors that shouldn't happen, such as this ZeroDivisionError:
An error was encountered:
division by zero
Traceback (most recent call last):
File "/usr/local/lib/python3.10/site-packages/typeguard/__init__.py", line 1033, in wrapper
retval = func(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/ydata_profiling/profile_report.py", line 353, in to_file
data = self.to_html()
File "/usr/local/lib/python3.10/site-packages/typeguard/__init__.py", line 1033, in wrapper
retval = func(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/ydata_profiling/profile_report.py", line 468, in to_html
return self.html
File "/usr/local/lib/python3.10/site-packages/typeguard/__init__.py", line 1033, in wrapper
retval = func(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/ydata_profiling/profile_report.py", line 275, in html
self._html = self._render_html()
File "/usr/local/lib/python3.10/site-packages/typeguard/__init__.py", line 1033, in wrapper
retval = func(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/ydata_profiling/profile_report.py", line 383, in _render_html
report = self.report
File "/usr/local/lib/python3.10/site-packages/typeguard/__init__.py", line 1033, in wrapper
retval = func(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/ydata_profiling/profile_report.py", line 269, in report
self._report = get_report_structure(self.config, self.description_set)
File "/usr/local/lib/python3.10/site-packages/typeguard/__init__.py", line 1033, in wrapper
retval = func(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/ydata_profiling/profile_report.py", line 251, in description_set
self._description_set = describe_df(
File "/usr/local/lib/python3.10/site-packages/ydata_profiling/model/describe.py", line 72, in describe
series_description = get_series_descriptions(
File "/usr/local/lib/python3.10/site-packages/multimethod/__init__.py", line 315, in __call__
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/ydata_profiling/model/spark/summary_spark.py", line 92, in spark_get_series_descriptions
for i, (column, description) in enumerate(
File "/usr/local/lib/python3.10/multiprocessing/pool.py", line 873, in next
raise value
File "/usr/local/lib/python3.10/multiprocessing/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "/usr/local/lib/python3.10/site-packages/ydata_profiling/model/spark/summary_spark.py", line 88, in multiprocess_1d
return column, describe_1d(config, df.select(column), summarizer, typeset)
File "/usr/local/lib/python3.10/site-packages/multimethod/__init__.py", line 315, in __call__
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/ydata_profiling/model/spark/summary_spark.py", line 64, in spark_describe_1d
return summarizer.summarize(config, series, dtype=vtype)
File "/usr/local/lib/python3.10/site-packages/ydata_profiling/model/summarizer.py", line 42, in summarize
_, _, summary = self.handle(str(dtype), config, series, {"type": str(dtype)})
File "/usr/local/lib/python3.10/site-packages/ydata_profiling/model/handler.py", line 62, in handle
return op(*args)
File "/usr/local/lib/python3.10/site-packages/ydata_profiling/model/handler.py", line 21, in func2
return f(*res)
File "/usr/local/lib/python3.10/site-packages/ydata_profiling/model/handler.py", line 21, in func2
return f(*res)
File "/usr/local/lib/python3.10/site-packages/ydata_profiling/model/handler.py", line 17, in func2
res = g(*x)
File "/usr/local/lib/python3.10/site-packages/multimethod/__init__.py", line 315, in __call__
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/ydata_profiling/model/spark/describe_supported_spark.py", line 31, in describe_supported_spark
summary["p_unique"] = n_unique / count
ZeroDivisionError: division by zero
I believe this can be fixed by adding a value check here just like it is done on L26#
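A minimal sketch of the kind of value check meant here, assuming the failing line is the `n_unique / count` division shown in the traceback. `unique_ratio` is a hypothetical helper name, not part of ydata-profiling, and returning 0.0 for an empty column is an assumption; the maintainers may prefer a different fallback.

```python
def unique_ratio(n_unique: int, count: int) -> float:
    """Share of distinct values, falling back to 0.0 when the column has no values.

    Hypothetical helper: the 0.0 fallback is an assumption, not the library's behaviour.
    """
    # Guard the division so a column with zero non-null values cannot raise
    # ZeroDivisionError.
    return n_unique / count if count > 0 else 0.0


print(unique_ratio(3, 4))  # 0.75
print(unique_ratio(0, 0))  # 0.0
```

With a guard like this, the assignment in the traceback would read `summary["p_unique"] = unique_ratio(n_unique, count)`.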
Hi @alexandreczg ,
Support for ydata-profiling with Spark is included and provided in version 4.2.0. I'm a bit surprised by this. Do you mean that installing ydata-profiling[pyspark] is not working?
Do you already have a Spark cluster in use?
Also, can you please make sure that you use this configuration when using Spark dataframes: https://www.databricks.com/blog/2023/04/03/pandas-profiling-now-supports-apache-spark.html
Some of the ydata-profiling features available for pandas DataFrames are not yet supported for Spark DataFrames and might lead to errors.
Hi @fabclmnt , thank you for getting back to me.
I just ran it again with a different file, but got the same outcome. This is my df schema and row count. As you can see, nothing abnormal.
The profiling code:
from ydata_profiling import ProfileReport
report = ProfileReport(
    df,
    title='Testing ydata-profiling',
    infer_dtypes=False,
    interactions=None,
    missing_diagrams=None,
    correlations={
        "auto": {
            "calculate": False
        },
        "pearson": {
            "calculate": True
        },
        "spearman": {
            "calculate": True
        }
    }
)
report.to_file(f"{filename}.html")
report.to_file(f"{filename}.json")
The error:
An error was encountered:
unsupported format string passed to NoneType.__format__
Traceback (most recent call last):
File "/usr/local/lib/python3.10/site-packages/typeguard/__init__.py", line 1033, in wrapper
retval = func(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/ydata_profiling/profile_report.py", line 353, in to_file
data = self.to_html()
File "/usr/local/lib/python3.10/site-packages/typeguard/__init__.py", line 1033, in wrapper
retval = func(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/ydata_profiling/profile_report.py", line 468, in to_html
return self.html
File "/usr/local/lib/python3.10/site-packages/typeguard/__init__.py", line 1033, in wrapper
retval = func(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/ydata_profiling/profile_report.py", line 275, in html
self._html = self._render_html()
File "/usr/local/lib/python3.10/site-packages/typeguard/__init__.py", line 1033, in wrapper
retval = func(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/ydata_profiling/profile_report.py", line 383, in _render_html
report = self.report
File "/usr/local/lib/python3.10/site-packages/typeguard/__init__.py", line 1033, in wrapper
retval = func(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/ydata_profiling/profile_report.py", line 269, in report
self._report = get_report_structure(self.config, self.description_set)
File "/usr/local/lib/python3.10/site-packages/ydata_profiling/report/structure/report.py", line 386, in get_report_structure
render_variables_section(config, summary),
File "/usr/local/lib/python3.10/site-packages/ydata_profiling/report/structure/report.py", line 161, in render_variables_section
template_variables.update(render_map_type(config, template_variables))
File "/usr/local/lib/python3.10/site-packages/ydata_profiling/report/structure/variables/render_real.py", line 199, in render_real
"value": fmt_numeric(
File "/usr/local/lib/python3.10/site-packages/ydata_profiling/report/formatters.py", line 26, in inner
return func(arg, *args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/ydata_profiling/report/formatters.py", line 232, in fmt_numeric
fmtted = f"{{:.{precision}g}}".format(value)
TypeError: unsupported format string passed to NoneType.__format__
I am running this on a Spark Cluster, yes. Company wide. These are my Spark specs:
spark-3.3.2_py-3.10.11_sc-2.12.15_jvm-1.8.0
Installing ydata-profiling[pyspark] emits the warning:
WARNING: ydata-profiling 4.2.0 does not provide the extra 'pyspark'
Additionally, in your docs you point to this Spark Example, but what is funny is that you convert the Spark DF to a pandas one, which leads me to think that this Spark integration is really not ready for production use. Thoughts?
That example is unfortunately outdated and predates the release with Spark support.
The example I sent you in my previous comment is the most up to date. Spark support was developed by the community and is still version 1, so there are a lot of bugs and limitations that are being grouped in the roadmap for Spark (https://github.com/orgs/ydataai/projects/16/views/2)
We are always looking for contributions to make it better. If you feel like this could be something interesting for you, we would be more than happy to have your contribution :)
Thanks @fabclmnt, I hear you. I'm gonna open an MR for the 2 bugs I found, which seem to have nothing to do with the Spark implementation, but more to do with how we are using the results of the Spark analysis. I won't venture into the Spark implementation itself since I don't really know enough about this library and how it uses the Spark API.
Can we confirm they are bugs and not the intended implementation? I am referring to these 2:
TypeError: unsupported format string passed to NoneType.__format__
and
ZeroDivisionError: division by zero
But this really raises the question: the same dataset, whether analyzed by Pandas or Spark, should return the same outcome. It seems this is not the case, which makes me wary of using it.
@alexandreczg I do confirm that those are bugs indeed and need to go into the list of things to have fixed. The zero division seems to be a quick win.
Regarding returning the same output: given the distributed nature of Spark, and even the differences in supported data types, it is hard to ensure the exact same output. Nevertheless, we should always try to guarantee close results.
I am running into this issue (division by zero) too. I think it has something to do with the dataframe containing columns where all values are undefined. I have attached a file that causes this issue: rdw-open-data.csv
raw = spark.read.format("csv").option("header", True).option("path", path).load()
ok = raw.select("kenteken","voertuigsoort", "merk")
bug = raw.select("kenteken","voertuigsoort", "merk", "gem_lading_wrde")
If I pass the bug dataframe to the profiler, I get the division by zero error. If you can't wait for the fix in the source, applying fillna to the dataframe circumvents the issue.
fixed = raw.fillna("")
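A variation on this workaround, as a rough sketch: rather than filling nulls, drop the columns that contain no values before profiling. `columns_without_values` and `drop_empty_columns` are hypothetical helper names, not part of ydata-profiling or PySpark. The sketch assumes the input is a PySpark DataFrame whose column names need no escaping (for example, no dots).

```python
from typing import Dict, List


def columns_without_values(non_null_counts: Dict[str, int]) -> List[str]:
    """Return the names of columns whose non-null count is zero."""
    return [name for name, count in non_null_counts.items() if count == 0]


def drop_empty_columns(df):
    """Drop all-null columns from a PySpark DataFrame (hypothetical helper)."""
    # Imported lazily so the pure-Python helper above works without PySpark.
    from pyspark.sql import functions as F

    # F.count(column) counts only non-null values, so 0 means "all null".
    counts = df.select([F.count(F.col(c)).alias(c) for c in df.columns]).first().asDict()
    empty = columns_without_values(counts)
    return df.drop(*empty) if empty else df


print(columns_without_values({"kenteken": 5, "merk": 5, "gem_lading_wrde": 0}))
# ['gem_lading_wrde']
```

Calling `drop_empty_columns(raw)` before building the report would remove a column such as gem_lading_wrde, at the cost of that column not appearing in the profile at all.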
I have the same problem with the "TypeError: unsupported format string passed to NoneType.__format__" error. Has anyone solved this?
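Not a confirmed fix, but the failure in the traceback can be illustrated with a None-tolerant formatter. This is a sketch only: `fmt_numeric_safe` and its `missing` placeholder are hypothetical and not part of ydata-profiling. Only the format expression is taken from the traceback.

```python
from typing import Optional


def fmt_numeric_safe(value: Optional[float], precision: int = 10, missing: str = "N/A") -> str:
    """Hypothetical None-tolerant variant of the formatting step in the traceback."""
    if value is None:
        # A statistic that could not be computed is shown as a placeholder
        # instead of raising TypeError.
        return missing
    # Same format expression as the line that fails in formatters.py.
    return f"{{:.{precision}g}}".format(value)


print(fmt_numeric_safe(3.14159, precision=3))  # 3.14
print(fmt_numeric_safe(None))  # N/A
```

Whether a placeholder is the right behaviour, or whether the missing statistic should be computed upstream, is a design decision for the maintainers.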
Current Behaviour
Just trying a basic CSV file profiling in a Spark cluster and it's erroring out with the following:
Additionally, installing the extra dependency also fails:
with:
Not sure if this is a documentation issue, but it's on the readme.
Expected Behaviour
It should just generate the profile. It works if the dataframe is from Pandas.
Data Description
The titanic dataset. https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv
Code that reproduces the bug
pandas-profiling version
v4.2.0
Dependencies
OS
Spark Cluster
Checklist