Open Fgoudarzi opened 5 months ago
I'm also getting the behavior described above in Databricks, using numpy 1.23.5 and ydata_profiling 4.5.1.
I'm using a Personal Compute cluster with 15.2 ML Runtime, 28 GB Memory and 8 Active Cores at 1.5 DBU / h.
For thoroughness, I also ran a few tests on Azure Synapse Analytics (ASA) [without Databricks].
If I run this code in ASA:
from ydata_profiling import ProfileReport
from pyspark.sql import Row
df1 = spark.createDataFrame([
Row(c1='Ali',c2='Brown'),
Row(c1='John',c2='Brown'),
Row(c1='Sara',c2='Brown')
])
p2 = ProfileReport(df1)
p2
I get the error: Py4JJavaError: An error occurred while calling z:org.apache.spark.ml.stat.Correlation.corr. : java.lang.RuntimeException: Cannot determine the number of cols because it is not specified in the constructor and the rows RDD is empty.
But if I simply add a numeric column at the end (per the suggestion from the author of the anomaly report):
from ydata_profiling import ProfileReport
from pyspark.sql import Row
df1 = spark.createDataFrame([
Row(c1='Ali',c2='Brown',c3=1),
Row(c1='John',c2='Brown',c3=2),
Row(c1='Sara',c2='Brown',c3=3)
])
p2 = ProfileReport(df1)
p2
It runs fine...
I talked to the author of this anomaly report and understood her to say that ProfileReport will probably fail when all of the spark.createDataFrame columns are strings.
This behavior seems to be happening in both Azure Databricks and ASA Spark.
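As a possible stopgap, here is a minimal sketch that checks whether a Spark dataframe has any numeric column before profiling it. The helper name `has_numeric_column` is my own and not part of ydata_profiling; it only inspects the `(name, type)` pairs that `df.dtypes` returns, and the list of numeric type names is my assumption about what Spark reports.

```python
# Hypothetical helper (not part of ydata_profiling): decide whether a Spark
# dataframe has at least one numeric column, based on df.dtypes output.

# Type names as Spark reports them in DataFrame.dtypes (assumed list).
NUMERIC_TYPES = {"tinyint", "smallint", "int", "bigint", "float", "double"}


def has_numeric_column(dtypes):
    """Return True if any (name, type) pair describes a numeric column."""
    for _name, type_name in dtypes:
        # Decimal types carry precision and scale, e.g. "decimal(10,2)".
        if type_name in NUMERIC_TYPES or type_name.startswith("decimal"):
            return True
    return False


# Usage in a notebook (needs a live Spark session, so left commented out):
# from ydata_profiling import ProfileReport
# if has_numeric_column(df1.dtypes):
#     p2 = ProfileReport(df1)
# else:
#     # Fall back to pandas, which generates the report for string-only data.
#     p2 = ProfileReport(df1.toPandas())
```

The pandas fallback collects the whole dataframe to the driver, so it is only reasonable for small data like the examples above.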
Spark Dependencies: ydata_profiling 4.8.3, numpy 1.23.5
Spark Pool Settings:
Hi @Fgoudarzi ,
thank you for your request. Have you tried to generate the report while following this tutorial? https://www.databricks.com/blog/2023/04/03/pandas-profiling-now-supports-apache-spark.html
Current Behaviour
I'm creating a very simple Spark dataframe with only one column. ProfileReport does not generate the report when I run it in a Databricks notebook. Below is the code that I'm using:
But if I convert the dataframe to pandas, it generates the report:
Expected Behaviour
Generate the report as it does when I convert the Spark dataframe to pandas.
Data Description
Generated in the code.
Code that reproduces the bug
pandas-profiling version
ydata_profiling = 4.8.3
Dependencies
OS
Windows 11 Enterprise
Checklist