ydataai / ydata-profiling

1 Line of code data quality profiling & exploratory data analysis for Pandas and Spark DataFrames.
https://docs.profiling.ydata.ai
MIT License

Databricks profiling report while using ydata-profiling #1605

Open Fgoudarzi opened 5 months ago

Fgoudarzi commented 5 months ago

Current Behaviour

I'm creating a very simple Spark DataFrame with only one column. ProfileReport does not generate the report when I run it in a Databricks notebook. Below is the code that I'm using:

from ydata_profiling import ProfileReport
from pyspark.sql import Row
df1 = spark.createDataFrame([
    Row(name='Ali'),
    Row(name='John'),
    Row(name='Sara'),
    Row(name='John')
])
p2 = ProfileReport(df1)
p2

S1

But if I convert the DataFrame to pandas, the report is generated: S2

Expected Behaviour

Generate the report as it does when I convert the Spark DataFrame to pandas.

Data Description

Generated in the code.

Code that reproduces the bug

from ydata_profiling import ProfileReport
from pyspark.sql import Row
df1 = spark.createDataFrame([
    Row(name='Ali'),
    Row(name='John'),
    Row(name='Sara'),
    Row(name='John')
])
p2 = ProfileReport(df1)
p2

pandas-profiling version

ydata_profiling = 4.8.3

Dependencies

ydata_profiling = 4.8.3
numpy = 1.24.4

OS

Windows 11 Enterprise

shawn-eary commented 5 months ago

I'm also seeing the behavior described above in Databricks, using numpy 1.23.5 and ydata_profiling 4.5.1.

I'm using a Personal Compute cluster with 15.2 ML Runtime, 28 GB Memory and 8 Active Cores at 1.5 DBU / h.

shawn-eary commented 5 months ago

For thoroughness, I also ran a few tests on Azure Synapse Analytics (ASA), without Databricks.

If I run this code in ASA:

from ydata_profiling import ProfileReport
from pyspark.sql import Row
df1 = spark.createDataFrame([
    Row(c1='Ali',c2='Brown'),
    Row(c1='John',c2='Brown'),
    Row(c1='Sara',c2='Brown')
])
p2 = ProfileReport(df1)
p2

I get the error: Py4JJavaError: An error occurred while calling z:org.apache.spark.ml.stat.Correlation.corr. : java.lang.RuntimeException: Cannot determine the number of cols because it is not specified in the constructor and the rows RDD is empty.

But if I simply add a numeric column at the end (per a suggestion from the author of this anomaly report):

from ydata_profiling import ProfileReport
from pyspark.sql import Row
df1 = spark.createDataFrame([
    Row(c1='Ali',c2='Brown',c3=1),
    Row(c1='John',c2='Brown',c3=2),
    Row(c1='Sara',c2='Brown',c3=3)
])
p2 = ProfileReport(df1)
p2

It runs fine...

I talked to the author of this anomaly report and understood her to say that ProfileReport will probably fail when all of the spark.createDataFrame columns are strings.

This behavior seems to be happening in both Azure Databricks and ASA Spark.
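To make the "all columns are strings" condition easy to check before profiling, here is a minimal sketch of a guard. It assumes the input is the list of (column name, type name) pairs that PySpark's DataFrame.dtypes returns; the helper name has_numeric_column and the set of type names are illustrative and not part of ydata-profiling. It only detects the condition described above and does not fix the underlying error.

```python
from typing import List, Tuple

# Spark SQL type names (as reported by DataFrame.dtypes) treated as numeric here.
# Exact matches are used so that e.g. "interval ..." is not mistaken for "int".
NUMERIC_TYPES = {"tinyint", "smallint", "int", "bigint", "float", "double"}


def has_numeric_column(dtypes: List[Tuple[str, str]]) -> bool:
    """Return True if at least one (name, type) pair has a numeric Spark type."""
    for _, type_name in dtypes:
        # Decimal types carry precision and scale, e.g. "decimal(10,2)".
        if type_name in NUMERIC_TYPES or type_name.startswith("decimal"):
            return True
    return False
```

In a notebook this could be called as has_numeric_column(df1.dtypes) before ProfileReport(df1). The first ASA example above (c1 and c2 only) would return False, and the second (with the integer column c3) would return True.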

Spark Dependencies

ydata_profiling = 4.8.3
numpy = 1.23.5

Spark Pool Settings: [image]

fabclmnt commented 4 months ago

Hi @Fgoudarzi,

Thank you for your request. Have you tried generating the report by following this tutorial? https://www.databricks.com/blog/2023/04/03/pandas-profiling-now-supports-apache-spark.html
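One thing that may be worth trying in a Databricks notebook is rendering the report to HTML explicitly instead of leaving p2 as the last expression in the cell. The sketch below is an untested suggestion, not a confirmed fix for this issue. It assumes the report object exposes to_html() (ProfileReport does) and that the notebook provides a display callable such as Databricks' built-in displayHTML; the helper name show_report is hypothetical.

```python
from typing import Callable, Protocol


class HasToHtml(Protocol):
    """Anything that can render itself to an HTML string, e.g. a ProfileReport."""

    def to_html(self) -> str: ...


def show_report(report: HasToHtml, display: Callable[[str], None]) -> str:
    """Render the report to HTML, pass it to the display callable, and return it."""
    html = report.to_html()
    display(html)
    return html
```

In a Databricks notebook the call would look like show_report(ProfileReport(df1), displayHTML). The display function is passed in as an argument so the helper can also be tried outside Databricks.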