ydataai / ydata-profiling

1 Line of code data quality profiling & exploratory data analysis for Pandas and Spark DataFrames.
https://docs.profiling.ydata.ai
MIT License

Common Values incorrectly reporting (missing) #1429

Open danhosanee opened 1 year ago

danhosanee commented 1 year ago

Current Behaviour

OS: macOS, Python: 3.11, Interface: Jupyter Lab, pip: 22.3.1

dataset

DEPARTURE_DELAY  ARRIVAL_DELAY  DISTANCE  SCHEDULED_DEPARTURE
          -11.0          -22.0      1448  0.08333333333333333
           -8.0           -9.0      2330  0.16666666666666666
           -2.0            5.0      2296   0.3333333333333333
           -5.0           -9.0      2342   0.3333333333333333
           -1.0          -21.0      1448   0.4166666666666667

It appears that when a column has more than 200 distinct values, the Common values section reports a spurious "(Missing)" row, contradicting the Missing and Missing (%) statistics shown for the variable (a sketch of the mechanism follows the screenshot).

(screenshot: the Common values table for SCHEDULED_DEPARTURE, showing a large "(Missing)" count that contradicts the variable's Missing statistics)
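To make the mechanism concrete, here is a minimal sketch in plain pandas (my assumption: the "(Missing)" row is derived as total rows minus the sum of the displayed value counts, so any truncation of the counts surfaces as "missing"):

import pandas as pd

s = pd.Series(range(1000))             # 1000 distinct values, zero nulls
vc = s.value_counts().head(200)        # truncated to 200 entries, like limit(200)
spurious_missing = len(s) - vc.sum()   # 800 rows misreported as "(Missing)"
print(spurious_missing)                # -> 800, although the column has no NaNs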

Expected Behaviour

The "(Missing)" row within Common values should either be removed or the truncated remainder attributed to "Other values" instead.

Data Description

https://github.com/plotly/datasets/blob/master/2015_flights.parquet

Code that reproduces the bug

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from ydata_profiling import ProfileReport
import json

spark = SparkSession.builder.appName("ydata").getOrCreate()

spark_df = spark.read.parquet("ydata-test/2015_flights.parquet")

# Number of non-null rows in the column, computed directly in Spark
n_notnull = spark_df.filter(F.col("SCHEDULED_DEPARTURE").isNotNull()).count()

profile = ProfileReport(spark_df, minimal=True)

# Sum of the value counts the profile actually kept; because of the
# limit(200) in the Spark backend, this is smaller than n_notnull
value_counts_values = sum(
    json.loads(profile.to_json())["variables"]["SCHEDULED_DEPARTURE"]["value_counts_without_nan"].values()
)

missing_common_values = 1650418  # "(Missing)" count as per the HTML report

# The spurious "(Missing)" count is exactly the rows dropped by the truncation
assert missing_common_values == (n_notnull - value_counts_values)
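For completeness, the variable-level missing statistic itself agrees with Spark; only the Common values table is off. A hedged cross-check (assuming the JSON output exposes n_missing for the variable, as in recent 4.x releases):

# Cross-check: the report's variable-level missing count matches Spark,
# so the contradiction is confined to the Common values table.
variable = json.loads(profile.to_json())["variables"]["SCHEDULED_DEPARTURE"]
n_missing_spark = spark_df.filter(F.col("SCHEDULED_DEPARTURE").isNull()).count()
print(variable["n_missing"], n_missing_spark)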

pandas-profiling version

4.5.1

Dependencies

ydata-profiling==4.5.1
pyspark==3.4.1
pandas==2.0.3
numpy==1.23.5

OS

macos


danhosanee commented 8 months ago

@fabclmnt @aquemy - This issue can be resolved by removing the limit(200) from describe_counts_spark.py. I know there are comments in the code regarding performance, but I think usability overrides performance concerns at this point, as the limit is causing the incorrect outputs shown above. Happy to create a PR with unit tests if allowed. A rough sketch of the direction follows.
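For illustration only, this approximates the logic rather than quoting describe_counts_spark.py, so treat the names and structure as assumptions:

from pyspark.sql import functions as F

def spark_value_counts(df, column):
    """Full per-value counts with no limit(); the counts then sum to the
    column's non-null row count, so the derived "(Missing)" row stays correct.
    (Illustrative helper, not the actual ydata-profiling code.)"""
    return (
        df.select(column)
        .where(F.col(column).isNotNull())
        .groupBy(column)
        .count()
        .orderBy(F.col("count").desc())
        .toPandas()
    )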

In terms of improving this, are we open to making a Spark version of freq_table? A rough sketch of what that could look like is below.
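Hypothetical helper, not an existing API; top_n and the row layout are my assumptions. The point is that the truncated remainder is bucketed under "Other values" rather than mislabeled as missing:

from pyspark.sql import functions as F

def spark_freq_table(df, column, top_n=10):
    """Compute the frequency table in Spark, collect only top_n rows, and
    attribute the truncated remainder to "Other values", not "(Missing)"."""
    total = df.count()
    n_missing = df.filter(F.col(column).isNull()).count()
    top = (
        df.where(F.col(column).isNotNull())
        .groupBy(column)
        .count()
        .orderBy(F.col("count").desc())
        .limit(top_n)
        .collect()
    )
    top_sum = sum(row["count"] for row in top)
    rows = [(row[column], row["count"]) for row in top]
    rows.append(("Other values", total - n_missing - top_sum))  # true remainder
    rows.append(("(Missing)", n_missing))
    return rows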