Open danhosanee opened 1 year ago
@fabclmnt @aquemy - This issue can be resolved by removing the `limit(200)` call from `describe_counts_spark.py`. I know there are comments in the code regarding performance, but I think usability overrides performance concerns at this point, since the limit is causing incorrect outputs as shown above. Happy to create a PR with unit tests if allowed?
In terms of improving this, are we open to making a Spark version of `freq_table`?
Current Behaviour
OS: Mac, Python: 3.11, Interface: Jupyter Lab, pip: 22.3.1
(screenshot: dataset)
It appears that when a column's value_counts exceeds 200 distinct values, a spurious "(Missing)" row appears in the Common values section. This contradicts the Missing and Missing (%) main statistics reported for the variable.
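The mechanism can be illustrated with a small, self-contained sketch in plain Python. This is hypothetical data, not the flights dataset; the `limit(200)` in `describe_counts_spark.py` is mimicked by truncating a frequency table to its first 200 rows:

```python
from collections import Counter

# Simulate a column with 300 distinct values (one occurrence each)
# plus 10 genuinely missing entries.
values = [f"v{i}" for i in range(300)] + [None] * 10
non_null = [v for v in values if v is not None]

counts = Counter(non_null)
# Mimic limit(200): keep only the first 200 (value, count) rows
# of the frequency table.
limited = dict(list(counts.items())[:200])

# The truncated table accounts for fewer rows than actually exist,
# so the shortfall shows up as a bogus "(Missing)" count in the
# Common values section.
true_missing = values.count(None)                      # 10
implied_missing = len(values) - sum(limited.values())  # 110
assert implied_missing != true_missing
```

Here the Common values section would imply 110 missing rows while the main statistics correctly report 10, which is exactly the contradiction shown above.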
Expected Behaviour
The "(Missing)" row within Common values should either be removed, or the shortfall should be attributed to the "Other values" row instead.
Data Description
https://github.com/plotly/datasets/blob/master/2015_flights.parquet
Code that reproduces the bug
pandas-profiling version
4.5.1
Dependencies
OS
macos