ydataai / ydata-profiling

1 Line of code data quality profiling & exploratory data analysis for Pandas and Spark DataFrames.
https://docs.profiling.ydata.ai
MIT License
12.37k stars 1.67k forks

open to contributions and a collaboration with pandera? #856

Open cosmicBboy opened 2 years ago

cosmicBboy commented 2 years ago

Hi! First of all, I'm a big fan of the library 🎉. I've been using it myself since its early days.

Missing functionality

pandera is a data validation library that makes it easy to define dataframe types and do run-time validation via type-annotations. It currently does a little bit of data profiling in order to support the schema inference feature, but I think it makes a lot of sense to leverage pandas-profiling's more advanced capabilities on this front (e.g. the statistical summaries, which in theory could be converted into hypothesis tests in pandera).

I was wondering if the pandas-profiling maintainers would be open to a contribution for functionality similar to the Great Expectations integration?

Proposed feature

The user API would be super straightforward:

from pandas_profiling import ProfileReport
df = ...
profile = ProfileReport(df)
pandera_schema = profile.to_pandera_schema()

# validate the data itself
pandera_schema(df)  # should pass

# validate new data
new_df = ...
pandera_schema(new_df)  # may fail

Under the hood, pandera would use the profile.get_description() summary or profile.to_json() to construct a pandera schema, which users could then use directly in their script/notebook, or serialize with schema.to_yaml() or schema.to_script() if they want to reuse the schema in some other process.
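As a rough illustration of that conversion step, here is a minimal sketch using only plain Python. It assumes a simplified, hypothetical per-column summary (the real structure returned by profile.get_description() is richer and laid out differently), and `ColumnSpec`, `specs_from_summary`, and `violations` are made-up names standing in for what would eventually become pandera columns and checks.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional

# Hypothetical, simplified summary: NOT the actual get_description() layout.
EXAMPLE_SUMMARY: Dict[str, Dict[str, Any]] = {
    "age": {"type": "Numeric", "min": 18, "max": 90, "n_missing": 0},
    "city": {"type": "Categorical", "n_missing": 2},
}


@dataclass
class ColumnSpec:
    """Illustrative stand-in for a schema column (name, type, constraints)."""
    name: str
    kind: str
    nullable: bool
    minimum: Optional[float] = None
    maximum: Optional[float] = None


def specs_from_summary(summary: Dict[str, Dict[str, Any]]) -> List[ColumnSpec]:
    """Turn per-column statistics into per-column constraints."""
    return [
        ColumnSpec(
            name=name,
            kind=stats["type"],
            # A column is nullable only if missing values were observed.
            nullable=stats.get("n_missing", 0) > 0,
            minimum=stats.get("min"),
            maximum=stats.get("max"),
        )
        for name, stats in summary.items()
    ]


def violations(specs: List[ColumnSpec], rows: List[Dict[str, Any]]) -> List[str]:
    """Check rows (dicts) against the specs and describe each failure."""
    problems = []
    for spec in specs:
        for i, row in enumerate(rows):
            value = row.get(spec.name)
            if value is None:
                if not spec.nullable:
                    problems.append(f"row {i}: {spec.name} is missing")
                continue
            if spec.minimum is not None and value < spec.minimum:
                problems.append(f"row {i}: {spec.name} below {spec.minimum}")
            if spec.maximum is not None and value > spec.maximum:
                problems.append(f"row {i}: {spec.name} above {spec.maximum}")
    return problems
```

Data that matches the profiled ranges yields no violations, while new data outside the observed min/max or with unexpected missing values is reported, which mirrors the "should pass" / "may fail" behaviour in the proposed API.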

I think it makes sense to implement the parsing/reconstruction logic in pandas-profiling, for two reasons. First, I want to adopt the pattern of converting the visions typeset into the pandera type system (whose types are basically just aliases of numpy/pandas machine types). Second, looking at the Great Expectations integration, it seems like pandas-profiling already has a nice set of abstractions for handling the complexity of converting profiles to a data validation format.
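To make the type-conversion idea concrete, a minimal sketch could be a lookup from inferred type labels to pandas dtype strings. The labels and the chosen dtypes below are assumptions for illustration only, not the actual typeset definitions or the mapping either library would use, and `to_dtype` is a hypothetical helper.

```python
# Assumed type labels mapped to pandas dtype strings (illustrative only).
TYPE_TO_DTYPE = {
    "Numeric": "float64",
    "Categorical": "object",
    "Boolean": "bool",
    "DateTime": "datetime64[ns]",
}


def to_dtype(profile_type: str, default: str = "object") -> str:
    """Return the dtype string for an inferred type, falling back to a default."""
    return TYPE_TO_DTYPE.get(profile_type, default)
```

Falling back to "object" for unknown labels keeps the conversion total, at the cost of losing any richer semantics the inferred type carried.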

On the pandera side, I'd want to add schema.from_profile to be able to read an in-memory or serialized profile (in JSON, for example).
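A sketch of how such an entry point might accept both forms is shown here. It assumes the profile exposes per-column entries under a "variables" key, each with a "type" field; that layout and the helper name `column_types_from_profile` are assumptions for illustration, not a confirmed format or API.

```python
import json
from typing import Any, Dict, Mapping, Union


def column_types_from_profile(profile: Union[str, Mapping[str, Any]]) -> Dict[str, str]:
    """Accept an in-memory mapping or a JSON string and return column -> type."""
    # A string is treated as serialized JSON; anything else as already parsed.
    data = json.loads(profile) if isinstance(profile, str) else profile
    # Assumed layout: {"variables": {column_name: {"type": ...}, ...}}
    return {name: stats["type"] for name, stats in data["variables"].items()}
```

Normalising both inputs to the same mapping up front means the rest of the schema-building logic would not need to care where the profile came from.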

Alternatives considered

Implement the profile -> schema logic in pandera. This is possible, but as I mentioned above, I think using the type abstractions in pandas-profiling would make for a smoother integration.

sbrugman commented 2 years ago

Hi @cosmicBboy, sounds interesting. Let's have a chat on the Slack channel.

maltequast commented 1 year ago

Hi, is there any update on it?

fabclmnt commented 1 year ago

@cosmicBboy Is this still a relevant development? I would be thrilled to discuss ideas.