ydataai / ydata-profiling

1 Line of code data quality profiling & exploratory data analysis for Pandas and Spark DataFrames.
https://docs.profiling.ydata.ai
MIT License
12.21k stars 1.65k forks

Feat: Use ibis as single backend #1552

Open NickCrews opened 4 months ago

NickCrews commented 4 months ago

Missing functionality

I use ibis. I would love to be able to profile Ibis Tables, as I brought up in their issue tracker.

Proposed feature

If ydata-profiling were to support ibis, then since ibis can already handle pandas and Spark DataFrames, the logical thing would be to re-implement all of your core logic in ibis. That would guarantee consistency between the current pandas and Spark implementations (there would only be one implementation!), plus you would get the benefit of supporting all the backends that ibis supports, like SQLite, Polars, BigQuery, Athena, Dask, etc.

Alternatives considered

Convert all of these other dataframe formats to pandas/PySpark, and then use ydata-profiling as it is today. This is hard for larger-than-memory tables.
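To illustrate the larger-than-memory point: a profile that is computed as an aggregate query inside the backend only ever returns one small summary row, whereas converting to pandas first has to pull every row into memory. The sketch below is not ibis code and not part of ydata-profiling. It uses Python's standard-library sqlite3 module to show the pushdown idea, and the `profile_column` helper and the `trips` table are hypothetical names made up for this example.

```python
import sqlite3


def profile_column(conn, table, column):
    """Summarise one column with a single aggregate query (hypothetical helper)."""
    # Identifiers cannot be bound as query parameters, so quote them explicitly.
    tbl = '"' + table.replace('"', '""') + '"'
    col = '"' + column.replace('"', '""') + '"'
    query = f"""
        SELECT COUNT(*)                 AS n_rows,
               COUNT(*) - COUNT({col})  AS n_missing,
               COUNT(DISTINCT {col})    AS n_distinct,
               MIN({col})               AS minimum,
               MAX({col})               AS maximum,
               AVG({col})               AS mean
        FROM {tbl}
    """
    # Only this single summary row leaves the database.
    row = conn.execute(query).fetchone()
    keys = ("n_rows", "n_missing", "n_distinct", "min", "max", "mean")
    return dict(zip(keys, row))


# Tiny in-memory table standing in for a table too large to load into pandas.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (distance REAL)")
conn.executemany(
    "INSERT INTO trips VALUES (?)",
    [(1.0,), (2.0,), (2.0,), (None,), (5.0,)],
)

summary = profile_column(conn, "trips", "distance")
print(summary)
```

An ibis-based implementation would presumably express the same aggregates as ibis expressions and let ibis generate the query for whichever backend holds the table, rather than hand-writing SQL as done here.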

Additional context

I only very briefly browsed through your codebase, so I'm not sure how big of a task this would be.

fabclmnt commented 4 months ago

Hi @NickCrews, this is quite a big task, especially considering that several methods native to both pandas and Spark are used.

Moreover, this task involves reliance on a library that is less established compared to both pandas and PySpark. Historically, adopting less established third-party packages has presented difficulties in maintaining ydata-profiling alongside updates to Python versions.

We will keep this feature request open, considering it for potential future integration, should there be significant interest or demand from the community.

NickCrews commented 4 months ago

Thanks @fabclmnt, those concerns really make sense from a maintainer's point of view. I think this path forward makes sense. In that issue I linked, the ibis maintainers expressed some interest in helping sister projects be more compatible with ibis, so possibly you could offload some of this work to them if you ever wanted to move forward.

Just curious, what are the functionalities that use native pandas and pyspark APIs that ibis doesn't/can't handle? I may write my own simple version of this lib for ibis, and would love to avoid implementing 3/4 of it before I hit some insurmountable brick wall.

deepyaman commented 3 months ago

@fabclmnt Completely agree with @NickCrews; those concerns make sense.

For Pandera, we've aligned on an approach of contributing an Ibis backend (to support a lot of the database backends Ibis natively supports) in addition to having the existing backends for pandas, Polars, and Spark. Rather than refactoring the existing pandas and Spark code in ydata-profiling, would you be open to the contribution of an Ibis backend in the core repo? We could do so in a fork initially.

"Historically, adopting less established third-party packages has presented difficulties in maintaining ydata-profiling alongside updates to Python versions."

With respect to this, Ibis is quite relaxed around how it defines dependencies, and furthermore all of the backend dependencies (e.g. for profiling on Postgres) would be treated as extras (i.e. ydata-profiling[postgres] could depend on ibis-framework[postgres]). Not sure if that alleviates some of the concerns, but happy to hear what thoughts you have!
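A minimal sketch of how that extras mapping could look, assuming a setuptools-style `extras_require` table. The `ydata-profiling[postgres]` extra does not exist today; the extra names, the `IBIS_BACKENDS` tuple and the `build_extras` helper are all hypothetical and only illustrate the idea of forwarding each extra to the matching `ibis-framework` extra.

```python
# Hypothetical: backends a ydata-profiling Ibis integration might expose as extras.
IBIS_BACKENDS = ("postgres", "bigquery", "sqlite")


def build_extras(backends):
    """Map each proposed extra name to the ibis-framework extra of the same name."""
    return {name: [f"ibis-framework[{name}]"] for name in backends}


# This dict is the shape setuptools expects for setup(extras_require=...).
EXTRAS_REQUIRE = build_extras(IBIS_BACKENDS)
print(EXTRAS_REQUIRE)
```

With a table like this, installing the base package would pull in no backend drivers at all; only a user who asks for a specific extra would get that backend's dependencies.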