wesselhuising / pandantic

Enriches the Pydantic BaseModel class by adding the ability to validate dataframes using the schema and custom validators of the same BaseModel class.
https://pandantic-rtd.readthedocs.io

Construct as a Pandas plugin #15

Open xaviernogueira opened 11 months ago

xaviernogueira commented 11 months ago

Hi! So I was thinking of making a very similar project with one core difference: having the validator function as a Pandas plugin that takes a Pydantic BaseModel or Dataclass as an input.

For example:

df.pandantic.validate(schema: pydantic.BaseModel | pydantic.dataclasses.dataclass)

See: https://pandas.pydata.org/docs/development/extending.html
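For concreteness, here is a minimal sketch of what that accessor could look like, built on pandas' documented register_dataframe_accessor extension point. The accessor name pandantic and the row-wise validation loop are illustrative, not an existing API:

import pandas as pd
import pydantic

@pd.api.extensions.register_dataframe_accessor("pandantic")
class PandanticAccessor:
    def __init__(self, pandas_obj: pd.DataFrame) -> None:
        self._df = pandas_obj

    def validate(self, schema: type) -> pd.DataFrame:
        # Row-wise validation: pydantic raises ValidationError on failure.
        for row in self._df.to_dict(orient="records"):
            schema.model_validate(row)
        return self._df

# usage, once the module defining the accessor has been imported:
# df.pandantic.validate(schema=MyModel)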

Wondering what you think about this refactor? I like the idea of being more agnostic to the type of Pydantic schema object being passed in, as Dataclasses are more analogous to a pandas data frame.

Additionally, it allows one to import and use normal Pydantic instead of a wrapper. Normal pandas can be used too, as long as the plugin is imported.

If you are amenable to this idea, I am happy to make a PR. Otherwise I may just make my own project pandas-pydantic. I would keep your logic largely the same, and test whether it works with dataclasses as well.

xaviernogueira commented 11 months ago

Another option would be to create the pandas plugin from a shared set of functions such that either pattern works. This could be a good option to preserve backwards compatibility, and on second thought may be best. Thoughts? @wesselhuising

wesselhuising commented 11 months ago

Hi @xaviernogueira ,

Thank you for your interest and for taking the time to write these suggestions down. I was not aware of pandas' extending functionality, which is indeed nice as it doesn't need to be a fork of either of the two dependencies (currently).

The only challenge I see is that by doing so, you would devote the whole project to just one DataFrame package (in this case pandas). The ambition is to be agnostic to the type of dataframe (say, polars or Spark dataframes) rather than agnostic to the schema object type (in this case dataclasses). How would you suggest approaching this ambition?

xaviernogueira commented 11 months ago

Hi @wesselhuising thanks for the response! That is a valid point; I did not realize that was your roadmap. IMO being agnostic on both sides (schema and data frame) is probably best, mainly because the schema implementations are relatively similar.

Regarding implementation, I still think that inheriting from BaseModel is not the ideal approach. Idk if you are familiar with Python Protocols and dependency injection as concepts, but this is a classic use case for them!

I would start by defining the protocol for both dataclass and BaseModel validation. The way to think about this is that we are defining an interface we can expect. A static type checker will make sure that any class used where one of the protocols is expected contains the matching function signatures (see here).

# shared_types.py ... or something like that
import typing

import pandas as pd
import polars
import pydantic

DataFrameTypes = typing.Union[pd.DataFrame, polars.DataFrame]
# Pydantic dataclasses are ordinary classes after decoration, so plain `type`
# is the closest runtime-valid annotation for them.
SchemaTypes = typing.Union[typing.Type[pydantic.BaseModel], type]

@typing.runtime_checkable  # allows isinstance() checks against the protocol
class SupportsValidation(typing.Protocol):

    def dataclass_validate(self, schema: type, df: DataFrameTypes) -> DataFrameTypes:
        ...

    def model_validate(self, schema: typing.Type[pydantic.BaseModel], df: DataFrameTypes) -> DataFrameTypes:
        ...
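As an aside, the runtime_checkable decorator makes the structural check testable with isinstance(): any class that defines both methods satisfies the protocol. A hypothetical stub to illustrate:

class SomeImplementation:
    def dataclass_validate(self, schema, df):
        ...

    def model_validate(self, schema, df):
        ...

assert isinstance(SomeImplementation(), SupportsValidation)  # passes: both methods exist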

Next, in a different file, I would define a class that is initialized with a schema and takes any dataframe type as an argument to validate(). This class has the responsibility of fetching the correct protocol implementation for each dataframe type. See below.

# validator.py
import dataclasses

import pydantic

from shared_types import (
    DataFrameTypes,
    SchemaTypes,
    SupportsValidation,
)

class DataFrameValidator:
    def __init__(self, schema: SchemaTypes) -> None:
        self.schema = schema

    @property
    def validator_function(self) -> str:
        """Name of the protocol method that matches the schema type."""
        if isinstance(self.schema, type) and issubclass(self.schema, pydantic.BaseModel):
            return "model_validate"
        # Pydantic dataclasses are standard dataclasses under the hood.
        if dataclasses.is_dataclass(self.schema):
            return "dataclass_validate"
        raise TypeError(f"Unsupported schema type: {self.schema!r}")

    @staticmethod
    def get_implementation(df: DataFrameTypes) -> SupportsValidation:
        """Returns the dataframe-library-specific class that meets at least one protocol."""
        ...

    def validate(self, df: DataFrameTypes) -> DataFrameTypes:
        implementation: SupportsValidation = self.get_implementation(df)
        return getattr(implementation, self.validator_function)(self.schema, df)

That would be it, basically! All your existing code would then live in a pandas implementation of the SupportsValidation protocol.
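To make that concrete (the module and class names here are hypothetical), the existing row-wise logic could move into a class that satisfies the protocol:

# pandas_impl.py -- hypothetical module
import pandas as pd

class PandasValidation:
    """Pandas implementation of the SupportsValidation protocol."""

    def model_validate(self, schema, df: pd.DataFrame) -> pd.DataFrame:
        # Validate each row as a dict; pydantic raises ValidationError on failure.
        for row in df.to_dict(orient="records"):
            schema.model_validate(row)
        return df

    def dataclass_validate(self, schema, df: pd.DataFrame) -> pd.DataFrame:
        for row in df.to_dict(orient="records"):
            schema(**row)  # pydantic dataclasses validate on construction
        return df

# usage sketch, assuming get_implementation() maps pd.DataFrame to PandasValidation:
# DataFrameValidator(schema=MyModel).validate(df)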

Advantages:

- Agnostic on both sides: schemas can be BaseModels or dataclasses, and new dataframe libraries only need their own protocol implementation.
- Existing pandas users keep working code, preserving backwards compatibility.
- Normal Pydantic models can be used directly instead of a wrapper subclass.

Thoughts? I am happy to hop on a call with you at some point if you are interested in making this happen. I think this is a very useful library you have, and it deserves to be well-structured for expansion!

wesselhuising commented 11 months ago

Hi @xaviernogueira ,

Thank you again for your in-depth reply. I definitely agree that the tight coupling to Pydantic's BaseModel is not ideal; I wanted to mimic the parse_obj method from their API by creating the parse_df method, but the result is indeed that the package is more like a fork than a stand-alone package. So I am definitely open to a refactor like the one you proposed.

The only thing is that adding a class like DataFrameValidator means an extra import, whereas the current approach only needs one import, since the model is a subclass of BaseModel.

I would like to have a call and pick your brain on this; it is something I think we can look into, as creating another package sounds cumbersome to me. Can you add me on LinkedIn?

xaviernogueira commented 11 months ago

Adding you! Managed to catch covid and feel terrible, so let me get back to you in a few days. @wesselhuising