zeroSteiner / rule-engine

A lightweight, optionally typed expression language with a custom grammar for matching arbitrary Python objects.
https://zerosteiner.github.io/rule-engine/
BSD 3-Clause "New" or "Revised" License
455 stars 54 forks source link

Add dataframe support #85

Closed PascalStehling closed 7 months ago

PascalStehling commented 7 months ago

As described in #84, some changes to support the usage of pandas as datatype.

zeroSteiner commented 7 months ago

Okay, thanks for submitting this to me. I see what you're doing here now and have some thoughts.

  1. I'm not a fan of having pandas as a requirement for everyone. I use this library in a lot of small scripts I use and needing to install another package as a dependency sounds tedious. I think we can try to import it and then guard the references to it so it'll work when present but not be required.
  2. Since the same operators appear to work for the Pandas data types, I'm wondering if we can instead update the types module in key places to know that the Pandas values are compatible.
    1. types.coerce_value probably needs to be updated for the ARRAY type to validate the pandas.Series.
    2. types.DataType.from_type needs to be updated to mark that a pandas.Series is a DataType.ARRAY.
    3. types.DataType.from_value needs to also mark that a pandas.Series value is a DataType.ARRAY. This will also require iterable_member_value_type to be updated to ensure the member type is propagated as well.
    4. Some unit tests should be added to validate that the supported operations all work correctly.

Adding the type information for Pandas to the types module I think will be easier to maintain long term, especially since the AST module is already pretty complicated. It would ensure that as new AST operations are added, they won't necessarily need to account for Pandas specifically but this makes the large assumption that pandas.Series works in the same was as a Python tuple and supports all the same operations. If that assumption is incorrect, then things will be much more complicated because the edge cases will need to be addressed.

PascalStehling commented 7 months ago

Thanks for your Assessment. The pd.Series works a little bit like a tuple, but there is much more stuff you could do with it. Eg it would be super interesting and practical to enable regex search across a whole column. Also beging able to do all types of Arithmetic across the column could be quite helpful.

But as you already wrote (and I understand the codebase), there are a lot of places were changes would be necessary, which would not make the Code easier to read.

Maybe other tools would be a better fit, instead of trying to force push such a big framework into a codebase that was not really designed for doing such stuff.

But thanks again for taking your time :)