zeroSteiner / rule-engine

A lightweight, optionally typed expression language with a custom grammar for matching arbitrary Python objects.
https://zerosteiner.github.io/rule-engine/
BSD 3-Clause "New" or "Revised" License
455 stars, 54 forks

Adding Pandas Dataframe as Datatype? #84

Closed by PascalStehling 6 months ago

PascalStehling commented 7 months ago

Hi,

Firstly, thanks for the nice work so far; I just found this yesterday :)

I really like the idea of having a query engine to filter through piles of tabular data (like CSV). The problem is that performance gets quite bad if the data is big enough and there are a lot of rules (in my case as a data engineer, 100,000 to millions of rows and multiple dozens of rules).

So I was wondering whether it would be useful to integrate pandas (the de facto dataframe standard for Python) into your query engine. With that, a lot of operations could be run not just on each row via a dict, but on the whole column at once, via vectorization.

I played a little with your code and got a very simple version working (see screenshot), with just a few operators (equal, not equal, and, or).

image

The resulting array of bools can be used directly on the dataframe to just get the rows that are needed or perform other work.
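To illustrate the general idea (this is only a minimal pandas sketch, not the code from the screenshot or the draft PR, and the column names and data are made up), a rule such as `age == 31 and city != "Berlin"` can be evaluated either once per row on a dict or once per column on the whole dataframe:

```python
import pandas as pd

# Hypothetical sample data; the column names are assumptions for illustration.
df = pd.DataFrame({
    "name": ["alice", "bob", "carol", "dave"],
    "age": [31, 25, 31, 40],
    "city": ["Berlin", "Paris", "Paris", "Berlin"],
})

# Row-by-row evaluation: the predicate runs once per row on a plain dict.
def row_matches(row):
    return row["age"] == 31 and row["city"] != "Berlin"

row_mask = [row_matches(row) for row in df.to_dict(orient="records")]

# Vectorized evaluation: each comparison runs on a whole column at once,
# and `&` combines the resulting boolean Series element-wise.
vector_mask = (df["age"] == 31) & (df["city"] != "Berlin")

# The boolean Series can be used directly to select the matching rows.
matching_rows = df[vector_mask]
```

Both approaches produce the same mask; the vectorized one avoids calling Python code once per row, which is presumably where the speedup would come from on large inputs.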

I also ran a small benchmark with 400,000 rows, and the results show a speedup of more than 30,000 times (see second screenshot).

image

Now to my question: are you interested in an implementation like this? If so, would you be willing to do it yourself? Otherwise I can try (I still need to understand the codebase better, having only worked on it for ~2 hrs or so) and make a pull request when I'm finished, if that is OK with you and you have the time and interest to review it and help improve it. Also, if you see any major problems with doing it, please tell me.

Functionality-wise, nearly everything you can do with the current types can also be done with pandas. There are also methods for string search on whole arrays and the like.

Thanks for reading until the end, and have a nice day. Best, Pascal :)

zeroSteiner commented 7 months ago

I would potentially be interested in this. I don't personally use Pandas but I've definitely heard of it and I can understand the value in performance and compatibility with such a popular library.

It doesn't exactly sound like this would truly be a new rule engine data type; those are a lot of work to write. It sounds like this would likely involve some updates in key places to allow mapping Pandas values as inputs, the way native Python values currently are.

Would you mind sharing the edits you've made thus far in a patch, a draft PR, or something similar so I could take a look at the direction you took to achieve the performance improvements you've already shown?

PascalStehling commented 7 months ago

I created a draft PR :)

For a lot of operands it should be fairly trivial to add. Your opinion on the code provided would really interest me.

zeroSteiner commented 7 months ago

Thanks for sending it to me. I'll try to take some time to review it this weekend.