world-federation-of-advertisers / cardinality_estimation_evaluation_framework

Evaluation framework and methods for estimating cardinalities of groups of sets
Apache License 2.0
22 stars 9 forks source link

Remove pickle. #75

Closed huangxichen1 closed 4 years ago

huangxichen1 commented 4 years ago

Reasons to remove pickle:

  1. It is not used by the output.
  2. Pickle can only serialize the top level class and functions. And in this code base, we are using factory methods a lot, and it is easy to get errors. For example, the stratified sketch VoC and stratified sketch ADBF use sketch_factory, so they are not compatible with pickle. The error message looks like:
    AttributeError: Can't pickle local object 'VectorOfCounts.get_sketch_factory.<locals>.f'

    where <locals> is a class referred by the get_sketch_factory method.

jgoodknight commented 4 years ago

This is interesting... why wasn't it a problem before? I also believe this is why Preston used the pathos multiprocessing library instead of Python's built-in which has to pickle everything it calculates. There's a library in that package called dill that will do the same thing as pickle but works on lambdas you could use instead of getting rid of the pickling if we need it

huangxichen1 commented 4 years ago

This is interesting... why wasn't it a problem before? I also believe this is why Preston used the pathos multiprocessing library instead of Python's built-in which has to pickle everything it calculates. There's a library in that package called dill that will do the same thing as pickle but works on lambdas you could use instead of getting rid of the pickling if we need it

The reason that we didn't see this problem before is that we try our best to avoid lower level class calls. For example, I made some change to the Liquid Legion code to make the LiquidLegion pickle-able.

Yet, it becomes more difficult when we do the frequency evaluation. The frequency estimators may depend on the cardinality estimators and use get_sketch_factory method of the cardinality estimators, which will cause the issue that the class is called on a lower level. For example, the stratified sketch uses cardinality_sketch_factory. Let's say we use VoC, and the pickle will raise AttributeError: Can't pickle local object 'VectorOfCounts.get_sketch_factory.<locals>.f'.

Let me try the dill, and report back if it is possible.

huangxichen1 commented 4 years ago

This is interesting... why wasn't it a problem before? I also believe this is why Preston used the pathos multiprocessing library instead of Python's built-in which has to pickle everything it calculates. There's a library in that package called dill that will do the same thing as pickle but works on lambdas you could use instead of getting rid of the pickling if we need it

The reason that we didn't see this problem before is that we try our best to avoid lower level class calls. For example, I made some change to the Liquid Legion code to make the LiquidLegion pickle-able.

Yet, it becomes more difficult when we do the frequency evaluation. The frequency estimators may depend on the cardinality estimators and use get_sketch_factory method of the cardinality estimators, which will cause the issue that the class is called on a lower level. For example, the stratified sketch uses cardinality_sketch_factory. Let's say we use VoC, and the pickle will raise AttributeError: Can't pickle local object 'VectorOfCounts.get_sketch_factory.<locals>.f'.

Let me try the dill, and report back if it is possible.

Joey: I just tested that the dill library can pickle the configs, which is promising.

So now there are two options:

  1. Change pickle to dill, in case we need the serialized objects in the future.
  2. Remove pickle, as it is not used by any other modules in the framework.

@all, please let me know which one you prefer.