A collection of processors to clean-up field values

BurnzZ commented 2 years ago

Stemming off from the discussion in https://github.com/zytedata/zyte-common-items/pull/15.

We need a library of functions that can be used to preprocess the data values placed into the item. Ideally, this should be used by https://github.com/scrapinghub/web-poet fields. For example:

from web_poet import ItemPage, field
from zyte_common_items import Product
from zyte_common_items.processors import clean_str

class ProductPage(ItemPage[Product]):
    @field
    @clean_str
    def name(self):
        return "  some value \n\n\t"

page = ProductPage()
assert page.name == "some value"
assert page.name == page.to_item().name

A guideline in web-poet should be created as well to properly document the processors found in zyte-common-items.

kmike commented 2 years ago

I think we have several options on how to implement it.

Option 1: Decorators

The idea is that all processing logic would be implemented as decorators, which you can use to decorate extraction methods (later to be decorated with @field). Example:

    @field
    @clean_str
    def name(self):
        return "  some value \n\n\t"

Pro:

easy to use
decorators with parameters can be defined
it's possible to combine several decorators
can be used on methods which are not @field

Cons:

more complex implementation; every processor needs to be implemented as a decorator
these decorators must handle async def code as well, so it's even more complicated implementation
code is harder to use if a decorator API is not desirable
almost always you'd need @field decorator
if several decorators are applied, it's not always obvious in which order they'll be called
one need to remember that @field decorator should be at the top (not a big deal though)

Option 1.1: Field decorators

Extraction decorators can be pre-decorated with @field:

    @clean_str_field
    def name(self):
        return "  some value \n\n\t"

# or
from zyte_common_items import fields

# ...
    @fields.clean_str
    def name(self):
        return "  some value \n\n\t"

# or
from zyte_common_items.product import fields

# ...
    @fields.name
    def name(self):
        return "  some value \n\n\t"

It's very similar to (1. Decorators) approach. Pros and Cons compared to (1):

Pro:

easier to use, no need to use @field decorator
no need to recall the order of @field vs other decorators

Cons:

harder to combine several processing decorators, not possible with naive implementation
more complex implementation
can't be used on regular methods, which are not fields

To be continued with other options :)

kmike commented 2 years ago

Option 2. Regular functions

One can write regular processing functions, and use them in the fields:

    @field
    def name(self):
        return clean_str("  some value \n\n\t")

Pros:

Implementation is very straightforward
It's possible to combine the functions in arbitrary way
Easy to reuse in all contexts
No need to worry about sync/async semantics

Cons:

If there are several "return points", more code changes might be required, to ensure cleaning is always applied
It might be easier to forget to call a function
It is less clear, from the field definition, what kind of cleaning is performed, as the cleaning code is mixed with extraction code

Option 3: `@field` decorator with processors support

We can modify @field decorator to support output processors:

    @field(out=clean_str)
    def name(self):
        return "  some value \n\n\t"

# or multiple processors:

    @field(out=[clean_str, str.title])
    def name(self):
        return "  some value \n\n\t"

Pro:

Processors are still regular functions, so all the pros of regular functions apply
Processing logic is not mixed with the extraction logic, similar to decorators
No confusion about order of @field decorator vs processing decorators

Cons:

To use parameters with processing functions one need to use functools.partial
If multiple processors are needed, a list may look cluttered (not sure)
Typing may be more tricky to get right (though not sure about it)
We'd need to think how to handle sync vs async processing functions. Probably, only support sync.

Option 4: field output processors + recommend some functional programming library

Instead of supporting a list in field processors, we can recommend using some FP library:

from toolz.functoolz import compose_left
# ...

    @field(out=compose_left(clean_str, str.title))
    def name(self):
        return "  some value \n\n\t"

Pros:

The same as in Option 3.
Instead of functools.partial we might promote currying, which makes for a nicer API for passing parameters:

from toolz.functoolz import curry

@curry
def clean_str(input, normalize_space=True):
    # ...

# page objects
from toolz.functoolz import compose_left

# ...

    @field(out=compose_left(clean_str, str.title))
    def name2(self):
        return "  some value \n\n\t"

    # no need to use functools.partial, this works
    @field(out=compose_left(clean_str(normalize_space=False), str.title))
    def name(self):
        return "  some value \n\n\t"

Cons:

Not sure what's the advantage of promoting compose/compose_left, over supporting a list.

To be continued.

kmike commented 2 years ago

Option 5: combine Option 3 and Option 1.1

We can implement Option 3, and provide some helpers for creation of combined "fields + processing" decorators. In the simplest case, stdlib is enough:

from functools import partial
from web_poet import field

name_field = partial(field, out=[clean_str])

# ...
    # parameters work
    @name_field(cached=True)
    def name(self):
        return "  some value \n\n\t"

We may have something more advanced to support customizing out better, but naive implementation could be enough: name_field(out=...) replaces default out. An argument in favor of doing something custom with out (e.g. appending or prepending to the default processing) is that name_field(out=...) might be useless, one can just use use field(out=...). But we might add something else to the field, e.g. cache some fields by default, or pass some meta parameter.

I'm currently in favor of going with Option 5, because

Writing a processor function is straightforward, it's a regular Python function. Testing it is straightforward as well. No new concepts. I think that's important, because developers shouldn't be only using the standard processing functions provided by libraries, they should also write their own processing code.
Using custom processing functions is straightforward, use @field(out=...).
It's possible to provide shortcuts, so common cases can be optimized from the usage point of view. We don't need to settle on the shortcuts from the day 1, it can be done iteratively.

Gallaecio commented 2 years ago

+1 to 3, +0.5 to 5.

I find @field(cache=True, out=[clean_str]) more readable than @name_field(cached=True), I think it makes it more obvious that some cleanup is being done, and which one.

If a long out value is the issue, I would suggest a slightly different approach:

clean_name = [clean_str]

@field(cache=True, out=clean_name)
def name(self):
    pass

kmike commented 1 year ago

Closing this, as processors feature is implemented in web-poet, and zyte-common-items already gained some built-in processors.

zytedata / zyte-common-items