zytedata / zyte-common-items

Contains the common item definitions used in Zyte.
BSD 3-Clause "New" or "Revised" License

A collection of processors to clean-up field values #18

Closed BurnzZ closed 1 year ago

BurnzZ commented 2 years ago

Stemming from the discussion in https://github.com/zytedata/zyte-common-items/pull/15.


We need a library of functions that can be used to preprocess the data values placed into the item. Ideally, this should be used by https://github.com/scrapinghub/web-poet fields. For example:

from web_poet import ItemPage, field
from zyte_common_items import Product
from zyte_common_items.processors import clean_str

class ProductPage(ItemPage[Product]):
    @field
    @clean_str
    def name(self):
        return "  some value \n\n\t"

page = ProductPage()
assert page.name == "some value"
assert page.name == page.to_item().name

A guideline in web-poet should be created as well to properly document the processors found in zyte-common-items.

kmike commented 2 years ago

I think we have several options on how to implement it.

Option 1: Decorators

The idea is that all processing logic would be implemented as decorators, which you can use to decorate extraction methods (later to be decorated with @field). Example:

    @field
    @clean_str
    def name(self):
        return "  some value \n\n\t"
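
For illustration, a decorator of this kind could look roughly like the sketch below. The body of clean_str is an assumption (the thread only shows that surrounding whitespace is removed), and property stands in for web-poet's @field so that the snippet runs without web-poet installed.

```python
import functools


def clean_str(method):
    """Hypothetical processor decorator: clean the string returned by `method`."""

    @functools.wraps(method)
    def wrapper(self, *args, **kwargs):
        value = method(self, *args, **kwargs)
        # str.split() without arguments splits on any run of whitespace,
        # so joining with a single space both trims and collapses it.
        return " ".join(value.split())

    return wrapper


class ProductPage:
    # `property` is only a stand-in for web-poet's @field in this sketch.
    @property
    @clean_str
    def name(self):
        return "  some value \n\n\t"
```

Note that the processing logic lives inside a wrapper here, so testing it in isolation requires decorating a dummy method first.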

Pro:

Cons:

Option 1.1: Field decorators

Extraction decorators can be pre-decorated with @field:

    @clean_str_field
    def name(self):
        return "  some value \n\n\t"

# or
from zyte_common_items import fields

# ...
    @fields.clean_str
    def name(self):
        return "  some value \n\n\t"

# or
from zyte_common_items.product import fields

# ...
    @fields.name
    def name(self):
        return "  some value \n\n\t"
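
A pre-decorated field decorator such as clean_str_field could, under the same assumptions, be a thin wrapper that applies the processor decorator and then the field decorator. In this sketch, field is a hypothetical stand-in built on property, not web-poet's implementation.

```python
import functools


def field(method):
    """Stand-in for web-poet's @field, used only to keep the sketch runnable."""
    return property(method)


def clean_str(method):
    """Hypothetical processor decorator that trims and collapses whitespace."""

    @functools.wraps(method)
    def wrapper(self):
        return " ".join(method(self).split())

    return wrapper


def clean_str_field(method):
    """Pre-decorated variant: apply the processor first, then @field."""
    return field(clean_str(method))


class ProductPage:
    @clean_str_field
    def name(self):
        return "  some value \n\n\t"
```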

It's very similar to the Option 1 (Decorators) approach. Pros and cons compared to Option 1:

Pro:

Cons:

To be continued with other options :)

kmike commented 2 years ago

Option 2: Regular functions

One can write regular processing functions, and use them in the fields:

    @field
    def name(self):
        return clean_str("  some value \n\n\t")
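
A minimal sketch of such a regular processing function is shown below. The normalize_space parameter mirrors the signature that appears later in this thread; what it actually does is an assumption made for this example.

```python
def clean_str(value: str, normalize_space: bool = True) -> str:
    """Hypothetical string processor.

    With normalize_space=True, inner whitespace runs are collapsed to a
    single space; otherwise only leading/trailing whitespace is removed.
    """
    if normalize_space:
        return " ".join(value.split())
    return value.strip()
```

Since it is a plain function, it can be tested directly, for example with `assert clean_str("  some value \n\n\t") == "some value"`.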

Pros:

Cons:

Option 3: @field decorator with processors support

We can modify the @field decorator to support output processors:

    @field(out=clean_str)
    def name(self):
        return "  some value \n\n\t"

# or multiple processors:

    @field(out=[clean_str, str.title])
    def name(self):
        return "  some value \n\n\t"
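
To make the idea concrete, here is a rough sketch of how a processor-aware decorator might apply out. The field below is a hypothetical stand-in built on property, not web-poet's code, and the choice to accept either a single callable or a list is an assumption of this example.

```python
from typing import Any, Callable, Iterable, Union

Processor = Callable[[Any], Any]


def field(method=None, *, out: Union[Processor, Iterable[Processor], None] = None):
    """Hypothetical stand-in for a @field that supports output processors."""
    if out is None:
        processors = []
    elif callable(out):
        processors = [out]  # a single processor
    else:
        processors = list(out)  # several processors

    def decorator(func):
        def getter(self):
            value = func(self)
            for processor in processors:  # applied left to right
                value = processor(value)
            return value

        return property(getter)

    # Support both the bare @field form and the @field(out=...) form.
    return decorator if method is None else decorator(method)


def clean_str(value: str) -> str:
    return " ".join(value.split())


class ProductPage:
    @field(out=[clean_str, str.title])
    def name(self):
        return "  some value \n\n\t"

    @field(out=clean_str)
    def brand(self):
        return " acme "

    @field
    def sku(self):
        return "A-1"
```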

Pro:

Cons:

Option 4: field output processors + recommend some functional programming library

Instead of supporting a list in field processors, we can recommend using some FP library:

from toolz.functoolz import compose_left
# ...

    @field(out=compose_left(clean_str, str.title))
    def name(self):
        return "  some value \n\n\t"

Pros:

from toolz.functoolz import curry

@curry
def clean_str(input, normalize_space=True):
    # ...

# page objects
from toolz.functoolz import compose_left

# ...

    @field(out=compose_left(clean_str, str.title))
    def name2(self):
        return "  some value \n\n\t"

    # no need to use functools.partial, this works
    @field(out=compose_left(clean_str(normalize_space=False), str.title))
    def name(self):
        return "  some value \n\n\t"
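
For comparison, the same composition can be sketched with the standard library only. The compose_left below is a hand-written approximation of the toolz helper of the same name, functools.partial takes the place of currying, and the clean_str body is an assumption.

```python
from functools import partial, reduce


def compose_left(*funcs):
    """Stdlib sketch: apply `funcs` to a value from left to right."""

    def composed(value):
        return reduce(lambda acc, func: func(acc), funcs, value)

    return composed


def clean_str(value, normalize_space=True):
    """Hypothetical processor; see the normalize_space assumption above."""
    if normalize_space:
        return " ".join(value.split())
    return value.strip()


# Default cleaning followed by title-casing.
process_name = compose_left(clean_str, str.title)

# Without currying, a keyword argument is bound via functools.partial.
process_raw = compose_left(partial(clean_str, normalize_space=False), str.title)
```

Either composed callable could then be passed as out to a processor-aware field decorator.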

Cons:

To be continued.

kmike commented 2 years ago

Option 5: combine Option 3 and Option 1.1

We can implement Option 3, and provide some helpers for creation of combined "fields + processing" decorators. In the simplest case, stdlib is enough:

from functools import partial
from web_poet import field

name_field = partial(field, out=[clean_str])

# ...
    # parameters work
    @name_field(cached=True)
    def name(self):
        return "  some value \n\n\t"

We may add something more advanced later to customize out better, but a naive implementation could be enough: name_field(out=...) replaces the default out. An argument in favor of doing something custom with out (e.g. appending or prepending to the default processing) is that name_field(out=...) might otherwise be useless, since one can just use field(out=...). But we might add something else to the field, e.g. cache some fields by default, or pass some meta parameter.
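
The "appending to the default processing" variant could be sketched as follows. Both field (a stand-in built on property) and field_with_defaults are hypothetical names introduced for this example; neither is part of web-poet or zyte-common-items.

```python
def field(method=None, *, cached=False, out=None):
    """Hypothetical stand-in for a processor-aware @field."""
    processors = list(out or [])

    def decorator(func):
        def getter(self):
            value = func(self)
            for processor in processors:
                value = processor(value)
            return value

        # `cached` is accepted for signature parity but ignored in this sketch.
        return property(getter)

    return decorator if method is None else decorator(method)


def field_with_defaults(default_out, **defaults):
    """Build a field decorator whose `out` extends the defaults instead of replacing them."""

    def factory(method=None, *, out=None, **overrides):
        combined = list(default_out) + list(out or [])
        return field(method, out=combined, **{**defaults, **overrides})

    return factory


def clean_str(value):
    return " ".join(value.split())


name_field = field_with_defaults([clean_str], cached=True)


class ProductPage:
    # str.title runs after the default clean_str.
    @name_field(out=[str.title])
    def name(self):
        return "  some value \n\n\t"

    # The bare form uses only the default processors.
    @name_field
    def brand(self):
        return " acme  corp "
```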

I'm currently in favor of going with Option 5, because:

  1. Writing a processor function is straightforward: it's a regular Python function. Testing it is straightforward as well. No new concepts. I think that's important, because developers shouldn't only be using the standard processing functions provided by libraries; they should also write their own processing code.
  2. Using custom processing functions is straightforward: use @field(out=...).
  3. It's possible to provide shortcuts, so common cases can be optimized from the usage point of view. We don't need to settle on the shortcuts from day 1; it can be done iteratively.

Gallaecio commented 2 years ago

+1 to 3, +0.5 to 5.

I find @field(cached=True, out=[clean_str]) more readable than @name_field(cached=True); I think it makes it more obvious that some cleanup is being done, and which one.

If a long out value is the issue, I would suggest a slightly different approach:

clean_name = [clean_str]

@field(cached=True, out=clean_name)
def name(self):
    pass

kmike commented 1 year ago

Closing this, as the processors feature is implemented in web-poet, and zyte-common-items has already gained some built-in processors.