BurnzZ closed this issue 1 year ago.
I think we have several options on how to implement it.
The idea is that all processing logic would be implemented as decorators, which you can use to decorate extraction methods (later to be decorated with @field). Example:
@field
@clean_str
def name(self):
    return " some value \n\n\t"
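To make the trade-off concrete, here is a minimal sketch of what such a processing decorator could look like. The clean_str implementation is an assumption (the thread does not define it), and @field is left out so the sketch runs with the stdlib only. It has to branch on sync vs. async methods, which is the extra complexity mentioned in the cons.

```python
import asyncio
import functools
import inspect


def clean_str(method):
    """Hypothetical processing decorator: collapse whitespace in the result."""

    def _clean(value):
        # str.split() with no arguments splits on any run of whitespace
        return " ".join(value.split())

    if inspect.iscoroutinefunction(method):
        # async extraction methods need their own wrapper that awaits the result
        @functools.wraps(method)
        async def async_wrapper(self):
            return _clean(await method(self))

        return async_wrapper

    @functools.wraps(method)
    def wrapper(self):
        return _clean(method(self))

    return wrapper


class ProductPage:
    # In the proposal, @field would sit on top of @clean_str; omitted here.
    @clean_str
    def name(self):
        return " some value \n\n\t"

    @clean_str
    async def brand(self):
        return "\tsome   brand\n"
```

Every processing decorator written this way would need the same sync/async branching, unless it is factored out into a shared helper.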
Pro:
- @field
Cons:
- async def code as well, so it's even more complicated implementation
- @field decorator
- @field decorator should be at the top (not a big deal though)

Extraction decorators can be pre-decorated with @field:
@clean_str_field
def name(self):
    return " some value \n\n\t"

# or
from zyte_common_items import fields
# ...
@fields.clean_str
def name(self):
    return " some value \n\n\t"

# or
from zyte_common_items.product import fields
# ...
@fields.name
def name(self):
    return " some value \n\n\t"
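A rough sketch of how a pre-decorated field could be built, assuming a decorator-style clean_str. Here the built-in property stands in for web_poet.field so the example has no dependencies; both clean_str and clean_str_field are hypothetical.

```python
import functools

# Stand-in for web_poet.field so the sketch runs with the stdlib only.
field = property


def clean_str(method):
    """Hypothetical processing decorator: collapse whitespace in the result."""

    @functools.wraps(method)
    def wrapper(self):
        return " ".join(method(self).split())

    return wrapper


def clean_str_field(method):
    """Hypothetical pre-decorated field: apply the processing, then the field."""
    return field(clean_str(method))


class ProductPage:
    @clean_str_field
    def name(self):
        return " some value \n\n\t"
```

The decorator order is fixed inside clean_str_field, so the user cannot put @field in the wrong place.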
It's very similar to the (1. Decorators) approach. Pros and cons compared to (1):
Pro:
- @field decorator
- @field vs other decorators
Cons:

To be continued with other options :)
One can write regular processing functions, and use them in the fields:
@field
def name(self):
    return clean_str(" some value \n\n\t")
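For reference, a plain-function clean_str could be as small as the sketch below. The exact behavior (strip the ends and collapse inner whitespace) is an assumption, since the thread never defines it.

```python
def clean_str(value: str) -> str:
    """Hypothetical processor: strip the ends and collapse whitespace runs."""
    # str.split() with no arguments drops leading/trailing whitespace
    # and splits on any run of spaces, tabs or newlines.
    return " ".join(value.split())


cleaned = clean_str(" some value \n\n\t")
```

Such a function has no dependency on page objects, so it can be unit-tested and reused outside of fields.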
Pros:
Cons:

@field decorator with processors support

We can modify the @field decorator to support output processors:
@field(out=clean_str)
def name(self):
    return " some value \n\n\t"

# or multiple processors:
@field(out=[clean_str, str.title])
def name(self):
    return " some value \n\n\t"
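A minimal sketch of how a field decorator could apply output processors. This is a stand-in built on property, not web-poet's actual implementation; accepting either a single callable or a list for out mirrors the two examples above and is an assumption.

```python
import functools


def field(method=None, *, out=None):
    """Stand-in field decorator with output processors (not web-poet's code)."""
    if out is None:
        processors = []
    elif callable(out):
        processors = [out]
    else:
        processors = list(out)

    def decorator(func):
        @functools.wraps(func)
        def getter(self):
            value = func(self)
            # Processors run left to right on the extracted value.
            for processor in processors:
                value = processor(value)
            return value

        return property(getter)

    # Support both bare @field and parametrized @field(out=...)
    return decorator if method is None else decorator(method)


def clean_str(value):
    return " ".join(value.split())


class ProductPage:
    @field
    def raw_name(self):
        return " some value \n\n\t"

    @field(out=clean_str)
    def name(self):
        return " some value \n\n\t"

    @field(out=[clean_str, str.title])
    def title(self):
        return " some value \n\n\t"
```

With this shape, the processing functions stay plain functions, and only the field decorator needs to know about sync vs. async methods.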
Pro:
- @field decorator vs processing decorators
Cons:

Instead of supporting a list in field processors, we can recommend using some FP library:
from toolz.functoolz import compose_left
# ...
@field(out=compose_left(clean_str, str.title))
def name(self):
    return " some value \n\n\t"
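To show what compose_left does here without requiring toolz, the following sketch approximates it with the stdlib. It only handles one-argument functions, and clean_str is again a hypothetical processor.

```python
from functools import reduce


def compose_left(*funcs):
    """Stdlib approximation of toolz's compose_left for one-argument functions."""

    def composed(value):
        # Apply funcs in the order given: funcs[0] first, funcs[-1] last.
        return reduce(lambda acc, func: func(acc), funcs, value)

    return composed


def clean_str(value):
    return " ".join(value.split())


process_name = compose_left(clean_str, str.title)
result = process_name(" some value \n\n\t")
```

The composed callable is a single processor, so a field that accepts only one out callable would be enough.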
Pros:
from toolz.functoolz import curry

@curry
def clean_str(input, normalize_space=True):
    ...  # implementation omitted

# page objects
from toolz.functoolz import compose_left
# ...
@field(out=compose_left(clean_str, str.title))
def name2(self):
    return " some value \n\n\t"

# no need to use functools.partial, this works
@field(out=compose_left(clean_str(normalize_space=False), str.title))
def name(self):
    return " some value \n\n\t"
Cons:
To be continued.
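For comparison with the curried version, this is roughly what binding a processor option looks like with functools.partial alone. The normalize_space semantics used here (False means only strip the ends) are an assumption.

```python
from functools import partial


def clean_str(value, normalize_space=True):
    """Hypothetical processor; normalize_space=False only strips the ends."""
    if normalize_space:
        return " ".join(value.split())
    return value.strip()


# Without currying, the option has to be bound explicitly before the
# processor can be used as a one-argument callable.
strip_only = partial(clean_str, normalize_space=False)

default_result = clean_str(" some   value \n")
strip_result = strip_only(" some   value \n")
```

The partial version is more verbose but adds no dependency, which may matter when deciding whether to recommend an FP library.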
We can implement Option 3, and provide some helpers for creation of combined "fields + processing" decorators. In the simplest case, stdlib is enough:
from functools import partial
from web_poet import field

name_field = partial(field, out=[clean_str])
# ...
# parameters work
@name_field(cached=True)
def name(self):
    return " some value \n\n\t"
We may have something more advanced to support customizing out better, but a naive implementation could be enough: name_field(out=...) replaces the default out. An argument in favor of doing something custom with out (e.g. appending or prepending to the default processing) is that name_field(out=...) might be useless, since one can just use field(out=...). But we might add something else to the field, e.g. cache some fields by default, or pass some meta parameter.
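A sketch of the "something more advanced" variant, where extra processors are appended to the defaults instead of replacing them. Everything here is hypothetical: field is a stdlib stand-in rather than web-poet's decorator (it accepts cached but ignores it), and make_field and extra_out are names invented for illustration.

```python
import functools


def field(method=None, *, cached=False, out=None):
    """Stand-in field decorator (not web-poet's code); `cached` is ignored."""
    processors = list(out or [])

    def decorator(func):
        @functools.wraps(func)
        def getter(self):
            value = func(self)
            for processor in processors:
                value = processor(value)
            return value

        return property(getter)

    return decorator if method is None else decorator(method)


def make_field(default_out):
    """Hypothetical helper: `extra_out` is appended to the default processors."""

    def custom_field(method=None, *, extra_out=(), **kwargs):
        return field(method, out=[*default_out, *extra_out], **kwargs)

    return custom_field


def clean_str(value):
    return " ".join(value.split())


name_field = make_field([clean_str])


class ProductPage:
    @name_field
    def name(self):
        return " some value \n\n\t"

    # Default cleanup still runs; str.title is applied afterwards.
    @name_field(extra_out=[str.title], cached=True)
    def title(self):
        return " some value \n\n\t"
```

Compared to partial(field, out=[clean_str]), this keeps the default processing in place when a page object needs one extra step.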
I'm currently in favor of going with Option 5, because
- @field(out=...).

+1 to 3, +0.5 to 5.
I find @field(cached=True, out=[clean_str]) more readable than @name_field(cached=True). I think it makes it more obvious that some cleanup is being done, and which one.
If a long out value is the issue, I would suggest a slightly different approach:
clean_name = [clean_str]

@field(cached=True, out=clean_name)
def name(self):
    pass
Closing this, as the processors feature is implemented in web-poet, and zyte-common-items has already gained some built-in processors.
Stemming off from the discussion in https://github.com/zytedata/zyte-common-items/pull/15.
We need a library of functions that can be used to preprocess the data values placed into the item. Ideally, this should be used by https://github.com/scrapinghub/web-poet fields. For example:
A guideline in web-poet should be created as well to properly document the processors found in zyte-common-items.