-
Notifications
You must be signed in to change notification settings - Fork 15
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allowing subfields to be extracted #115
Comments
Should we consider making an API change to the page object API? I am thinking that if we put the field-filtering logic in |
Thanks @Gallaecio . Following your suggestion alongside @kmike's (during our call), we can compare the following approaches. First, let's assume we have the following items: @attrs.define
class Item:
x: int
y: Optional[int] = None
z: Optional[int] = None
@attrs.define
class SmallItem:
"""Doesn't have the required 'x' field."""
y: Optional[int] = None
z: Optional[int] = None Approach 1. Modifying
|
The code shouldn't break even if scrapy-poet user decides to exclude fields. I.e. we should support a case where page object has to_item method without parameters, but user requests some fields to be excluded. You can implement it by checking the signature, and not passing the parameters if they're not supported. The issue is that it's more code, which is not needed if we use dependency approach.
Does it mean you can either pass include fields, or exclude fields, but not both at the same time? It also seems it might be not very convenient to write code which uses select_fields: def to_item(self):
item = {}
if not (isinstance(self.select_fields, ExcludeFields) and "price" in self.select_fields):
# ... extract price
# fixme: IncludeFields are not handled.
item["price"] = self.css(...)
... Handling both may be even trickier; you also need to handle a self.select_fields=None case. Could you please try writing an example? How can we make it more manageable? I was thinking about using a single dependency for it: select_fields: IncludeExcludeOptions
# ...
def to_item(self):
item = {}
if self.select_fields.not_excluded("price"):
# this should handle both include and exclude options; also, None handling is not needed
item["price"] = self.css(...) |
Yes. Though we can include the possibility of What do you think about this modified Approach 2? @attrs.define
class SelectFields:
# could be Union[Iterable[str], str] to support a terse usage.
include: Optional[Iterable[str]] = None
exclude: Optional[Iterable[str]] = None
# Compatible with the default skip_nonitem_fields=False
on_unknown_field: Literal["ignore", "warn", "raise"] = "raise"
swap_item_cls: Optional[Type] = None
@attrs.define
class SomePage(WebPage[Item]):
select_fields: Optional[SelectFields] = None
def to_item(self):
...
page = SomePage(response, select_fields=SelectFields(include=["x", "y"]))
page.to_item() # Item(x=1, y=2, z=None)
page = SomePage(response, select_fields=SelectFields(exclude=["y"]))
page.to_item() # Item(x=1, y=None, z=3)
page = SomePage(response, select_fields=SelectFields(include=("x", "y"), exclude=["y", "z"]))
page.to_item() # Item(x=1, y=None, z=None)
page = SomePage(response, select_fields=SelectFields(exclude=["x"], swap_item_cls=SmallItem))
page.to_item() # SmallItem(y=2, z=3) The implementation should be more straightforward as it becomes a set operation (i.e. I think we can even take a step further by implemeting page = SomePage(response, select_fields=SelectFields(include=["x", "y"]))
page.to_item() # Item(x=1, y=2, z=None)
page.to_partial_item() # Item(x=1, y=2, z=None)
page = SomePage(response)
page.to_item() # Item(x=1, y=2, z=3)
page.to_partial_item(SelectFields(include=["x", "y"])) # Item(x=1, y=2, z=None) I think this can be re-used in scrapy-poet as well: class MySpider(scrapy.Spider):
# some code
def parse_item(
self,
response,
item: Annotated[Item, SelectFields(exclude=["z"])],
):
yield item or class MySpider(scrapy.Spider):
# some code
def parse_home(self, response):
yield response.follow(
some_url,
callback=self.parse_item,
meta={
"select_fields": Annotated[Item, SelectFields(exclude=["z"])])
}
)
# ItemProvider should be able to map the Annotated item from request.meta to
# the item declared in this callback, and perform the appropriate partial extraction.
def parse_item(self, response, item: Item):
yield item |
From the new propsal in #120, we can also have an Approach 3: from web_poet import field
@attrs.define
class SelectFields:
fields: Optional[Dict[str, bool]] = None
# Compatible with the default skip_nonitem_fields=False
on_unknown_field: Literal["ignore", "warn", "raise"] = "raise"
swap_item_cls: Optional[Type] = None
@attrs.define
class SomePage(WebPage[Item]):
select_fields: Optional[SelectFields] = None
@field
def x(self):
return 1
@field
def y(self):
return 2
@field(to_return=False) # 'to_return' could be renamed
def z(self):
return 3
page = SomePage(response)
page.to_item() # Item(x=1, y=2, z=None)
page = SomePage(response, select_fields=SelectFields(fields={"y": False}))
page.to_item() # Item(x=1, y=None, z=None)
page = SomePage(response, select_fields=SelectFields(fields={"x": True, "y": True, "z": True}))
page.to_item() # Item(x=1, y=2, z=2)
page = SomePage(response, select_fields=SelectFields(fields={"*": True}))
page.to_item() # Item(x=1, y=2, z=3)
page = SomePage(response, select_fields=SelectFields(fields={"*": False, "x": True}))
page.to_item() # Item(x=1, y=None, z=None) |
Stemming from scrapinghub/scrapy-poet#111 where we'd want to implement the API in web-poet itself regarding extracting data from a subset of fields.
API
The main directives that we want to support are:
include_fields
exclude_fields
Using both directives together should not be allowed.
via page object instance
This API assumes we already have an instance of the page object with the appropriate response data in it. Moreover, the item class can be inferred from the page object definition:
Arguably, we could also support page object classes as long as the URL is provided for the response data to be downloaded by the configured downloader.
via item class
Conversely, we could also support directly asking for the item class instead of the page object as long as we have access to the
ApplyRule
to infer their relationships. Unlike the page object, a single item class could have relationships to multiple page objects, depending on the URL.But this means that the response should still be downloaded and the downloader is configured.
There are several combinations of scenarios for this type of API.
Page object setup
A. The page object has all fields using the
@field
decoratorThis is quite straightforward to support since we can easily do:
We basically derive all the fields from the page object and call them one-by-one.
B. The page object doesn't use the
@field
decorator but solely utilizes the.to_item()
methodAlternatively, the page object can be defined as:
The methodology mentioned in scenario A above won't work here since calling
get_fields_dict(page_obj)
would result in an emptydict
.Instead, we can simply call the page object's
.to_item()
method and just include/exclude the given fields from there.C. The page object has some fields using the
@field
decorator while some fields are populated inside the.to_item()
methodThis scenario is much harder since calling
get_fields_dict(page_obj)
would result in a non-empty dict:{'x': FieldInfo(name='x', meta=None, out=None)}
.We could try to check if the page object has overridden the
.to_item()
method by something likepage_obj.__class__.to_item == web_poet.ItemPage.to_item
. However, we're also not sure if it has added any new fields at all or has simply overridden it to add some post-processing or validation operations. Either way, the resulting field value from the.to_item()
method (if it's overridden) could be totally different than calling the field directly.We could also detect this scenario whenever some fields specified in
include_fields=[...]
orexclude_fields=[...]
are not present inget_fields_dict(page_obj)
. If so, we can simply call the.to_item()
method and include/exclude fields from there.However, it's a wasteful operation since some fields could be expensive (i.e. having additional requests) and that's why they want to be excluded in the first place. But then, they were still unintentionally called via the
.to_item()
method.In this case, we'd rely on the page object developer to design their page objects well and ensure that our docs highlight this caveat.
But still, there's the question of how to handle fields specified in
include_fields=[...]
orexclude_fields=[...]
that are not existing at all. Let's tackle this in the further sections below (Spoiler: it'd be great to not support scenario C).Handling field presence
I. Some fields specified in
include_fields=[...]
are not existingAn example would be:
For this case, we can simply ignore producing the
z
field value since the page object does not support it.Moreover, if all of the given fields are not existing at all,
partial_item_from_page_obj(some_page, include_fields=["z"])
, an empty item would be returned.Note that this is could be related to scenario C above and we have to be careful since a given field might be declared without using the
@field
decorator.Because of these types of scenarios, it'd be hard to fully trust deriving the fields from a page object via
fields = get_fields_dict(page_obj)
.SOLUTION 1: we can make it clear to our users via our docs that we will only call
.to_item()
if the page object explicitly doesn't use any@field
decorators. This means that we won't be supporting scenario C at all.II. Some fields specified in
exclude_fields=[..]
are not existingThe same case with scenario I where we can simply ignore non-existing fields.
However, it has the same problem about supporting
.to_item()
for scenario C, since there might be some fields that's using the@field
decorator while the rest are produced via the.to_item()
method.To err on the side of caution, it could simply call
.to_item()
and then removing the fields declared inexclude_fields=[...]
. Better yet, go with SOLUTION 1 above as well.III. No fields were given in
include_fields=[...]
For this, we could simply return an item with empty fields.
If any fields are required but are missing (i.e.
None
), we simply let Python error it out:TypeError: __init__() missing 1 required positional argument: ...
.IV. No fields were given in
exclude_fields=[...]
We could return the item with full fields, basically calling the
.to_item()
.Item setup
1. The item has all fields marked as
Optional
There's no issue with this since including or excluding fields won't result into errors like
TypeError: __init__() missing 1 required positional argument: ...
.All of the above examples above has this item setup.
2. The item has fields marked as required
For example:
Unlike in scenario 1, this results in
TypeError: __init__() missing 1 required positional argument : ...
since they
field is required.One solution is to allow overriding the item class that the page object is returning which removes the required fields. Here's an example:
The other API could be:
Summary
We only support page object setups of scenarios A and B while support for C is dropped.
This makes it easier for the item setups in scenario I and II. Scenario III and IV should work whatever the case may be.
The item setup for scenario 1 is straightforward while scenario 2 needs a way to replace/override the item that a page object returns with a smaller version.
The text was updated successfully, but these errors were encountered: