Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allowing subfields to be extracted #115

Open
BurnzZ opened this issue Jan 20, 2023 · 5 comments
Open

Allowing subfields to be extracted #115

BurnzZ opened this issue Jan 20, 2023 · 5 comments
Labels

Comments

@BurnzZ
Copy link
Contributor

BurnzZ commented Jan 20, 2023

Stemming from scrapinghub/scrapy-poet#111 where we'd want to implement the API in web-poet itself regarding extracting data from a subset of fields.

API

The main directives that we want to support are:

  • include_fields
  • exclude_fields

Using both directives together should not be allowed.

via page object instance

item = partial_item_from_page_obj(product_page, include_fields=["x", "y"])
print(item)  # ProductItem(x=1, y=2, z=None)
print(type(product_page))  # ProductPage

This API assumes we already have an instance of the page object with the appropriate response data in it. Moreover, the item class can be inferred from the page object definition:

class ProductPage(WebPage[ProductItem]):
    ...

Arguably, we could also support page object classes as long as the URL is provided for the response data to be downloaded by the configured downloader.

via item class

Conversely, we could also support directly asking for the item class instead of the page object as long as we have access to the ApplyRule to infer their relationships. Unlike the page object, a single item class could have relationships to multiple page objects, depending on the URL.

But this means that the response should still be downloaded and the downloader is configured.

item = partial_item_from_item_cls(
    ProductItem, include_fields=["x", "y"], url="https://example.com"
)

There are several combinations of scenarios for this type of API.

Page object setup

A. The page object has all fields using the @field decorator

This is quite straightforward to support since we can easily do:

from web_poet.fields import get_fields_dict
from web_poet.utils import ensure_awaitable


fields = get_fields_dict(page_obj)
item_dict = item_from_fields_sync(page_obj, item_cls=dict, skip_nonitem_fields=False)
item = page_obj.item_cls(
    **{
        name: await ensure_awaitable(item_dict[name])
        for name in item_dict
        if name in field_names
    }
)

We basically derive all the fields from the page object and call them one-by-one.

B. The page object doesn't use the @field decorator but solely utilizes the .to_item() method

Alternatively, the page object can be defined as:

class ProductPage(WebPage[ProductItem]):
    def to_item(self) -> ProductItem:
        return ProductItem(x=1, y=2, z=3)

The methodology mentioned in scenario A above won't work here since calling get_fields_dict(page_obj) would result in an empty dict.

Instead, we can simply call the page object's .to_item() method and just include/exclude the given fields from there.

C. The page object has some fields using the @field decorator while some fields are populated inside the .to_item() method

class ProductPage(WebPage[ProductItem]):
    @field
    def x(self) -> int:
        return 1
        
    def to_item(self) -> ProductItem:
        return ProductItem(x=self.x, y=2, z=3)

This scenario is much harder since calling get_fields_dict(page_obj) would result in a non-empty dict: {'x': FieldInfo(name='x', meta=None, out=None)}.

We could try to check if the page object has overridden the .to_item() method by something like page_obj.__class__.to_item == web_poet.ItemPage.to_item. However, we're also not sure if it has added any new fields at all or has simply overridden it to add some post-processing or validation operations. Either way, the resulting field value from the .to_item() method (if it's overridden) could be totally different than calling the field directly.

We could also detect this scenario whenever some fields specified in include_fields=[...] or exclude_fields=[...] are not present in get_fields_dict(page_obj). If so, we can simply call the .to_item() method and include/exclude fields from there.

However, it's a wasteful operation since some fields could be expensive (i.e. having additional requests) and that's why they want to be excluded in the first place. But then, they were still unintentionally called via the .to_item() method.

In this case, we'd rely on the page object developer to design their page objects well and ensure that our docs highlight this caveat.

But still, there's the question of how to handle fields specified in include_fields=[...] or exclude_fields=[...] that are not existing at all. Let's tackle this in the further sections below (Spoiler: it'd be great to not support scenario C).

Handling field presence

I. Some fields specified in include_fields=[...] are not existing

An example would be:

@attrs.define
class SomeItem:
    x: Optional[int] = None
    y: Optional[int] = None
    
class SomePage(WebPage[SomeItem]):
    @field
    def x(self) -> int:
        return 1
        
    @field
    def y(self) -> int:
        return 2
        
partial_item_from_page_obj(some_page, include_field=["y", "z"])

For this case, we can simply ignore producing the z field value since the page object does not support it.

Moreover, if all of the given fields are not existing at all, partial_item_from_page_obj(some_page, include_fields=["z"]), an empty item would be returned.

Note that this is could be related to scenario C above and we have to be careful since a given field might be declared without using the @field decorator.

class SomePage(WebPage[SomeItem]):
    @field
    def x(self) -> int:
        return 1
        
    def to_item(self) -> SomeItem:
        return SomeItem(x=1, y=2)
        
partial_item_from_page_obj(some_page, include_fields=["y"])

Because of these types of scenarios, it'd be hard to fully trust deriving the fields from a page object via fields = get_fields_dict(page_obj).

SOLUTION 1: we can make it clear to our users via our docs that we will only call .to_item() if the page object explicitly doesn't use any @field decorators. This means that we won't be supporting scenario C at all.

II. Some fields specified in exclude_fields=[..] are not existing

The same case with scenario I where we can simply ignore non-existing fields.

However, it has the same problem about supporting .to_item() for scenario C, since there might be some fields that's using the @field decorator while the rest are produced via the .to_item() method.

To err on the side of caution, it could simply call .to_item() and then removing the fields declared in exclude_fields=[...]. Better yet, go with SOLUTION 1 above as well.

III. No fields were given in include_fields=[...]

For this, we could simply return an item with empty fields.

If any fields are required but are missing (i.e. None), we simply let Python error it out: TypeError: __init__() missing 1 required positional argument: ....

IV. No fields were given in exclude_fields=[...]

We could return the item with full fields, basically calling the .to_item().

Item setup

1. The item has all fields marked as Optional

There's no issue with this since including or excluding fields won't result into errors like TypeError: __init__() missing 1 required positional argument: ....

All of the above examples above has this item setup.

2. The item has fields marked as required

For example:

@attrs.define
class SomeItem:
    x: int
    y: int
    
class SomePage(ItemPage[SomeItem]):
    def x(self) -> int:
        return 1
        
    def y(self) -> int:
        return 2
        
partial_item_from_page_obj(some_page, include_fields=["x"])

Unlike in scenario 1, this results in TypeError: __init__() missing 1 required positional argument : ... since the y field is required.

One solution is to allow overriding the item class that the page object is returning which removes the required fields. Here's an example:

@attrs.define
class SomeSmallerItem:
    x: int
    
partial_item_from_page_obj(some_page, include_fields=["x"], item_cls=SomeSmallerItem)

The other API could be:

partial_item_from_item_cls(
    SomeItem, include_fields=["x"], url="https://example.com", replace_item_cls=SomeSmallerItem,
)

Summary

We only support page object setups of scenarios A and B while support for C is dropped.

This makes it easier for the item setups in scenario I and II. Scenario III and IV should work whatever the case may be.

The item setup for scenario 1 is straightforward while scenario 2 needs a way to replace/override the item that a page object returns with a smaller version.

@Gallaecio
Copy link
Member

Gallaecio commented Jan 23, 2023

Should we consider making an API change to the page object API?

I am thinking that if we put the field-filtering logic in to_item by either extending to_item to support additional, optional keyword-only parameters (e.g. include_fields: Optional[List] = None, exclude_fields: Optional[List] = None, on_unknown_field: Literal["ignore", "warn", "raise"] = "warn") or providing a new alternative method that does so (e.g. get_partial_item), supporting scenario C may be easier.

@BurnzZ BurnzZ pinned this issue Jan 24, 2023
@BurnzZ BurnzZ unpinned this issue Jan 24, 2023
@BurnzZ
Copy link
Contributor Author

BurnzZ commented Jan 24, 2023

Thanks @Gallaecio . Following your suggestion alongside @kmike's (during our call), we can compare the following approaches.

First, let's assume we have the following items:

@attrs.define
class Item:
    x: int
    y: Optional[int] = None
    z: Optional[int] = None


@attrs.define
class SmallItem:
    """Doesn't have the required 'x' field."""
    y: Optional[int] = None
    z: Optional[int] = None

Approach 1. Modifying .to_item() method

T = TypeVar('T') 

@attrs.define
class SomePage(WebPage[Item]):

    def to_item(
        self,
        *,
        include_fields: Optional[List] = None,
        exclude_fields: Optional[List] = None,
        swap_item_cls: Optional[T] = None,
        on_unknown_field: Literal["ignore", "warn", "raise"] = "warn",
    ) -> Union[Item, T]:
        ...

page = SomePage(response)

# The same PO can be re-used again to create different partial items.
page.to_item(include_fields=["x", "y"])  # Item(x=1, y=2, z=None)
page.to_item(include_fields=["x", "z"])  # Item(x=1, y=None, z=3)

# Use other item classes to avoid some required fields from causing an error.
page.to_item(exclude=["x"], swap_item_cls=SmallItem)  # SmallItem(y=2, z=3)

# behavior when encountering unknown fields could be controlled.
page.to_item(include_fields=["x", "b"], on_unknown_field="ignore")  # Item(x=1, y=None, z=None)

Approach 2. via dependencies

from web_poet import PageParams

@attrs.define
class SomePage(WebPage[Item]):
    # Users need to support the following to accept None so that it's still
    # compatible with returning the entire item data.

    # Perhaps we could also just remove this and rely on page_params?
    select_fields: Union[IncludeFields, ExcludeFields, None] = None

    # Contains 'swap_item_cls' and 'on_unknown_field' like the one in approach 1.
    page_params: Optional[PageParams] = None

# The PO needs multiple instances if different partial_items are needed 
# from the same response. Although might not be that bad since making POs
# are cheap. The only concern could be a slight increase in memory usage.

page = SomePage(response, select_fields=IncludeFields("x", "y"))
page.to_item()  # Item(x=1, y=2, z=None)

page = SomePage(response, select_fields=IncludeFields("x", "z"))
page.to_item()  # Item(x=1, y=None, z=3)

# Technically, we can't fully trust the .to_item() method since the user
# might've overridden it incorrectly. We'll probably need an external
# utility for extracting the partial item. Approach 1 could also be prone
# to such user mistakes. One advantage of approach 2 is that it has
# access to such metadata from the instance itself.

get_partial_item(page) # Item(x=1, y=None, z=3)

# But I'm not sure about this other `get_partial_item()` utility function.
# It's probably easier to write a good PO than to use this extra utility.

# Use other item classes to avoid some required fields from causing an error.
page = SomePage(
    response,
    select_fields=ExcludeFields("x"),
    page_params=PageParams({"swap_item_cls": SmallItem})
)
page.to_item()  # SmalItem(y=2, z=3)

# behavior when encountering unknown fields could be controlled.
page = SomePage(
    response,
    select_fields=IncludeFields("x", "b"),
    page_params=PageParams({"on_unknown_field": "ignore"})
)
page.to_item()  # Item(x=1, y=None, z=None)

I initially thought of avoiding from modifying the signature of the .to_item() method to prevent affecting the existing page objects written by users. But after this quick exercise, I think it should be okay and should be backward compatible after users modify their .to_item() method signatures to either accept the new params or just add **kwargs. Though generally, their code wouldn't break as long as those parameters aren't passed to .to_item().

Approach 2 could be a bit more challenging to implement in scrapy-poet since the IncludeFields/ExcludeFields dependencies need to be fulfilled first before the ItemProvider can properly start its work (ref PR).

I'm inclined to go with Approach 1 on this case.

Although I might be missing some other important use cases or approaches. Looking forward for your thoughts.

@kmike
Copy link
Member

kmike commented Jan 25, 2023

Though generally, their code wouldn't break as long as those parameters aren't passed to .to_item().

The code shouldn't break even if scrapy-poet user decides to exclude fields. I.e. we should support a case where page object has to_item method without parameters, but user requests some fields to be excluded.

You can implement it by checking the signature, and not passing the parameters if they're not supported. The issue is that it's more code, which is not needed if we use dependency approach.

select_fields: Union[IncludeFields, ExcludeFields, None] = None

Does it mean you can either pass include fields, or exclude fields, but not both at the same time?

It also seems it might be not very convenient to write code which uses select_fields:

def to_item(self):
    item = {}
    if not (isinstance(self.select_fields, ExcludeFields) and "price" in self.select_fields):
        # ... extract price
        # fixme: IncludeFields are not handled. 
        item["price"] = self.css(...)
    ...  

Handling both may be even trickier; you also need to handle a self.select_fields=None case. Could you please try writing an example? How can we make it more manageable?

I was thinking about using a single dependency for it:

select_fields: IncludeExcludeOptions

# ...

def to_item(self):
    item = {}
    if self.select_fields.not_excluded("price"):
        # this should handle both include and exclude options; also, None handling is not needed
        item["price"] = self.css(...)

@BurnzZ
Copy link
Contributor Author

BurnzZ commented Jan 25, 2023

Does it mean you can either pass include fields, or exclude fields, but not both at the same time?

Yes. Though we can include the possibility of include=["x", "y"], exclude=["y", "z"] which becomes ["x"].

What do you think about this modified Approach 2?

@attrs.define
class SelectFields:
    # could be Union[Iterable[str], str] to support a terse usage.
    include: Optional[Iterable[str]] = None
    exclude: Optional[Iterable[str]] = None

    # Compatible with the default skip_nonitem_fields=False
    on_unknown_field: Literal["ignore", "warn", "raise"] = "raise"
    swap_item_cls: Optional[Type] = None


@attrs.define
class SomePage(WebPage[Item]):
    select_fields: Optional[SelectFields] = None

    def to_item(self):
        ...

page = SomePage(response, select_fields=SelectFields(include=["x", "y"]))
page.to_item()  # Item(x=1, y=2, z=None)

page = SomePage(response, select_fields=SelectFields(exclude=["y"]))
page.to_item()  # Item(x=1, y=None, z=3)

page = SomePage(response, select_fields=SelectFields(include=("x", "y"), exclude=["y", "z"]))
page.to_item()  # Item(x=1, y=None, z=None)

page = SomePage(response, select_fields=SelectFields(exclude=["x"], swap_item_cls=SmallItem))
page.to_item()  # SmallItem(y=2, z=3)

The implementation should be more straightforward as it becomes a set operation (i.e. (set(include or []) or set(po_fields)) - set(exclude or [])).

I think we can even take a step further by implemeting .to_partial_item() which .to_item() calls behind the scenes, if SelectFields is available.

page = SomePage(response, select_fields=SelectFields(include=["x", "y"]))
page.to_item()         # Item(x=1, y=2, z=None)
page.to_partial_item() # Item(x=1, y=2, z=None)

page = SomePage(response)
page.to_item()                                         # Item(x=1, y=2, z=3)
page.to_partial_item(SelectFields(include=["x", "y"])) # Item(x=1, y=2, z=None)

I think this can be re-used in scrapy-poet as well:

class MySpider(scrapy.Spider):
    # some code

    def parse_item(
        self,
        response,
        item: Annotated[Item, SelectFields(exclude=["z"])],
    ):
        yield item

or

class MySpider(scrapy.Spider):
    # some code

    def parse_home(self, response):
        yield response.follow(
            some_url,
            callback=self.parse_item,
            meta={
                "select_fields": Annotated[Item, SelectFields(exclude=["z"])])
            }
        )

    # ItemProvider should be able to map the Annotated item from request.meta to 
    # the item declared in this callback, and perform the appropriate partial extraction.
    def parse_item(self, response, item: Item):
        yield item

@BurnzZ BurnzZ mentioned this issue Jan 27, 2023
2 tasks
@BurnzZ
Copy link
Contributor Author

BurnzZ commented Jan 30, 2023

From the new propsal in #120, we can also have an Approach 3:

from web_poet import field

@attrs.define
class SelectFields:
    fields: Optional[Dict[str, bool]] = None

    # Compatible with the default skip_nonitem_fields=False
    on_unknown_field: Literal["ignore", "warn", "raise"] = "raise"
    swap_item_cls: Optional[Type] = None

@attrs.define
class SomePage(WebPage[Item]):
    select_fields: Optional[SelectFields] = None

    @field
    def x(self):
        return 1   

    @field
    def y(self):
        return 2

    @field(to_return=False)  # 'to_return' could be renamed
    def z(self):
        return 3

page = SomePage(response)
page.to_item()  # Item(x=1, y=2, z=None)

page = SomePage(response, select_fields=SelectFields(fields={"y": False}))
page.to_item()  # Item(x=1, y=None, z=None)

page = SomePage(response, select_fields=SelectFields(fields={"x": True, "y": True, "z": True}))
page.to_item()  # Item(x=1, y=2, z=2)

page = SomePage(response, select_fields=SelectFields(fields={"*": True}))
page.to_item()  # Item(x=1, y=2, z=3)

page = SomePage(response, select_fields=SelectFields(fields={"*": False, "x": True}))
page.to_item()  # Item(x=1, y=None, z=None)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants