From 677c1fef5be598d7a5cf63cd13e4171fd5dbeb7b Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Adri=C3=A1n=20Chaves?= Date: Thu, 25 Jan 2024 15:09:47 +0100 Subject: [PATCH] Refactor the additional request docs --- docs/frameworks/additional-requests.rst | 8 +- docs/index.rst | 1 - docs/intro/overview.rst | 2 + docs/page-objects/additional-requests.rst | 857 +++------------------- docs/page-objects/input-validation.rst | 118 --- docs/page-objects/inputs.rst | 118 +++ 6 files changed, 212 insertions(+), 892 deletions(-) delete mode 100644 docs/page-objects/input-validation.rst diff --git a/docs/frameworks/additional-requests.rst b/docs/frameworks/additional-requests.rst index 7a5492f3..5d281317 100644 --- a/docs/frameworks/additional-requests.rst +++ b/docs/frameworks/additional-requests.rst @@ -140,10 +140,10 @@ syntax. Exception Handling ------------------ -In the previous :ref:`exception-handling` section, we can see how Page Object -developers could use the exception classes built inside **web-poet** to handle -various ways additional requests MAY fail. In this section, we'll see the -rationale and ways the framework MUST be able to do that. +Page Object developers could use the exception classes built inside +**web-poet** to handle various ways additional requests MAY fail. In this +section, we'll see the rationale and ways the framework MUST be able to do +that. Rationale ********* diff --git a/docs/index.rst b/docs/index.rst index 121ca257..f97a67db 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -28,7 +28,6 @@ web-poet page-objects/rules page-objects/fields page-objects/additional-requests - page-objects/input-validation page-objects/page-params page-objects/stats page-objects/testing diff --git a/docs/intro/overview.rst b/docs/intro/overview.rst index 003f2465..ded52901 100644 --- a/docs/intro/overview.rst +++ b/docs/intro/overview.rst @@ -1,3 +1,5 @@ +.. 
_overview: + ======== Overview ======== diff --git a/docs/page-objects/additional-requests.rst b/docs/page-objects/additional-requests.rst index 73c381cc..ac82dd07 100644 --- a/docs/page-objects/additional-requests.rst +++ b/docs/page-objects/additional-requests.rst @@ -4,799 +4,114 @@ Additional requests =================== -Websites nowadays needs a lot of page interactions to display or load some key -information. In most cases, these are done via AJAX requests. Some examples of these are: +Some websites require page interactions to load some information, such as +clicking a button, scrolling down or hovering on some element. These +interactions usually trigger background requests that are then loaded using +JavaScript. - * Clicking a button on a page to reveal other similar products. - * Clicking the `"Load More"` button to retrieve more images of a given item. - * Scrolling to the bottom of the page to load more items `(i.e. infinite scrolling)`. - * Hovering on a certain webpage element that reveals a tool-tip containing - additional page info. +To extract such data, reproduce those requests using :class:`~.HttpClient`. +Include :class:`~.HttpClient` among the :ref:`inputs ` of your +:ref:`page object `, and use an asynchronous :ref:`field +` or method to call one of its methods. For example, you can +call :meth:`HttpClient.execute <.HttpClient.execute>` with an +:class:`~.HttpRequest` as input to get an :class:`~.HttpResponse` as output. -As such, performing additional requests inside Page Objects are inevitable to -properly extract data for some websites. - -.. warning:: - - Additional requests made inside a Page Object aren't meant to represent - the **Crawling Logic** at all. They are simply a low-level way to interact - with today's websites which relies on a lot of page interactions to display - its contents. - -.. 
_httprequest-example: - -HttpRequest -=========== - -Additional requests are defined using a simple data container that represents -a generic HTTP Request: :class:`~.HttpRequest`. Here's an example: - -.. code-block:: python - - import json - import web_poet - - request = web_poet.HttpRequest( - url="https://www.api.example.com/product-pagination/", - method="POST", - headers={ - "Content-Type": "application/json;charset=UTF-8" - }, - body=json.dumps( - { - "Page": page_num, - "ProductID": product_id, - } - ).encode("utf-8"), - ) - - print(request.url) # https://www.api.example.com/product-pagination/ - print(type(request.url)) # - print(request.method) # POST - - print(type(request.headers) # - print(request.headers) # - print(request.headers.get("content-type")) # application/json;charset=UTF-8 - print(request.headers.get("does-not-exist")) # None - - print(type(request.body)) # - print(request.body) # b'{"Page": 1, "ProductID": 123}' - -There are a few things to take note here: - - * ``method`` is simply a **string**. - * ``url`` is represented by the :class:`~.RequestUrl` class. - * ``headers`` is represented by the :class:`~.HttpRequestHeaders` class which - resembles a ``dict``-like interface. It supports case-insensitive header-key - lookups as well as multi-key storage. - - * See :external:py:class:`multidict.CIMultiDict` for the set of features - since :class:`~.HttpRequestHeaders` simply inherits from it. - - * ``body`` is represented by the :class:`~.HttpRequestBody` class which is - simply a subclass of the ``bytes`` class. Using the ``body`` param of - :class:`~.HttpRequest` needs to have an input argument in ``bytes``. In our - code example, we've converted it from ``str`` to ``bytes`` using the ``encode()`` - string method. - -Most of the time though, what you'll be defining would be ``GET`` requests. Thus, -it's perfectly fine to define them as: - -.. 
code-block:: python - - import web_poet - - request = web_poet.HttpRequest("https://api.example.com/product-info?id=123") - - print(request.url) # https://api.example.com/product-info?id=123 - print(type(request.url)) # - print(request.method) # GET - - print(type(request.headers) # - print(request.headers) # - print(request.headers.get("content-type")) # None - print(request.headers.get("does-not-exist")) # None - - print(type(request.body)) # - print(request.body) # b'' - -The key take aways are: - - * The default value of ``method`` is ``GET``. - * ``headers`` still holds :class:`~.HttpRequestHeaders` which doesn't contain - anything. - * The same is true for ``body`` holding an empty :class:`~.HttpRequestBody`. - -Now that we know how :class:`~.HttpRequest` are structured, defining them doesn't -execute the actual requests at all. In order to do so, we'll need to feed it into -the :class:`~.HttpClient` which is defined in the next section (see -:ref:`httpclient` tutorial section). - -HttpResponse -============ - -:class:`~.HttpResponse` is what comes after a :class:`~.HttpRequest` has been -executed. It's typically returned by the methods from :class:`~.HttpClient` (see -:ref:`httpclient` tutorial section) which holds the information regarding the response. - -:class:`~.HttpResponse` can also be used as a Page Object dependency, -e.g. :class:`~.WebPage` uses it. - -.. note:: - - The additional requests are expected to perform redirections except when the - method is ``HEAD``. This means that the :class:`~.HttpResponse` that you'll - be receiving is already the end of the redirection trail. - -Let's check out an example to see its internals: - -.. 
code-block:: python - - import web_poet - - response = web_poet.HttpResponse( - url="https://www.api.example.com/product-pagination/", - body='{"data": "value 👍"}'.encode("utf-8"), - status=200, - headers={"Content-Type": "application/json;charset=UTF-8"} - ) - - print(response.url) # https://www.api.example.com/product-pagination/ - print(type(response.url)) # - - print(response.body) # b'{"data": "value \xf0\x9f\x91\x8d"}' - print(type(response.body)) # - - print(response.status) # 200 - print(type(response.status)) # - - print(response.headers) # - print(type(response.headers)) # - print(response.headers.get("content-type")) # application/json;charset=UTF-8 - print(response.headers.get("does-not-exist")) # None - - # These methods are also available: - - print(response.body.declared_encoding()) # None - print(response.body.json()) # {'data': 'value 👍'} - - print(response.headers.declared_encoding()) # utf-8 - - print(response.encoding) # utf-8 - print(response.text) # {"data": "value 👍"} - print(response.json()) # {'data': 'value 👍'} - -Despite what the example above showcases, you won't be typically defining -:class:`~.HttpResponse` yourself as it's the implementing framework (see -:ref:`framework-additional-requests`) that's responsible for it. Nonetheless, -it's important to understand its underlying structure in order to better access -its methods. - -Here are the key take aways from the example above: - - * ``status`` is simply an **int**. - * ``url`` is represented by the :class:`~.ResponseUrl` class. - * ``headers`` is represented by the :class:`~.HttpResponseHeaders` class. - It's similar to :class:`~.HttpRequestHeaders` where it inherits from - :external:py:class:`multidict.CIMultiDict`, granting it case-insensitive - header-key lookups as well as multi-key storage. - - * The **encoding** can be derived using the :meth:`~.HttpResponseHeaders.declared_encoding` - method. In this example, it was retrieved from the ``Content-Type`` header. 
- - * ``body`` is represented by the :class:`~.HttpResponseBody` class which is - simply a subclass of the ``bytes`` class. Using the ``body`` param of - :class:`~.HttpResponse` needs to have an input argument in ``bytes``. In our - code example, we've converted it from ``str`` to ``bytes`` using the ``encode()`` - string method. - - * Similar to the headers, the **encoding** can be derived using the - :meth:`~.HttpResponseBody.declared_encoding`. In this case, it returned - ``None`` since no encoding can be derived from the response body. - * A :meth:`~.HttpResponseBody.json` method is also available to conveniently - access decoded contents from JSON responses. It uses the derived **encoding** - to properly decode the contents like the 👍 emoji. - - * The :class:`~.HttpResponse` class itself also have these convenient methods: - - * The :meth:`~.HttpResponse.encoding` property method returns the proper - encoding of the response based on this hierarchy: - - * user-specified encoding (`using the` ``_encoding`` `attribute`) - * BOM from the body - * header encodings - * body encodings - - * Instead of accessing the raw bytes values `(which doesn't represent the - underlying content properly like the` 👍 `emoji)`, the :meth:`~.HttpResponse.text` - property method can be used which takes into account the derived **encoding** - when decoding the bytes value. - * The :meth:`~.HttpResponse.json` method is available as a shortcut to - :class:`~.HttpResponseBody`'s :meth:`~.HttpResponseBody.json` method. - -We've only explored a JSON response as a result from an additional request. Let's -take a look at another example having an HTML response: - -.. 
code-block:: python - - import web_poet - - response = web_poet.HttpResponse( - url="https://www.api.example.com/product-pagination/", - body=( - '' - ' ' - ' Some page' - ' ' - ' ' - ' Sample content 💯' - '' - ).encode("utf-8"), - status=200, - headers={} - ) - - print(response.headers.declared_encoding()) # None - print(response.body.declared_encoding()) # utf-8 - print(response.encoding) # utf-8 - - print(response.body.json()) # JSONDecodeError - print(response.json()) # JSONDecodeError - - print(type(response.selector)) # - - print(response.selector.css("body ::text").get()) # Sample content 💯 - print(response.css("body ::text").get()) # Sample content 💯 - - print(response.selector.xpath("//body/text()").get()) # Sample content 💯 - print(response.xpath("//body/text()").get()) # Sample content 💯 - -The key take aways for this example are: - - * The **encoding** is derived from the body inside the ``meta`` tags since the - ``headers`` is empty for this example. - * Since we now have an HTML response, using :meth:`~.HttpResponseBody.json` - method would raise a ``JSONDecodeError`` as a JSON document cannot be - parsed from it. - * The :meth:`~.HttpResponse.selector` property is an instance of - :external:py:class:`parsel.selector.Selector`; there are also - :meth:`~.HttpResponse.css` and :meth:`~.HttpResponse.xpath` methods. - - * Usually there's no need to use :meth:`~.HttpResponse.selector`, as - :meth:`~.HttpResponse.css` and :meth:`~.HttpResponse.xpath` are - available. - - -.. _httpclient: - -HttpClient -========== - -The main interface for executing additional requests would be :class:`~.HttpClient`. -It also has full support for :mod:`asyncio` enabling developers to perform -additional requests asynchronously using :py:func:`asyncio.gather`, -:py:func:`asyncio.wait`, etc. This means that :mod:`asyncio` could be used anywhere -inside the Page Object, including the :meth:`~.ItemPage.to_item` method. 
- -In the previous section, we've explored how :class:`~.HttpRequest` is defined. -Let's see a few quick examples to see how to execute additional requests using -the :class:`~.HttpClient`. - -Executing a HttpRequest instance --------------------------------- - -.. code-block:: python - - import attrs - import web_poet - from web_poet import validates_input - - - @attrs.define - class ProductPage(web_poet.WebPage): - http: web_poet.HttpClient - - @validates_input - async def to_item(self): - item = { - "url": self.url, - "name": self.css("#main h3.name ::text").get(), - "product_id": self.css("#product ::attr(product-id)").get(), - } - - # Simulate clicking on a button that says "View All Images" - request = web_poet.HttpRequest(f"https://api.example.com/v2/images?id={item['product_id']}") - response: web_poet.HttpResponse = await self.http.execute(request) - - item["images"] = response.css(".product-images img::attr(src)").getall() - return item - -As the example suggests, we're performing an additional request that allows us -to extract more images in a product page that might not be otherwise be possible. -This is because in order to do so, an additional button needs to be clicked -which fetches the complete set of product images via AJAX. - -There are a few things to take note of this example: - - * Recall from the :ref:`httprequest-example` tutorial section that the - default method is ``GET``. Thus, the ``method`` parameter can be omitted - for simple ``GET`` requests. - * We're now using the ``async/await`` syntax inside the :meth:`~.ItemPage.to_item` - method. - * The response from the additional request is of type :class:`~.HttpResponse`. - -.. tip:: - - Check out the :ref:`http-batch-request-example` tutorial section to see how - to execute a group of :class:`~.HttpRequest` in batch. 
- -Fortunately, there are already some quick shortcuts on how to perform single -additional requests using the :meth:`~.HttpClient.request`, :meth:`~.HttpClient.get`, -and :meth:`~.HttpClient.post` methods of :class:`~.HttpClient`. These already -define the :class:`~.HttpRequest` and executes it as well. - -.. _httpclient-get-example: - -A simple ``GET`` request ------------------------- - -Let's use the example from the previous section and use the :meth:`~.HttpClient.get` -method on it. - -.. code-block:: python - - import attrs - import web_poet - from web_poet import validates_input - - - @attrs.define - class ProductPage(web_poet.WebPage): - http: web_poet.HttpClient - - @validates_input - async def to_item(self): - item = { - "url": self.url, - "name": self.css("#main h3.name ::text").get(), - "product_id": self.css("#product ::attr(product-id)").get(), - } - - # Simulates clicking on a button that says "View All Images" - response: web_poet.HttpResponse = await self.http.get( - f"https://api.example.com/v2/images?id={item['product_id']}" - ) - item["images"] = response.css(".product-images img::attr(src)").getall() - return item - -There are a few things to take note in this example: - - * A ``GET`` request can be done via :class:`~.HttpClient`'s - :meth:`~.HttpClient.get` method. - * There is no need create an instance of :class:`~.HttpRequest` when - :meth:`~.HttpClient.get` is used. - -.. _request-post-example: - -A ``POST`` request with `header` and `body` -------------------------------------------- - -Let's see another example which needs ``headers`` and ``body`` data to process -additional requests. - -In this example, we'll paginate related items in a carousel. These are -usually lazily loaded by the website to reduce the amount of information -rendered in the DOM that might not otherwise be viewed by all users anyway. 
- -Thus, additional requests inside the Page Object are typically needed for it: +For example, simulating a click on a button that loads product images could +look like: .. code-block:: python import attrs - import web_poet - from web_poet import validates_input + from web_poet import HttpClient, HttpError, WebPage, field + from zyte_common_items import Product @attrs.define - class ProductPage(web_poet.WebPage): - http: web_poet.HttpClient - - @validates_input - async def to_item(self): - item = { - "url": self.url, - "name": self.css("#main h3.name ::text").get(), - "product_id": self.css("#product ::attr(product-id)").get(), - "related_product_ids": self.parse_related_product_ids(self), - } - - # Simulates "scrolling" through a carousel that loads related product items - response: web_poet.HttpResponse = await self.http.post( - url="https://www.api.example.com/related-products/", - headers={ - "Content-Type": "application/json;charset=UTF-8" - }, - body=json.dumps( - { - "Page": 2, - "ProductID": item["product_id"], - } - ).encode("utf-8"), - ) - item["related_product_ids"].extend(self.parse_related_product_ids(response)) - return item - - @staticmethod - def parse_related_product_ids(response_page) -> List[str]: - return response_page.css("#main .related-products ::attr(product-id)").getall() - -Here's the key takeaway in this example: - - * Similar to :class:`~.HttpClient`'s :meth:`~.HttpClient.get` method, - a :meth:`~.HttpClient.post` method is also available. It is - often used to submit forms. - -Other Single Requests ---------------------- - -The :meth:`~.HttpClient.get` and :meth:`~.HttpClient.post` methods are merely -quick shortcuts for :meth:`~.HttpClient.request`: - -.. 
code-block:: python - - client = HttpClient() - - url = "https://api.example.com/v1/data" - headers = {"Content-Type": "application/json;charset=UTF-8"} - body = b'{"data": "value"}' - - # These are the same: - response = await client.get(url) - response = await client.request(url, method="GET") - - # The same goes for these: - response = await client.post(url, headers=headers, body=body) - response = await client.request(url, method="POST", headers=headers, body=body) - -Thus, apart from the common ``GET`` and ``POST`` HTTP methods, you can use -:meth:`~.HttpClient.request` for them (`e.g.` ``HEAD``, ``PUT``, ``DELETE``, etc). - -.. _http-batch-request-example: - -Batch requests --------------- - -We can also choose to process requests by **batch** instead of sequentially or -one by one (e.g. using :meth:`~.HttpClient.execute`). The :meth:`~.HttpClient.batch_execute` -method can be used for this which accepts an arbitrary number of :class:`~.HttpRequest` -instances. - -Let's modify the example in the previous section to see how it can be done. - -The difference for this code example from the previous section is that we're -increasing the pagination from only the **2nd page** into the **10th page**. -Instead of calling a single :meth:`~.HttpClient.post` method, we're creating a -list of :class:`~.HttpRequest` to be executed in batch using the -:meth:`~.HttpClient.batch_execute` method. - -.. 
code-block:: python - - from typing import List + class ProductPage(WebPage[Product]): + http: HttpClient - import attrs - import web_poet - from web_poet import validates_input + @field + def productId(self): + return self.css("::attr(product-id)").get() + @field + async def images(self): + api_url = f"https://api.example.com/v2/images?id={self.productId}" + try: + response = await self.http.get(api_url) + except HttpError: + return [] + else: + return response.css(".product-images img::attr(src)").getall() - @attrs.define - class ProductPage(web_poet.WebPage): - http: web_poet.HttpClient - - default_pagination_limit = 10 - - @validates_input - async def to_item(self): - item = { - "url": self.url, - "name": self.css("#main h3.name ::text").get(), - "product_id": self.css("#product ::attr(product-id)").get(), - "related_product_ids": self.parse_related_product_ids(self), - } - - requests: List[web_poet.HttpRequest] = [ - self.create_request(item["product_id"], page_num=page_num) - for page_num in range(2, self.default_pagination_limit) - ] - responses: List[web_poet.HttpResponse] = await self.http.batch_execute(*requests) - related_product_ids = [ - id_ - for response in responses - for product_ids in self.parse_related_product_ids(response) - for id_ in product_ids - ] +.. 
warning:: - item["related_product_ids"].extend(related_product_ids) - return item - - def create_request(self, product_id, page_num=2): - # Simulates "scrolling" through a carousel that loads related product items - return web_poet.HttpRequest( - url="https://www.api.example.com/product-pagination/", - method="POST", - headers={ - "Content-Type": "application/json;charset=UTF-8" - }, - body=json.dumps( - { - "Page": page_num, - "ProductID": product_id, - } - ).encode("utf-8"), - ) - - @staticmethod - def parse_related_product_ids(response_page) -> List[str]: - return response_page.css("#main .related-products ::attr(product-id)").getall() - -The key takeaways for this example are: - - * An :class:`~.HttpRequest` can be instantiated to represent a Generic HTTP Request. - It only contains the HTTP Request information for now and isn't executed yet. - This is useful for creating factory methods to help create requests without any - download execution at all. - * :class:`~.HttpClient` has a :meth:`~.HttpClient.batch_execute` method that - can process a list of :class:`~.HttpRequest` instances asynchronously together. - -.. tip:: - - The :meth:`~.HttpClient.batch_execute` method can execute multiple - :class:`~.HttpRequest` instances. For example, it could be a mixture - of ``GET`` and ``POST`` requests or even - representing requests for various parts of the page altogether. - - Processing the additional requests in batch is useful since it takes advantage - of async execution which could be faster in certain cases `(assuming you're - allowed to perform HTTP requests in parallel)`. - - Nonetheless, you can still use the :meth:`~.HttpClient.batch_execute` method - to execute a single :class:`~.HttpRequest` instance. + :class:`~.HttpClient` should only be used to handle the type of scenarios + mentioned above. Using :class:`~.HttpClient` for crawling logic would + defeat :ref:`the purpose of web-poet `. .. 
note:: - The :meth:`~.HttpClient.batch_execute` method is a simple wrapper over - :py:func:`asyncio.gather`. Developers are free to use other functionalities - available inside :mod:`asyncio` to handle multiple requests. - - For example, :py:func:`asyncio.as_completed` can be used to process the - first response from a group of requests as early as possible. However, the - order could be shuffled. + :meth:`HttpClient.execute <.HttpClient.execute>` is expected to follow any + redirection except when the request method is ``HEAD``. This means that the + :class:`~.HttpResponse` that you get is already the end of any redirection + trail. -.. _exception-handling: - -Handling Exceptions in Page Objects -=================================== +Concurrent requests +=================== -Let's have a look at how we could handle exceptions when performing additional -requests inside Page Objects. For this example, let's improve the code snippet -from the previous subsection named: :ref:`httpclient-get-example`. +To send multiple requests concurrently, use :meth:`HttpClient.batch_execute +<.HttpClient.batch_execute>`, which accepts any number of +:class:`~.HttpRequest` instances as input, and returns :class:`~.HttpResponse` +instances (and :class:`~.HttpError` instances when using +``return_exceptions=True``) in the input order. For example: .. 
code-block:: python - import logging - import attrs - import web_poet - from web_poet import validates_input - - logger = logging.getLogger(__name__) + from web_poet import HttpClient, HttpError, HttpRequest, WebPage, field + from zyte_common_items import Product, ProductVariant @attrs.define - class ProductPage(web_poet.WebPage): - http: web_poet.HttpClient - - @validates_input - async def to_item(self): - item = { - "url": self.url, - "name": self.css("#main h3.name ::text").get(), - "product_id": self.css("#product ::attr(product-id)").get(), - } - - try: - # Simulates clicking on a button that says "View All Images" - response: web_poet.HttpResponse = await self.http.get( - f"https://api.example.com/v2/images?id={item['product_id']}" - ) - except web_poet.exceptions.HttpRequestError as err: - logger.warning( - f"Unable to request images for product ID '{item['product_id']}' " - f"using this request: {err.request}" - ) - except web_poet.exceptions.HttpResponseError as err: - logger.warning( - f"Received a {err.response.status} response status for product ID " - f"'{item['product_id']}' from this URL: {err.request.url}" - ) - else: - item["images"] = response.css(".product-images img::attr(src)").getall() - - return item - -In this code example, the code became more resilient on cases where it wasn't -possible to retrieve more images using the website's public API. It could be -due to anything like `SSL errors`, `connection errors`, `page not found`, etc. - -Using :class:`~.HttpClient` to execute requests raises exceptions with the base -class of type :class:`web_poet.exceptions.http.HttpError` irregardless of how -the HTTP Downloader is implemented. From our example above, we could've simply -used the :class:`web_poet.exceptions.http.HttpError` base error. However, it's -ambiguous in the sense that the error could originate during the HTTP Request -execution or when receiving the HTTP Response. 
- -A more specific :class:`web_poet.exceptions.http.HttpRequestError` exception is -raised when the :class:`~.HttpRequest` was being handled while the -:class:`web_poet.exceptions.http.HttpResponseError` is raised when receiving -a response with an HTTP error. Notice from the example that the exceptions have -the attributes like ``request`` and ``response`` which are respective instance of -:class:`~.HttpRequest` and :class:`~.HttpResponse`. Accessing them would be useful -to debug and log the problems. - -Note that :class:`web_poet.exceptions.http.HttpResponseError` only occurs when -receiving responses with status codes in the ``400-5xx`` range. However, this -behavior could be altered by using the ``allow_status`` param in the methods of -:class:`~.HttpClient`. - -.. note:: - - In the future, more specific exceptions which inherits from the base - :class:`web_poet.exceptions.http.HttpError` exception would be available. - This should allow developers writing Page Objects to properly identify what - went wrong and act specifically based on the problem. - -Let's take another example when executing requests in batch as opposed to using -single requests via these methods of the :class:`~.HttpClient`: -:meth:`~.HttpClient.request`, :meth:`~.HttpClient.get`, and :meth:`~.HttpClient.post`. - -For this example, let's improve the code snippet from the previous subsection named: -:ref:`http-batch-request-example`. - -.. 
code-block:: python - - import logging - from typing import List, Union + class ProductPage(WebPage[Product]): + http: HttpClient - import attrs - import web_poet - from web_poet import validates_input + max_variants = 10 + @field + def productId(self): + return self.css("::attr(product-id)").get() - @attrs.define - class ProductPage(web_poet.WebPage): - http: web_poet.HttpClient - - default_pagination_limit = 10 - - @validates_input - async def to_item(self): - item = { - "url": self.url, - "name": self.css("#main h3.name ::text").get(), - "product_id": self.css("#product ::attr(product-id)").get(), - "related_product_ids": self.parse_related_product_ids(self), - } - - requests: List[web_poet.HttpRequest] = [ - self.create_request(item["product_id"], page_num=page_num) - for page_num in range(2, self.default_pagination_limit) + @field + async def variants(self): + requests = [ + HttpRequest(f"https://example.com/api/variant/{self.productId}/{index}") + for index in range(self.max_variants) ] - - try: - responses: List[web_poet.HttpResponse] = await self.http.batch_execute(*requests) - except web_poet.exceptions.HttpError: - logger.warning( - f"Unable to request for more related products for product ID: {item['product_id']}" - ) - else: - related_product_ids = [] - for response in responses: - related_product_ids.extend( - [ - id_ - for product_ids in self.parse_related_product_ids(response) - for id_ in product_ids - ] - ) - item["related_product_ids"].extend(related_product_ids) - - return item - - def create_request(self, product_id, page_num=2): - # Simulates "scrolling" through a carousel that loads related product items - return web_poet.HttpRequest( - url="https://www.api.example.com/product-pagination/", - method="POST", - headers={ - "Content-Type": "application/json;charset=UTF-8" - }, - body=json.dumps( - { - "Page": page_num, - "ProductID": product_id, - } - ).encode("utf-8"), - ) - - @staticmethod - def parse_related_product_ids(response_page) -> 
List[str]: - return response_page.css("#main .related-products ::attr(product-id)").getall() - -Handling exceptions using :meth:`~.HttpClient.batch_execute` remains largely the same. -However, the main difference is that you may be wasting perfectly good responses just -because a single request from the batch ruined it. Notice that we're using the base -exception class of :class:`web_poet.exceptions.http.HttpError` to account for any -type of errors, both during the HTTP Request execution and when receiving the -response. - -An alternative approach would be salvaging good responses altogether. For example, you've -sent out 10 :class:`~.HttpRequest` and only 1 of them had an exception during processing. -You can still get the data from 9 of the :class:`~.HttpResponse` by passing the parameter -``return_exceptions=True`` to :meth:`~.HttpClient.batch_execute`. - -This means that any exceptions raised during the HTTP execution are returned alongside any -of the successful responses. The return type of :meth:`~.HttpClient.batch_execute` could -be a mixture of :class:`~.HttpResponse` and :class:`web_poet.exceptions.http.HttpError` -(*and its exception subclasses*). - -Here's an example: - -.. code-block:: python - - # Revised code snippet from the to_item() method - - requests: List[web_poet.HttpRequest] = [ - self.create_request(item["product_id"], page_num=page_num) - for page_num in range(2, self.default_pagination_limit) - ] - - responses: List[Union[web_poet.HttpResponse, web_poet.exceptions.HttpError]] = ( - await self.http.batch_execute(*requests, return_exceptions=True) - ) - - related_product_ids = [] - for i, response in enumerate(responses): - if isinstance(response, web_poet.exceptions.HttpError): - logger.warning( - f"Unable to request related products for product ID '{item['product_id']}' " - f"using this request: {requests[i]}. Reason: {response}." 
- ) - continue - related_product_ids.extend( - [ - id_ - for product_ids in self.parse_related_product_ids(response) - for id_ in product_ids + responses = await self.http.batch_execute(*requests, return_exceptions=True) + return [ + ProductVariant(color=response.css("::attr(color)").get()) + for response in responses + if not isinstance(response, HttpError) ] - ) - - item["related_product_ids"].extend(related_product_ids) - return item -From the example above, we're now checking the list of responses to see if any -exceptions are included in it. If so, we're simply logging it down and ignoring -it. In this way, perfectly good responses can still be processed through. +You can alternatively use :mod:`asyncio` together with :class:`~.HttpClient` to +handle multiple requests. For example, you can use :func:`asyncio.as_completed` +to process the first response from a group of requests as early as possible. .. _retries-additional-requests: -Retrying Additional Requests +Retrying additional requests ============================ -When the bad response data comes from :ref:`additional requests -`, you must handle retries on your own. +:ref:`Input validation ` allows retrying all inputs from a +page object. To retry only additional requests, you must handle retries on your +own. -The page object code is responsible for retrying additional requests until good -response data is received, or until some maximum number of retries is exceeded. +Your code is responsible for retrying additional requests until good response +data is received, or until some maximum number of retries is exceeded. It is up to you to decide what the maximum number of retries should be for a given additional request, based on your experience with the target website. 
@@ -812,26 +127,30 @@ times before giving up: import attrs from tenacity import retry, stop_after_attempt - from web_poet import HttpClient, WebPage, validates_input + from web_poet import HttpClient, HttpError, WebPage, field + from zyte_common_items import Product + @attrs.define - class MyPage(WebPage): + class ProductPage(WebPage[Product]): http: HttpClient + @field + def productId(self): + return self.css("::attr(product-id)").get() + @retry(stop=stop_after_attempt(3)) - async def get_data(self): - response = await self.http.get("https://toscrape.com/") - if not response.css(".expected"): - raise ValueError - return response.css(".data").get() - - @validates_input - async def to_item(self) -> dict: + @retry(stop=stop_after_attempt(3), reraise=True) + async def get_images(self): + return await self.http.get(f"https://api.example.com/v2/images?id={self.productId}") + + @field + async def images(self): try: - data = await self.get_data() - except ValueError: - return {} - return {"data": data} + response = await self.get_images() + except HttpError: + return [] + else: + return response.css(".product-images img::attr(src)").getall() If the reason your additional request fails is outdated or missing data from page object input, do not try to reproduce the request for that input as an diff --git a/docs/page-objects/input-validation.rst b/docs/page-objects/input-validation.rst deleted file mode 100644 index dba39651..00000000 --- a/docs/page-objects/input-validation.rst +++ /dev/null @@ -1,118 +0,0 @@ -.. _input-validation: - -================ -Input validation -================ - -Sometimes the data that your page object receives as input may be invalid. - -You can define a ``validate_input`` method in a page object class to check its -input data and determine how to handle invalid input. - -``validate_input`` is called on the first execution of ``ItemPage.to_item()`` -or the first access to a :ref:`field `. In both cases validation -
- -``validate_input`` is a synchronous method that expects no parameters, and its -outcome may be any of the following: - -- Return ``None``, indicating that the input is valid. - -.. _retries-input: - -- Raise :exc:`~web_poet.exceptions.Retry`, indicating that the input - looks like the result of a temporary issue, and that trying to fetch - similar input again may result in valid input. - - See also :ref:`retries-additional-requests`. - -- Raise :exc:`~web_poet.exceptions.UseFallback`, indicating that the - page object does not support the input, and that an alternative parsing - implementation should be tried instead. - - For example, imagine you have a page object for website commerce.example, - and that commerce.example is built with a popular e-commerce web framework. - You could have a generic page object for products of websites using that - framework, ``FrameworkProductPage``, and a more specific page object for - commerce.example, ``EcommerceExampleProductPage``. If - ``EcommerceExampleProductPage`` cannot parse a product page, but it looks - like it might be a valid product page, you would raise - :exc:`~web_poet.exceptions.UseFallback` to try to parse the same product - page with ``FrameworkProductPage``, in case it works. - - .. note:: web-poet does not dictate how to define or use an alternative - parsing implementation as fallback. It is up to web-poet - frameworks to choose how they implement fallback handling. - -- Return an item to override the output of the ``to_item`` method and of - fields. - - For input not matching the expected type of data, returning an item that - indicates so is recommended. - - For example, if your page object parses an e-commerce product, and the - input data corresponds to a list of products rather than a single product, - you could return a product item that somehow indicates that it is not a - valid product item, such as ``Product(is_valid=False)``. - -For example: - -.. 
code-block:: python - - def validate_input(self): - if self.css('.product-id::text') is not None: - return - if self.css('.http-503-error'): - raise Retry() - if self.css('.product'): - raise UseFallback() - if self.css('.product-list'): - return Product(is_valid=False) - -You may use fields in your implementation of the ``validate_input`` method, but -only synchronous fields are supported. For example: - -.. code-block:: python - - class Page(WebPage[Item]): - def validate_input(self): - if not self.name: - raise UseFallback() - - @field(cached=True) - def name(self): - return self.css(".product-name ::text") - -.. tip:: :ref:`Cache fields ` used in the ``validate_input`` - method, so that when they are used from ``to_item`` they are not - evaluated again. - -If you implement a custom ``to_item`` method, as long as you are inheriting -from :class:`~web_poet.pages.ItemPage`, you can enable input validation -decorating your custom ``to_item`` method with -:func:`~web_poet.util.validates_input`: - -.. code-block:: python - - from web_poet import validates_input - - class Page(ItemPage[Item]): - @validates_input - async def to_item(self): - ... - -:exc:`~web_poet.exceptions.Retry` and :exc:`~web_poet.exceptions.UseFallback` -may also be raised from the ``to_item`` method. This could come in handy, for -example, if after you execute some asynchronous code, such as an -:ref:`additional request `, you find out that you need to -retry the original request or use a fallback. - - -Input Validation Exceptions -=========================== - -.. autoexception:: web_poet.exceptions.PageObjectAction - -.. autoexception:: web_poet.exceptions.Retry - -.. 
autoexception:: web_poet.exceptions.UseFallback diff --git a/docs/page-objects/inputs.rst b/docs/page-objects/inputs.rst index cb6ff600..0ab3ce95 100644 --- a/docs/page-objects/inputs.rst +++ b/docs/page-objects/inputs.rst @@ -84,3 +84,121 @@ You may define your own input classes if you are using a :ref:`framework However, note that custom input classes may make your :ref:`page object classes ` less portable across frameworks. + + +.. _input-validation: + +Input validation +================ + +Sometimes the data that your page object receives as input may be invalid. + +You can define a ``validate_input`` method in a page object class to check its +input data and determine how to handle invalid input. + +``validate_input`` is called on the first execution of ``ItemPage.to_item()`` +or the first access to a :ref:`field `. In both cases validation +happens early; in the case of fields, it happens before field evaluation. + +``validate_input`` is a synchronous method that expects no parameters, and its +outcome may be any of the following: + +- Return ``None``, indicating that the input is valid. + +.. _retries-input: + +- Raise :exc:`~web_poet.exceptions.Retry`, indicating that the input + looks like the result of a temporary issue, and that trying to fetch + similar input again may result in valid input. + + See also :ref:`retries-additional-requests`. + +- Raise :exc:`~web_poet.exceptions.UseFallback`, indicating that the + page object does not support the input, and that an alternative parsing + implementation should be tried instead. + + For example, imagine you have a page object for website commerce.example, + and that commerce.example is built with a popular e-commerce web framework. + You could have a generic page object for products of websites using that + framework, ``FrameworkProductPage``, and a more specific page object for + commerce.example, ``EcommerceExampleProductPage``. 
If + ``EcommerceExampleProductPage`` cannot parse a product page, but it looks + like it might be a valid product page, you would raise + :exc:`~web_poet.exceptions.UseFallback` to try to parse the same product + page with ``FrameworkProductPage``, in case it works. + + .. note:: web-poet does not dictate how to define or use an alternative + parsing implementation as fallback. It is up to web-poet + frameworks to choose how they implement fallback handling. + +- Return an item to override the output of the ``to_item`` method and of + fields. + + For input not matching the expected type of data, returning an item that + indicates so is recommended. + + For example, if your page object parses an e-commerce product, and the + input data corresponds to a list of products rather than a single product, + you could return a product item that somehow indicates that it is not a + valid product item, such as ``Product(is_valid=False)``. + +For example: + +.. code-block:: python + + def validate_input(self): + if self.css('.product-id::text').get() is not None: + return + if self.css('.http-503-error'): + raise Retry() + if self.css('.product'): + raise UseFallback() + if self.css('.product-list'): + return Product(is_valid=False) + +You may use fields in your implementation of the ``validate_input`` method, but +only synchronous fields are supported. For example: + +.. code-block:: python + + class Page(WebPage[Item]): + def validate_input(self): + if not self.name: + raise UseFallback() + + @field(cached=True) + def name(self): + return self.css(".product-name ::text") + +.. tip:: :ref:`Cache fields ` used in the ``validate_input`` + method, so that when they are used from ``to_item`` they are not + evaluated again. + +If you implement a custom ``to_item`` method, as long as you are inheriting +from :class:`~web_poet.pages.ItemPage`, you can enable input validation +decorating your custom ``to_item`` method with +:func:`~web_poet.util.validates_input`: + +.. 
code-block:: python + + from web_poet import validates_input + + class Page(ItemPage[Item]): + @validates_input + async def to_item(self): + ... + +:exc:`~web_poet.exceptions.Retry` and :exc:`~web_poet.exceptions.UseFallback` +may also be raised from the ``to_item`` method. This could come in handy, for +example, if after you execute some asynchronous code, such as an +:ref:`additional request `, you find out that you need to +retry the original request or use a fallback. + +Input validation exceptions +--------------------------- + +.. autoexception:: web_poet.exceptions.PageObjectAction + +.. autoexception:: web_poet.exceptions.Retry + +.. autoexception:: web_poet.exceptions.UseFallback
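
Note on the ``return_exceptions=True`` pattern used in the batch request example of this patch: the following is a minimal, standard-library-only sketch of the same idea, using :func:`asyncio.gather`, which accepts a flag of the same name. It is not web-poet code. ``FakeHttpError``, ``fetch_color`` and ``fetch_colors`` are hypothetical names made up for illustration, and the simulated failure stands in for a request that raises ``HttpError``.

```python
import asyncio


class FakeHttpError(Exception):
    """Hypothetical stand-in for web_poet's HttpError (assumption)."""


async def fetch_color(color: str) -> str:
    # Hypothetical request: fails for one input to simulate a bad response.
    await asyncio.sleep(0)
    if color == "broken":
        raise FakeHttpError(color)
    return color


async def fetch_colors(colors):
    # With return_exceptions=True, failures are returned in place of results
    # instead of being raised, so the successful results are not lost.
    results = await asyncio.gather(
        *(fetch_color(color) for color in colors),
        return_exceptions=True,
    )
    # Keep only the successful results, preserving the input order.
    return [result for result in results if not isinstance(result, FakeHttpError)]


def run_example():
    return asyncio.run(fetch_colors(["red", "broken", "blue"]))
```

Filtering with ``isinstance`` mirrors the ``if not isinstance(response, HttpError)`` check in the patch. A real page object would probably also log or count the dropped responses rather than discard them silently.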