Merge pull request #90 from scrapinghub/handle_urls-docs
Documentation improvements for overrides (apply rules)
kmike authored Oct 27, 2022
2 parents 383b4f7 + 3e23c51 commit 2c570e2
Showing 3 changed files with 125 additions and 63 deletions.
121 changes: 108 additions & 13 deletions docs/intro/overrides.rst
@@ -1,23 +1,118 @@
.. _intro-overrides:

Apply Rules
===========

Overview
--------

@handle_urls
~~~~~~~~~~~~

web-poet provides a :func:`~.handle_urls` decorator, which lets you
declare how page objects can be used (applied):

* for which websites / URL patterns they work,
* which data types (item classes) they can return,
* which page objects they can replace (override; more on this later).

.. code-block:: python

    from web_poet import ItemPage, handle_urls
    from my_items import MyItem

    @handle_urls("example.com")
    class MyPage(ItemPage[MyItem]):
        # ...

``handle_urls("example.com")`` can serve as documentation, but it also enables
getting the information about page objects programmatically.
The information about all page objects decorated with
:func:`~.handle_urls` is stored in ``web_poet.default_registry``, which is
an instance of :class:`~.PageObjectRegistry`. In the example above, the
following :class:`~.ApplyRule` is added to the registry:

.. code-block::

    ApplyRule(
        for_patterns=Patterns(include=('example.com',), exclude=(), priority=500),
        use=<class 'MyPage'>,
        instead_of=None,
        to_return=<class 'my_items.MyItem'>,
        meta={}
    )

Note how ``rule.to_return`` is set to ``MyItem`` automatically.
This can be used by libraries like `scrapy-poet`_. For example,
if a spider needs to extract ``MyItem`` from some page on the ``example.com``
website, `scrapy-poet`_ now knows that the ``MyPage`` page object can be used.

.. _scrapy-poet: https://scrapy-poet.readthedocs.io
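
To make the registration step more concrete, here is a minimal, self-contained
sketch of how a decorator could record such a rule and derive ``to_return``
from the generic parameter of the base class. It does not import web-poet:
``ToyItemPage``, ``ToyRule``, ``toy_registry``, ``item_class_of`` and
``toy_handle_urls`` are hypothetical names used only for illustration, and the
way web-poet actually derives ``to_return`` may differ.

```python
from dataclasses import dataclass
from typing import Generic, Optional, TypeVar, get_args

ItemT = TypeVar("ItemT")


class ToyItemPage(Generic[ItemT]):
    """Hypothetical stand-in for web-poet's ItemPage."""


@dataclass(frozen=True)
class ToyRule:
    """Hypothetical, simplified stand-in for ApplyRule."""
    pattern: str
    use: type
    instead_of: Optional[type] = None
    to_return: Optional[type] = None


# Hypothetical registry: a plain list of rules.
toy_registry = []


def item_class_of(page_cls):
    """Return the item class from ToyItemPage[Item], or None if absent."""
    for base in getattr(page_cls, "__orig_bases__", ()):
        args = get_args(base)
        if args:
            return args[0]
    return None


def toy_handle_urls(pattern, *, instead_of=None):
    """Record a ToyRule for the decorated class; return the class unchanged."""
    def decorator(cls):
        toy_registry.append(
            ToyRule(pattern, cls, instead_of, item_class_of(cls))
        )
        return cls
    return decorator


class MyItem:
    pass


@toy_handle_urls("example.com")
class MyPage(ToyItemPage[MyItem]):
    pass


rule = toy_registry[0]
print(rule.use.__name__, rule.to_return.__name__)
```

The decorator returns the class unchanged, so decorating a page object only
adds metadata to the registry and does not alter how the class behaves.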

Specifying the URL patterns
~~~~~~~~~~~~~~~~~~~~~~~~~~~

The :func:`~.handle_urls` decorator uses the url-matcher_ library to define the
URL rules. Some examples:

.. code-block:: python

    # page object can be applied on any URL from the example.com domain,
    # or from any of its subdomains
    @handle_urls("example.com")

    # page object can be applied on example.com pages under /products/ path
    @handle_urls("example.com/products/")

    # page object can be applied on any URL from example.com, but only if
    # it contains "productId=..." in the query string
    @handle_urls("example.com?productId=*")

The string passed to :func:`~.handle_urls` is converted to
a :class:`url_matcher.matcher.Patterns` instance. Please consult
the url-matcher_ documentation to learn more about the possible rules;
it is pretty flexible. You can exclude patterns, use wildcards,
require certain query parameters to be present and ignore others, etc.
Unlike regexes, this mini-language "understands" the URL structure.

.. _url-matcher: https://url-matcher.readthedocs.io
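
As a rough illustration of what "understanding the URL structure" means, the
sketch below approximates the three pattern shapes shown above using only the
standard library. It is not url-matcher's algorithm: ``toy_matches`` is a
hypothetical helper that ignores wildcards, exclusions and priorities, and it
only checks that a query parameter is present, not its value.

```python
from urllib.parse import parse_qs, urlsplit


def toy_matches(pattern: str, url: str) -> bool:
    """Very rough approximation of domain, path and query-parameter rules."""
    # Patterns carry no scheme; prefix "//" so urlsplit sees a network location.
    pat = urlsplit("//" + pattern)
    target = urlsplit(url)
    pat_host = pat.hostname or ""
    host = target.hostname or ""

    # Domain rule: the exact domain or any of its subdomains.
    if host != pat_host and not host.endswith("." + pat_host):
        return False

    # Path rule: the URL path must start with the pattern path, if any.
    if pat.path and not target.path.startswith(pat.path):
        return False

    # Query rule: every parameter named in the pattern must be present.
    wanted = parse_qs(pat.query, keep_blank_values=True)
    present = parse_qs(target.query, keep_blank_values=True)
    return all(name in present for name in wanted)


print(toy_matches("example.com", "https://shop.example.com/a"))
print(toy_matches("example.com/products/", "https://example.com/about"))
```

Because the URL is parsed into host, path and query before comparison, a host
such as ``notexample.com`` does not match ``example.com``, which a naive
substring or regex check could easily get wrong.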

Overrides
~~~~~~~~~

:func:`~.handle_urls` can be used to declare that a particular Page Object
could (and should) be used *instead of* some other Page Object on
certain URL patterns:

.. code-block:: python

    from web_poet import ItemPage, handle_urls
    from my_items import Product
    from my_pages import DefaultProductPage

    @handle_urls("site1.example.com", instead_of=DefaultProductPage)
    class Site1ProductPage(ItemPage[Product]):
        # ...

    @handle_urls("site2.example.com", instead_of=DefaultProductPage)
    class Site2ProductPage(ItemPage[Product]):
        # ...

Overrides are rules represented by a list of :class:`~.ApplyRule` objects, which
associate URL patterns with the Page Object (see :ref:`Page Objects
introduced here <from-ground-up>`) to be used for them. The URL matching rules are
handled by another library called `url-matcher <https://url-matcher.readthedocs.io>`_.
This concept is a bit more advanced than the basic ``handle_urls`` usage
("this Page Object can return ``MyItem`` on example.com website").

Using such rules establishes the core concept of Overrides wherein a developer
could declare that for a given set of URL patterns, a specific Page Object must
be used instead of another Page Object.
A common use case is a "generic", or a "template" spider, which uses some
default implementation of the extraction, and allows replacing it
("overriding") on specific websites or URL patterns.

The :class:`~.ApplyRule` also supports pointing to the item returned by a specific
Page Object if it matches both the URL pattern and the item class specified in the
rule.
This default page extraction (``DefaultProductPage`` in the example) can be based on
semantic markup, Machine Learning, heuristics, or just be empty. Page Objects which
can be used instead of the default (``Site1ProductPage``, ``Site2ProductPage``)
are commonly written using XPath or CSS selectors, with website-specific rules.

This enables **web-poet** to be used effectively by other frameworks like
`scrapy-poet <https://scrapy-poet.readthedocs.io>`_.
Libraries like scrapy-poet_ make it possible to create such "generic" spiders by
using the information declared via ``handle_urls(..., instead_of=...)``.
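
The following sketch shows, under simplifying assumptions, how a framework
might consult such ``instead_of`` declarations when deciding which page object
to use for a URL. ``ToyRule``, ``RULES`` and ``resolve`` are hypothetical names,
the matching is reduced to an exact hostname comparison, and none of this is
scrapy-poet's actual implementation.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlsplit


@dataclass(frozen=True)
class ToyRule:
    """Hypothetical, simplified stand-in for ApplyRule."""
    domain: str
    use: type
    instead_of: Optional[type] = None


class DefaultProductPage: ...
class Site1ProductPage: ...
class Site2ProductPage: ...


# Rules equivalent to the two handle_urls(..., instead_of=...) declarations.
RULES = [
    ToyRule("site1.example.com", Site1ProductPage, instead_of=DefaultProductPage),
    ToyRule("site2.example.com", Site2ProductPage, instead_of=DefaultProductPage),
]


def resolve(page_cls: type, url: str) -> type:
    """Return an overriding class for this URL, or page_cls if none applies."""
    host = urlsplit(url).hostname or ""
    for rule in RULES:
        if rule.instead_of is page_cls and host == rule.domain:
            return rule.use
    return page_cls


print(resolve(DefaultProductPage, "https://site1.example.com/p/1").__name__)
print(resolve(DefaultProductPage, "https://other.example.com/p/1").__name__)
```

A spider written against ``DefaultProductPage`` therefore needs no changes when
a site-specific page object is added; only a new rule has to be registered.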

Example Use Case
----------------
2 changes: 1 addition & 1 deletion docs/requirements.txt
@@ -1,2 +1,2 @@
Sphinx==5.0.1
Sphinx==5.3.0
sphinx-rtd-theme==1.0.0
65 changes: 16 additions & 49 deletions web_poet/rules.py
@@ -40,61 +40,28 @@ class ApplyRule:
pattern represented by the ``for_patterns`` attribute is matched.
* ``instead_of`` - *(optional)* The Page Object that will be **replaced**
with the Page Object specified via the ``use`` parameter.
* ``to_return`` - *(optional)* The item class that marks the Page Object
to be **used** which is capable of returning that item class.
* ``to_return`` - *(optional)* The item class which the **used**
* ``to_return`` - *(optional)* The item class that the Page Object specified
in ``use`` is capable of returning.
* ``meta`` - *(optional)* Any other information you may want to store.
This doesn't do anything for now but may be useful for future API updates.
The main functionality of this class lies in the ``instead_of`` and ``to_return``
parameters. Should both of these be omitted, then :class:`~.ApplyRule` simply
tags which URL patterns the given Page Object defined in ``use`` is expected
to be used. It works as:
1. Given a URL, match it against the ``for_patterns`` from the registry
rules.
2. This could give us a collection of rules. We need to select one based
on the highest priority set by `url-matcher`_.
3. When a single rule has been selected, use the Page Object specified
in its ``use`` parameter.
If ``instead_of=None``, this simply means that the Page Object assigned in
the ``use`` parameter will be utilized for all URLs matching the URL pattern
in ``for_patterns``. However, if ``instead_of=ReplacedPageObject``, then it
adds the expectation that the ``ReplacedPageObject`` wouldn't be used for
the given URLs matching ``for_patterns`` since the Page Object in ``use``
will replace it. It works as:
1. Suppose that we have a rule that has ``use=ReplacedPageObject`` which
we want to use against a URL that matches against ``for_patterns``.
2. Before using it, all of the rules from the registry must be checked to see
whether other rules have ``instead_of=ReplacedPageObject`` and match the
URL patterns in ``for_patterns``.
3. If there are, these rules supersede the original rule from #1.
4. After selecting one based on the highest priority set by `url-matcher`_,
the Page Object declared in ``use`` should be used instead of
``ReplacedPageObject``.
The ``to_return`` parameter should capture the item class that the Page Object
is capable of returning. Before passing it to :class:`~.ApplyRule`, the
``to_return`` value is primarily derived from the return class specified
from Page Objects that are subclasses of :class:`~.ItemPage` (see this
:ref:`example <item-class-example>`). However, a special case exists when a
Page Object returns a ``dict`` as an item but then the rule should have
``to_return=None`` and **NOT** ``to_return=dict``.
The ``to_return`` parameter is used as a shortcut to directly retrieve the
item from the Page Object to be used for a given URL. It works as:
1. Given a URL and an item class that we want, match it respectively
against ``for_patterns`` and ``to_return`` from the registry rules.
2. This could give us a collection of rules. We need to select one based
on the highest priority set by `url-matcher`_.
3. When a single rule has been selected, create an instance of the Page
Object specified in its ``use`` parameter.
4. Finally, call the ``.to_item()`` method of the Page Object to retrieve
an instance of the item class.
Using the ``to_return`` parameter basically adds the convenient step #4 above.
to be used on.
When ``to_return`` is not None (e.g. ``to_return=MyItem``),
the Page Object in ``use`` is declared as capable of returning a certain
item class (i.e. ``MyItem``).
When ``instead_of`` is not None (e.g. ``instead_of=ReplacedPageObject``),
the rule adds an expectation that the ``ReplacedPageObject`` wouldn't
be used for the URLs matching ``for_patterns``, since the Page Object
in ``use`` will replace it.
If there are multiple rules which match a certain URL, the rule
to apply is picked based on the priorities set in ``for_patterns``.
More information regarding its usage in :ref:`intro-overrides`.
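
The priority-based selection described in the docstring can be sketched as
follows. This is an illustration only: ``ToyRule`` and ``pick_rule`` are
hypothetical names, matching is reduced to a path-prefix check, and the sketch
assumes that the rule with the highest priority value wins, with 500 as the
default shown in the ``Patterns`` example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyRule:
    """Hypothetical, simplified stand-in for ApplyRule."""
    prefix: str
    use: str
    priority: int = 500  # default taken from the Patterns repr shown earlier


def pick_rule(rules, url_path):
    """Return the matching rule with the highest priority, or None."""
    matching = [r for r in rules if url_path.startswith(r.prefix)]
    if not matching:
        return None
    # max() keeps the first rule encountered when priorities are tied.
    return max(matching, key=lambda r: r.priority)


rules = [
    ToyRule("/", "GenericPage"),
    ToyRule("/products/", "ProductPage", priority=600),
]

print(pick_rule(rules, "/products/42").use)
print(pick_rule(rules, "/about").use)
```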