Documentation improvements for overrides (apply rules) #90

Merged
merged 6 commits into from
Oct 27, 2022
119 changes: 106 additions & 13 deletions docs/intro/overrides.rst
Original file line number Diff line number Diff line change
@@ -1,23 +1,116 @@
.. _intro-overrides:

Apply Rules
===========

Overview
--------

@handle_urls
~~~~~~~~~~~~

web-poet provides the :func:`~.handle_urls` decorator, which lets you
declare how page objects can be used (applied):

* for which websites / URL patterns they work,
* which data types (item classes) they can return,
* which page objects they can replace (override; more on this later).

.. code-block:: python

    from web_poet import ItemPage, handle_urls
    from my_items import MyItem

    @handle_urls("example.com")
    class MyPage(ItemPage[MyItem]):
        ...  # page object implementation goes here


``handle_urls("example.com")`` can serve as documentation, but it also makes
the information about page objects available programmatically.
The information about all page objects decorated with
:func:`~.handle_urls` is stored in ``web_poet.default_registry``, which is
an instance of :class:`~.PageObjectRegistry`. In the example above the
following :class:`~.ApplyRule` is added to the registry:

.. code-block::

    ApplyRule(
        for_patterns=Patterns(include=('example.com',), exclude=(), priority=500),
        use=<class 'MyPage'>,
        instead_of=None,
        to_return=<class 'MyItem'>,
        meta={}
    )

Note how ``rule.to_return`` is set to ``MyItem`` automatically.
This can be used by libraries like `scrapy-poet`_. For example,
if a spider needs to extract ``MyItem`` from some page on the ``example.com``
website, `scrapy-poet`_ now knows that the ``MyPage`` page object can be used.

.. _scrapy-poet: https://scrapy-poet.readthedocs.io
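
To illustrate how a library might consume such rules, here is a minimal
sketch that uses only the standard library. ``Rule``, ``find_page_for`` and
the simple host comparison are hypothetical stand-ins introduced for this
example; they are not the API of web-poet or scrapy-poet, where the registry
stores :class:`~.ApplyRule` instances and URL matching is delegated to a
separate library.

```python
from dataclasses import dataclass
from typing import List, Optional
from urllib.parse import urlparse


class MyItem:
    """Stand-in item class."""


class MyPage:
    """Stand-in page object that would return MyItem."""


@dataclass(frozen=True)
class Rule:
    # Hypothetical, simplified counterpart of ApplyRule.
    domain: str
    use: type
    to_return: Optional[type] = None


def find_page_for(rules: List[Rule], url: str, item_cls: type) -> Optional[type]:
    """Return the first page object class declared to return item_cls on url."""
    host = urlparse(url).hostname or ""
    for rule in rules:
        # Assumption: a rule's domain also covers its subdomains.
        domain_ok = host == rule.domain or host.endswith("." + rule.domain)
        if domain_ok and rule.to_return is item_cls:
            return rule.use
    return None


RULES = [Rule(domain="example.com", use=MyPage, to_return=MyItem)]

page_cls = find_page_for(RULES, "https://example.com/some/page", MyItem)
print(page_cls.__name__)  # MyPage
```

The point of the sketch is the lookup direction: the caller starts from a URL
and the item class it wants, and the rules tell it which page object to use.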

Specifying the URL patterns
~~~~~~~~~~~~~~~~~~~~~~~~~~~

The :func:`~.handle_urls` decorator uses the url-matcher_ library to define
the URL rules. Some examples:

.. code-block:: python

    # page object can be applied on any URL from the example.com domain,
    # or from any of its subdomains
    @handle_urls("example.com")

    # page object can be applied on example.com pages under /products/ path
    @handle_urls("example.com/products/")

    # page object can be applied on any URL from example.com, but only if
    # it contains "productId=..." in the query string
    @handle_urls("example.com?productId=*")

See the url-matcher_ documentation for more details; the pattern language is
quite flexible. It is possible to exclude patterns, use wildcards, require
certain query parameters to be present while ignoring others, and so on.
Unlike regexes, this mini-language "understands" the URL structure.

.. _url-matcher: https://url-matcher.readthedocs.io
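
What "understanding the URL structure" means can be sketched with a toy
matcher built on ``urllib.parse``. This is only an illustration covering the
three patterns shown above: ``simple_match`` is a hypothetical helper, and it
does not reproduce the actual url-matcher rules, which are richer (wildcards,
exclusions, priorities).

```python
from urllib.parse import parse_qs, urlparse


def simple_match(pattern: str, url: str) -> bool:
    """Toy matcher: domain (plus subdomains), path prefix, required query keys."""
    # Patterns carry no scheme, so prefix "//" to let urlparse split them.
    pat = urlparse("//" + pattern)
    target = urlparse(url)

    host = target.hostname or ""
    pat_host = pat.hostname or ""
    # The domain itself or any of its subdomains is accepted.
    if host != pat_host and not host.endswith("." + pat_host):
        return False
    # The URL path must start with the pattern path (an empty path matches all).
    if not target.path.startswith(pat.path):
        return False
    # Every query parameter named in the pattern must be present in the URL.
    required = parse_qs(pat.query, keep_blank_values=True)
    present = parse_qs(target.query, keep_blank_values=True)
    return all(key in present for key in required)


print(simple_match("example.com", "https://shop.example.com/a"))  # True
print(simple_match("example.com/products/", "https://example.com/about"))  # False
print(simple_match("example.com?productId=*", "https://example.com/p?productId=42"))  # True
```

Because the URL is split into host, path and query before comparing, a host
such as ``notexample.com`` is not mistaken for ``example.com``, which is easy
to get wrong with a naive regex or substring check.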

Overrides
~~~~~~~~~

:func:`~.handle_urls` can be used to declare that a particular Page Object
could (and should) be used *instead of* some other Page Object on
certain URL patterns:

.. code-block:: python

    from web_poet import ItemPage, handle_urls
    from my_items import Product
    from my_pages import DefaultProductPage

    @handle_urls("site1.example.com", instead_of=DefaultProductPage)
    class Site1ProductPage(ItemPage[Product]):
        ...  # site1-specific extraction goes here

    @handle_urls("site2.example.com", instead_of=DefaultProductPage)
    class Site2ProductPage(ItemPage[Product]):
        ...  # site2-specific extraction goes here

Overrides are rules, represented by a list of :class:`~.ApplyRule` instances,
which associate URL patterns with the Page Object (see :ref:`Page Objects
introduced here <from-ground-up>`) to be used for them. URL matching is
handled by another library called `url-matcher <https://url-matcher.readthedocs.io>`_.
This concept is a bit more advanced than the basic ``handle_urls`` usage
("this Page Object can return MyItem on example.com website").

Using such rules establishes the core concept of Overrides wherein a developer
could declare that for a given set of URL patterns, a specific Page Object must
be used instead of another Page Object.
A common use case is a "generic", or "template", spider, which uses some
default implementation of the extraction and allows replacing it
("overriding") on specific websites or URL patterns.

The :class:`~.ApplyRule` also supports pointing to the item returned by a specific
Page Object if it matches both the URL pattern and the item class specified in the
rule.
This default (``DefaultProductPage`` in the example) can be based on
semantic markup, Machine Learning, heuristics, or just be empty. Page Objects which
can be used instead of the default (``Site1ProductPage``, ``Site2ProductPage``)
are commonly written using XPath or CSS selectors, with website-specific rules.

This enables **web-poet** to be used effectively by other frameworks like
`scrapy-poet <https://scrapy-poet.readthedocs.io>`_.
Libraries like scrapy-poet_ make it possible to create such "generic" spiders
by using the information declared via ``handle_urls(..., instead_of=...)``.
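
The override lookup that such a "generic" spider relies on can be sketched as
follows. ``OverrideRule`` and ``resolve`` are hypothetical names used only for
this illustration, the host comparison is a simplification of real URL
matching, and picking the highest ``priority`` among matching rules is an
assumption based on the priorities stored in ``for_patterns``; this is not the
actual implementation in web-poet or scrapy-poet.

```python
from dataclasses import dataclass
from typing import List
from urllib.parse import urlparse


class DefaultProductPage:
    """Stand-in for the generic page object."""


class Site1ProductPage:
    """Stand-in for a site-specific page object."""


class Site2ProductPage:
    """Stand-in for another site-specific page object."""


@dataclass(frozen=True)
class OverrideRule:
    # Hypothetical, simplified counterpart of ApplyRule.
    domain: str
    use: type
    instead_of: type
    priority: int = 500


def resolve(rules: List[OverrideRule], url: str, requested: type) -> type:
    """Return the page object to use for url when `requested` was asked for."""
    host = urlparse(url).hostname or ""
    candidates = [
        rule
        for rule in rules
        if rule.instead_of is requested
        and (host == rule.domain or host.endswith("." + rule.domain))
    ]
    if not candidates:
        # No override applies, so the default page object is kept.
        return requested
    # Assumption: among matching rules, the highest priority wins.
    return max(candidates, key=lambda rule: rule.priority).use


RULES = [
    OverrideRule("site1.example.com", Site1ProductPage, DefaultProductPage),
    OverrideRule("site2.example.com", Site2ProductPage, DefaultProductPage),
]

print(resolve(RULES, "https://site1.example.com/p/1", DefaultProductPage).__name__)
# Site1ProductPage
print(resolve(RULES, "https://unknown.example.org/p/1", DefaultProductPage).__name__)
# DefaultProductPage
```

The spider code only ever asks for ``DefaultProductPage``; which class ends up
being used depends on the URL and the declared rules.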

Example Use Case
----------------
2 changes: 1 addition & 1 deletion docs/requirements.txt
Original file line number Diff line number Diff line change
@@ -1,2 +1,2 @@
Sphinx==5.0.1
Sphinx==5.3.0
sphinx-rtd-theme==1.0.0
64 changes: 15 additions & 49 deletions web_poet/rules.py
Original file line number Diff line number Diff line change
@@ -40,61 +40,27 @@ class ApplyRule:
pattern represented by the ``for_patterns`` attribute is matched.
* ``instead_of`` - *(optional)* The Page Object that will be **replaced**
with the Page Object specified via the ``use`` parameter.
* ``to_return`` - *(optional)* The item class that marks the Page Object
to be **used** which is capable of returning that item class.
* ``to_return`` - *(optional)* The item class which the **used**
Page Object is capable of returning.
* ``meta`` - *(optional)* Any other information you may want to store.
This doesn't do anything for now but may be useful for future API updates.

The main functionality of this class lies in the ``instead_of`` and ``to_return``
parameters. Should both of these be omitted, then :class:`~.ApplyRule` simply
tags which URL patterns the given Page Object defined in ``use`` is expected
to be used. It works as:

1. Given a URL, match it against the ``for_patterns`` from the registry
rules.
2. This could give us a collection of rules. We need to select one based
on the highest priority set by `url-matcher`_.
3. When a single rule has been selected, use the Page Object specified
in its ``use`` parameter.

If ``instead_of=None``, this simply means that the Page Object assigned in
the ``use`` parameter will be utilized for all URLs matching the URL pattern
in ``for_patterns``. However, if ``instead_of=ReplacedPageObject``, then it
adds the expectation that the ``ReplacedPageObject`` wouldn't be used for
the given URLs matching ``for_patterns`` since the Page Object in ``use``
will replace it. It works as:

1. Suppose that we have a rule that has ``use=ReplacedPageObject`` which
we want to use against a URL that matches against ``for_patterns``.
2. Before using it, all of the rules from the registry must be checked to see
whether other rules have ``instead_of=ReplacedPageObject`` and match the
URL patterns in ``for_patterns``.
3. If there are, these rules supersede the original rule from #1.
4. After selecting one based on the highest priority set by `url-matcher`_,
the Page Object declared in ``use`` should be used instead of
``ReplacedPageObject``.

The ``to_return`` parameter should capture the item class that the Page Object
is capable of returning. Before passing it to :class:`~.ApplyRule`, the
``to_return`` value is primarily derived from the return class specified
from Page Objects that are subclasses of :class:`~.ItemPage` (see this
:ref:`example <item-class-example>`). However, a special case exists when a
Page Object returns a ``dict`` as an item but then the rule should have
``to_return=None`` and **NOT** ``to_return=dict``.

The ``to_return`` parameter is used as a shortcut to directly retrieve the
item from the Page Object to be used for a given URL. It works as:

1. Given a URL and an item class that we want, match them respectively
against ``for_patterns`` and ``to_return`` from the registry rules.
2. This could give us a collection of rules. We need to select one based
on the highest priority set by `url-matcher`_.
3. When a single rule has been selected, create an instance of the Page
Object specified in its ``use`` parameter.
4. Finally, call the ``.to_item()`` method of the Page Object to retrieve
an instance of the item class.

Using the ``to_return`` parameter basically adds the convenient step #4 above.
to be used on.

When ``to_return`` is not None (e.g. ``to_return=MyItem``),
the Page Object in ``use`` is declared as capable of returning a certain
item class (``MyItem``).

When ``instead_of`` is not None (e.g. ``instead_of=ReplacedPageObject``),
the rule adds an expectation that the ``ReplacedPageObject`` wouldn't
be used for the URLs matching ``for_patterns``, since the Page Object
in ``use`` will replace it.

If there are multiple rules which match a certain URL, the rule
to apply is picked based on the priorities set in ``for_patterns``.

More information regarding its usage in :ref:`intro-overrides`.
