Merge pull request #90 from scrapinghub/handle_urls-docs

Documentation improvements for overrides (apply rules)
scrapinghub · Oct 27, 2022 · 2c570e2
2 parents 383b4f7 + 3e23c51
Showing 3 changed files with 125 additions and 63 deletions.
diff --git a/docs/intro/overrides.rst b/docs/intro/overrides.rst
@@ -1,23 +1,118 @@
 .. _intro-overrides:
 
+Apply Rules
+===========
+
+Overview
+--------
+
+@handle_urls
+~~~~~~~~~~~~
+
+web-poet provides a :func:`~.handle_urls` decorator, which lets you
+declare how page objects can be used (applied):
+
+* for which websites / URL patterns they work,
+* which data types (item classes) they can return,
+* which page objects they can replace (override; more on this later).
+
+.. code-block:: python
+
+    from web_poet import ItemPage, handle_urls
+    from my_items import MyItem
+
+    @handle_urls("example.com")
+    class MyPage(ItemPage[MyItem]):
+        # ...
+
+
+``handle_urls("example.com")`` can serve as documentation, but it also makes
+information about page objects available programmatically.
+The information about all page objects decorated with
+:func:`~.handle_urls` is stored in ``web_poet.default_registry``, which is
+an instance of :class:`~.PageObjectRegistry`. In the example above, the
+following :class:`~.ApplyRule` is added to the registry:
+
+.. code-block::
+
+    ApplyRule(
+        for_patterns=Patterns(include=('example.com',), exclude=(), priority=500),
+        use=<class 'MyPage'>,
+        instead_of=None,
+        to_return=<class 'my_items.MyItem'>,
+        meta={}
+    )
+
+Note how ``rule.to_return`` is set to ``MyItem`` automatically.
+This can be used by libraries like `scrapy-poet`_. For example,
+if a spider needs to extract ``MyItem`` from some page on the ``example.com``
+website, `scrapy-poet`_ now knows that the ``MyPage`` page object can be used.
+
+.. _scrapy-poet: https://scrapy-poet.readthedocs.io
+
+Specifying the URL patterns
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The :func:`~.handle_urls` decorator uses the url-matcher_ library to define
+URL rules. Some examples:
+
+.. code-block:: python
+
+    # page object can be applied on any URL from the example.com domain,
+    # or from any of its subdomains
+    @handle_urls("example.com")
+
+    # page object can be applied on example.com pages under /products/ path
+    @handle_urls("example.com/products/")
+
+    # page object can be applied on any URL from example.com, but only if
+    # it contains "productId=..." in the query string
+    @handle_urls("example.com?productId=*")
+
+The string passed to :func:`~.handle_urls` is converted to
+a :class:`url_matcher.matcher.Patterns` instance. See the
+url-matcher_ documentation to learn more about the possible rules;
+they are quite flexible. You can exclude patterns, use wildcards,
+require certain query parameters to be present and ignore others, etc.
+Unlike regexes, this mini-language "understands" the URL structure.
+
+.. _url-matcher: https://url-matcher.readthedocs.io
+
 Overrides
-=========
+~~~~~~~~~
+
+:func:`~.handle_urls` can be used to declare that a particular Page Object
+could (and should) be used *instead of* some other Page Object on
+certain URL patterns:
+
+.. code-block:: python
+
+    from web_poet import ItemPage, handle_urls
+    from my_items import Product
+    from my_pages import DefaultProductPage
+
+    @handle_urls("site1.example.com", instead_of=DefaultProductPage)
+    class Site1ProductPage(ItemPage[Product]):
+        # ...
+
+    @handle_urls("site2.example.com", instead_of=DefaultProductPage)
+    class Site2ProductPage(ItemPage[Product]):
+        # ...
 
-Overrides are rules represented by a list of :class:`~.ApplyRule` which
-associates which URL patterns a particular Page Object (see :ref:`Page Objects
-introduced here <from-ground-up>`) would be used. The URL matching rules is
-handled by another library called `url-matcher <https://url-matcher.readthedocs.io>`_.
+This concept is a bit more advanced than the basic ``handle_urls`` usage
+("this Page Object can return ``MyItem`` on the example.com website").
 
-Using such rules establishes the core concept of Overrides wherein a developer
-could declare that for a given set of URL patterns, a specific Page Object must
-be used instead of another Page Object.
+A common use case is a "generic" or "template" spider, which uses some
+default implementation of the extraction and allows replacing it
+("overriding") on specific websites or URL patterns.
 
-The :class:`~.ApplyRule` also supports pointing to the item returned by a specific
-Page Object if it both matches the URL pattern and the item class specified in the
-rule.
+This default page extraction (``DefaultProductPage`` in the example) can be based on
+semantic markup, Machine Learning, heuristics, or just be empty. Page Objects which
+can be used instead of the default (``Site1ProductPage``, ``Site2ProductPage``)
+are commonly written using XPath or CSS selectors, with website-specific rules.
 
-This enables **web-poet** to be used effectively by other frameworks like 
-`scrapy-poet <https://scrapy-poet.readthedocs.io>`_.
+Libraries like scrapy-poet_ make it possible to create such "generic" spiders by
+using the information declared via ``handle_urls(..., instead_of=...)``.
 
 Example Use Case
 ----------------
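
Review note on the Overrides section above: the idea of replacing a default page object on certain sites can be sketched without web-poet at all. The snippet below is only an illustration under stated assumptions. ``SimpleRule``, ``domain_matches`` and ``resolve_override`` are hypothetical names, not part of web-poet or scrapy-poet, and the plain hostname comparison stands in for the much richer matching that url-matcher performs.

```python
from dataclasses import dataclass
from typing import List, Optional
from urllib.parse import urlparse


@dataclass(frozen=True)
class SimpleRule:
    """Hypothetical, simplified stand-in for an apply rule."""
    domain: str                       # stands in for ``for_patterns``
    use: type                         # page object to use
    instead_of: Optional[type] = None # page object being replaced
    priority: int = 500               # same default as shown in the docs


def domain_matches(url: str, domain: str) -> bool:
    """True if the URL host is ``domain`` or one of its subdomains."""
    host = urlparse(url).hostname or ""
    return host == domain or host.endswith("." + domain)


def resolve_override(rules: List[SimpleRule], url: str, default: type) -> type:
    """Return the page object to use in place of ``default`` for ``url``."""
    candidates = [
        rule for rule in rules
        if rule.instead_of is default and domain_matches(url, rule.domain)
    ]
    if not candidates:
        # No override declared for this URL: keep the default page object.
        return default
    # Assumption: the matching rule with the highest priority wins.
    return max(candidates, key=lambda rule: rule.priority).use


# Placeholder page objects mirroring the names used in the docs example.
class DefaultProductPage: ...
class Site1ProductPage: ...
class Site2ProductPage: ...


RULES = [
    SimpleRule("site1.example.com", Site1ProductPage, instead_of=DefaultProductPage),
    SimpleRule("site2.example.com", Site2ProductPage, instead_of=DefaultProductPage),
]
```

With these rules, ``resolve_override(RULES, "https://site1.example.com/p/1", DefaultProductPage)`` selects ``Site1ProductPage``, while a URL on any other host falls back to ``DefaultProductPage``.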

diff --git a/docs/requirements.txt b/docs/requirements.txt
@@ -1,2 +1,2 @@
-Sphinx==5.0.1
+Sphinx==5.3.0
 sphinx-rtd-theme==1.0.0
diff --git a/web_poet/rules.py b/web_poet/rules.py
@@ -40,61 +40,28 @@ class ApplyRule:
           pattern represented by the ``for_patterns`` attribute is matched.
         * ``instead_of`` - *(optional)* The Page Object that will be **replaced**
           with the Page Object specified via the ``use`` parameter.
-        * ``to_return`` - *(optional)* The item class that marks the Page Object
-          to be **used** which is capable of returning that item class.
+        * ``to_return`` - *(optional)* The item class that the Page Object specified
+          in ``use`` is capable of returning.
         * ``meta`` - *(optional)* Any other information you may want to store.
           This doesn't do anything for now but may be useful for future API updates.
 
     The main functionality of this class lies in the ``instead_of`` and ``to_return``
     parameters. Should both of these be omitted, then :class:`~.ApplyRule` simply
     tags which URL patterns the given Page Object defined in ``use`` is expected
-    to be used. It works as:
-
-        1. Given a URL, match it against the ``for_patterns`` from the registry
-           rules.
-        2. This could give us a collection of rules. We need to select one based
-           on the highest priority set by `url-matcher`_.
-        3. When a single rule has been selected, use the the Page Object specified
-           in its ``use`` parameter.
-
-    If ``instead_of=None``, this simply means that the Page Object assigned in
-    the ``use`` parameter will be utilized for all URLs matching the URL pattern
-    in ``for_patterns``. However, if ``instead_of=ReplacedPageObject``, then it
-    adds the expectation that the ``ReplacedPageObject`` wouldn't be used for
-    the given URLs matching ``for_patterns`` since the Page Object in ``use``
-    will replace it. It works as:
-
-        1. Suppose that we have a rule that has ``use=ReplacedPageObject`` which
-           we want to use against a URL that matches against ``for_patterns``.
-        2. Before using it, all of the rules from the registry must be checked if
-           other rules has ``instead_of=ReplacedPageObject`` and matches the
-           URL patterns in ``for_patterns``.
-        3. If there are, these rules supersedes the original rule from #1.
-        4. After selecting one based on the highest priority set by `url-matcher`_,
-           the Page Object declared in ``use`` should be used instead of
-           ``ReplacedPageObject``.
-
-    The ``to_return`` parameter should capture the item class that the Page Object
-    is capable of returning. Before passing it to :class:`~.ApplyRule`, the
-    ``to_return`` value is primarily derived from the return class specified
-    from Page Objects that are subclasses of :class:`~.ItemPage` (see this
-    :ref:`example <item-class-example>`). However, a special case exists when a
-    Page Object returns a ``dict`` as an item but then the rule should have
-    ``to_return=None`` and **NOT** ``to_return=dict``.
-
-    The ``to_return`` parameter is used as a shortcut to directly retrieve the
-    item from the Page Object to be used for a given URL. It works as:
-
-        1. Given a URL and and item class that we want, match it respectively
-           against ``for_patterns`` and ``to_return`` from the registry rules.
-        2. This could give us a collection of rules. We need to select one based
-           on the highest priority set by `url-matcher`_.
-        3. When a single rule has been selected, create an instance of the Page
-           Object specified in its ``use`` parameter.
-        4. Finally, call the ``.to_item()`` method of the Page Object to retrieve
-           an instance of the item class.
-
-    Using the ``to_return`` parameter basically adds the convenient step #4 above.
+    to be used on.
+
+    When ``to_return`` is not None (e.g. ``to_return=MyItem``),
+    the Page Object in ``use`` is declared as capable of returning a certain
+    item class (i.e. ``MyItem``).
+
+    When ``instead_of`` is not None (e.g. ``instead_of=ReplacedPageObject``),
+    the rule adds an expectation that the ``ReplacedPageObject`` wouldn't
+    be used for the URLs matching ``for_patterns``, since the Page Object
+    in ``use`` will replace it.
+
+    If there are multiple rules which match a certain URL, the rule
+    to apply is picked based on the priorities set in ``for_patterns``.
 
     More information regarding its usage in :ref:`intro-overrides`.
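
Review note on the shortened docstring: the statement that the rule to apply is picked by priority when several rules match can be illustrated with a small, self-contained sketch. Everything here is an assumption made for illustration. ``SketchRule`` and ``page_for_item`` are hypothetical names, not web-poet API, and the hostname check is a simplification of what url-matcher does with ``for_patterns``.

```python
from dataclasses import dataclass
from typing import List, Optional
from urllib.parse import urlparse


@dataclass(frozen=True)
class SketchRule:
    """Hypothetical, simplified stand-in for an apply rule."""
    domain: str                      # stands in for ``for_patterns``
    use: type                        # page object to use
    to_return: Optional[type] = None # item class the page object returns
    priority: int = 500


def page_for_item(rules: List[SketchRule], url: str, item_cls: type) -> Optional[type]:
    """Pick the page object able to return ``item_cls`` for ``url``."""
    host = urlparse(url).hostname or ""
    matching = [
        rule for rule in rules
        if rule.to_return is item_cls
        and (host == rule.domain or host.endswith("." + rule.domain))
    ]
    if not matching:
        return None
    # Several rules may match; the one with the highest priority is chosen.
    # Tie-breaking between equal priorities is not specified in this sketch.
    return max(matching, key=lambda rule: rule.priority).use


# Placeholder item and page object classes used only in this sketch.
class Product: ...
class GenericProductPage: ...
class ShopProductPage: ...


RULES = [
    SketchRule("example.com", GenericProductPage, to_return=Product),
    SketchRule("shop.example.com", ShopProductPage, to_return=Product, priority=600),
]
```

Both rules match a URL on ``shop.example.com``, so the higher-priority ``ShopProductPage`` is selected there, while ``www.example.com`` only matches the generic rule.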