diff --git a/.flake8 b/.flake8 index a326177c..726947fe 100644 --- a/.flake8 +++ b/.flake8 @@ -34,5 +34,7 @@ per-file-ignores = # imports are there to expose submodule functions so they can be imported # directly from that module # F403: Ignore * imports in these files + # D102: Missing docstring in public method web_poet/__init__.py:F401,F403 web_poet/page_inputs/__init__.py:F401,F403 + tests/po_lib_to_return/__init__.py:D102 diff --git a/CHANGELOG.rst b/CHANGELOG.rst index ab3e8d51..c283579a 100644 --- a/CHANGELOG.rst +++ b/CHANGELOG.rst @@ -2,6 +2,42 @@ Changelog ========= +TBD +--- + +* New ``ApplyRule`` class created by the ``@handle_urls`` decorator. This is + nearly identical with ``OverrideRule`` except: + + * It's now accepting a ``to_return`` parameter which signifies the data + container class that the Page Object returns. + * Passing a string to ``for_patterns`` would auto-convert it into + ``url_matcher.Patterns``. + * All arguments are now keyword-only except for ``for_patterns``. + +* Modify the call signature and behavior of ``handle_urls``: + + * New ``instead_of`` parameter which does the same thing as ``overrides``. + * The old ``overrides`` parameter is not required anymore as it's set for + deprecation. + * It sets a ``to_return`` parameter when creating ``ApplyRule`` based on the + declared item class in subclasses of ``web_poet.ItemPage``. It's also + possible to pass a ``to_return`` parameter on more advanced use cases. + +* Documentation, test, and warning message improvements. + +Deprecations: + +* The ``overrides`` parameter from ``@handle_urls`` is now deprecated. + Use the ``instead_of`` parameter instead. +* The ``OverrideRule`` class is now deprecated. Use ``ApplyRule`` instead. +* The ``from_override_rules`` method of ``PageObjectRegistry`` is now deprecated. + Use ``from_apply_rules`` instead. +* The ``web_poet.overrides`` module is deprecated. Use ``web_poet.rules`` instead. 
+* The ``PageObjectRegistry.get_overrides`` method is deprecated. + Use ``PageObjectRegistry.get_rules`` instead. +* The ``PageObjectRegistry.search_overrides`` method is deprecated. + Use ``PageObjectRegistry.search_rules`` instead. + 0.5.1 (2022-09-23) ------------------ diff --git a/docs/advanced/additional-requests.rst b/docs/advanced/additional-requests.rst index c9effa47..76b8c701 100644 --- a/docs/advanced/additional-requests.rst +++ b/docs/advanced/additional-requests.rst @@ -1,4 +1,4 @@ -.. _`advanced-requests`: +.. _advanced-requests: =================== Additional Requests @@ -27,7 +27,7 @@ The key words "MUST”, "MUST NOT”, "REQUIRED”, "SHALL”, "SHALL NOT”, "S "SHOULD NOT”, "RECOMMENDED”, "MAY”, and "OPTIONAL” in this document are to be interpreted as described in RFC `2119 `_. -.. _`httprequest-example`: +.. _httprequest-example: HttpRequest =========== @@ -271,7 +271,7 @@ The key take aways for this example are: available. -.. _`httpclient`: +.. _httpclient: HttpClient ========== @@ -337,7 +337,7 @@ additional requests using the :meth:`~.HttpClient.request`, :meth:`~.HttpClient. and :meth:`~.HttpClient.post` methods of :class:`~.HttpClient`. These already define the :class:`~.HttpRequest` and executes it as well. -.. _`httpclient-get-example`: +.. _httpclient-get-example: A simple ``GET`` request ------------------------ @@ -376,7 +376,7 @@ There are a few things to take note in this example: * There is no need create an instance of :class:`~.HttpRequest` when :meth:`~.HttpClient.get` is used. -.. _`request-post-example`: +.. _request-post-example: A ``POST`` request with `header` and `body` ------------------------------------------- @@ -459,7 +459,7 @@ quick shortcuts for :meth:`~.HttpClient.request`: Thus, apart from the common ``GET`` and ``POST`` HTTP methods, you can use :meth:`~.HttpClient.request` for them (`e.g.` ``HEAD``, ``PUT``, ``DELETE``, etc). -.. _`http-batch-request-example`: +.. 
_http-batch-request-example: Batch requests -------------- @@ -567,7 +567,7 @@ The key takeaways for this example are: first response from a group of requests as early as possible. However, the order could be shuffled. -.. _`exception-handling`: +.. _exception-handling: Handling Exceptions in Page Objects =================================== diff --git a/docs/advanced/fields.rst b/docs/advanced/fields.rst index 11e8a7d6..64e39619 100644 --- a/docs/advanced/fields.rst +++ b/docs/advanced/fields.rst @@ -179,7 +179,9 @@ It's also possible to implement field cleaning and processing in ``to_item`` but in that case accessing a field directly will return the value without processing, so it's preferable to use field processors instead. -Item classes +.. _item-classes: + +Item Classes ------------ In all previous examples, ``to_item`` methods are returning ``dict`` @@ -220,7 +222,7 @@ its ``to_item()`` method starts to return item instances, instead of ``dict`` instances. In the example above ``ProductPage.to_item`` method returns ``Product`` instances. -Defining an Item class may be an overkill if you only have a single Page Object, +Defining an item class may be an overkill if you only have a single Page Object, but item classes are of a great help when * you need to extract data in the same format from multiple websites, or @@ -265,8 +267,8 @@ indicating that a required argument is missing. Without an item class, none of these errors are detected. -Changing Item type -~~~~~~~~~~~~~~~~~~ +Changing Item Class +~~~~~~~~~~~~~~~~~~~ Let's say there is a Page Object implemented, which outputs some standard item. Maybe there is a library of such Page Objects available. But for a @@ -333,7 +335,7 @@ to the item: # ... Note how :class:`~.Returns` is used as one of the base classes of -``CustomFooPage``; it allows to change the item type returned by a page object. +``CustomFooPage``; it allows to change the item class returned by a page object. 
Removing fields (as well as renaming) is a bit more tricky. @@ -344,7 +346,7 @@ inherit from the "base", "standard" Page Object, there could be a ``@field`` from the base class which is not present in the ``CustomItem``. It'd be still passed to ``CustomItem.__init__``, causing an exception. -One way to solve it is to make the orignal Page Object a dependency +One way to solve it is to make the original Page Object a dependency instead of inheriting from it, as explained in the beginning. Alternatively, you can use ``skip_nonitem_fields=True`` class argument - it tells @@ -368,7 +370,7 @@ is passed, and ``name`` is the only field ``CustomItem`` supports. To recap: -* Use ``Returns[NewItemType]`` to change the item type in a subclass. +* Use ``Returns[NewItemType]`` to change the item class in a subclass. * Don't use ``skip_nonitem_fields=True`` when your Page Object corresponds to an item exactly, or when you're only adding fields. This is a safe approach, which allows to detect typos in field names, even for optional diff --git a/docs/api-reference.rst b/docs/api-reference.rst index 2f931e86..a82544c4 100644 --- a/docs/api-reference.rst +++ b/docs/api-reference.rst @@ -1,4 +1,4 @@ -.. _`api-reference`: +.. _api-reference: ============= API Reference @@ -81,7 +81,7 @@ Exceptions :show-inheritance: :members: -.. _`api-overrides`: +.. _api-overrides: Overrides ========= @@ -91,7 +91,7 @@ use cases and some examples. .. autofunction:: web_poet.handle_urls -.. automodule:: web_poet.overrides +.. automodule:: web_poet.rules :members: :exclude-members: handle_urls diff --git a/docs/index.rst b/docs/index.rst index 77c852e0..e88157b4 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -53,7 +53,7 @@ and the motivation behind ``web-poet``, start with :ref:`from-ground-up`. changelog license -.. _`web-poet`: https://github.com/scrapinghub/web-poet +.. _web-poet: https://github.com/scrapinghub/web-poet .. _Scrapy: https://scrapy.org/ .. 
_scrapy-poet: https://github.com/scrapinghub/scrapy-poet diff --git a/docs/intro/from-ground-up.rst b/docs/intro/from-ground-up.rst index 9af252d5..406eebcc 100644 --- a/docs/intro/from-ground-up.rst +++ b/docs/intro/from-ground-up.rst @@ -1,4 +1,4 @@ -.. _`from-ground-up`: +.. _from-ground-up: =========================== web-poet from the ground up diff --git a/docs/intro/overrides.rst b/docs/intro/overrides.rst index 0939e13b..8431ef95 100644 --- a/docs/intro/overrides.rst +++ b/docs/intro/overrides.rst @@ -1,18 +1,118 @@ -.. _`intro-overrides`: +.. _intro-overrides: + +Apply Rules +=========== + +Overview +-------- + +@handle_urls +~~~~~~~~~~~~ + +web-poet provides a :func:`~.handle_urls` decorator, which allows to +declare how the page objects can be used (applied): + +* for which websites / URL patterns they work, +* which data type (item classes) they can return, +* which page objects can they replace (override; more on this later). + +.. code-block:: python + + from web_poet import ItemPage, handle_urls + from my_items import MyItem + + @handle_urls("example.com") + class MyPage(ItemPage[MyItem]): + # ... + + +``handle_urls("example.com")`` can serve as a documentation, but it also enables +getting the information about page objects programmatically. +The information about all page objects decorated with +:func:`~.handle_urls` is stored in ``web_poet.default_registry``, which is +an instance of :class:`~.PageObjectRegistry`. In the example above, the +following :class:`~.ApplyRule` is added to the registry: + +.. code-block:: + + ApplyRule( + for_patterns=Patterns(include=('example.com',), exclude=(), priority=500), + use=, + instead_of=None, + to_return=, + meta={} + ) + +Note how ``rule.to_return`` is set to ``MyItem`` automatically. +This can be used by libraries like `scrapy-poet`_. For example, +if a spider needs to extract ``MyItem`` from some page on the ``example.com`` +website, `scrapy-poet`_ now knows that ``MyPage`` page object can be used. + +.. 
_scrapy-poet: https://scrapy-poet.readthedocs.io + +Specifying the URL patterns +~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +:func:`~handle_urls` decorator uses url-matcher_ library to define the +URL rules. Some examples: + +.. code-block:: python + + # page object can be applied on any URL from the example.com domain, + # or from any of its subdomains + @handle_urls("example.com") + + # page object can be applied on example.com pages under /products/ path + @handle_urls("example.com/products/") + + # page object can be applied on any URL from example.com, but only if + # it contains "productId=..." in the query string + @handle_urls("example.com?productId=*") + +The string passed to :func:`~.handle_urls` is converted to +a :class:`url_matcher.matcher.Patterns` instance. Please consult +with the url-matcher_ documentation to learn more about the possible rules; +it is pretty flexible. You can exclude patterns, use wildcards, +require certain query parameters to be present and ignore others, etc. +Unlike regexes, this mini-language "understands" the URL structure. + +.. _url-matcher: https://url-matcher.readthedocs.io Overrides -========= +~~~~~~~~~ + +:func:`~.handle_urls` can be used to declare that a particular Page Object +could (and should) be used *instead of* some other Page Object on +certain URL patterns: + +.. code-block:: python + + from web_poet import ItemPage, handle_urls + from my_items import Product + from my_pages import DefaultProductPage -Overrides contains mapping rules to associate which URLs a particular -Page Object would be used. The URL matching rules is handled by another library -called `url-matcher `_. + @handle_urls("site1.example.com", instead_of=DefaultProductPage) + class Site1ProductPage(ItemPage[Product]): + # ... -Using such rules establishes the core concept of Overrides wherein a developer -could declare that a specific Page Object must be used *(instead of another)* -for a given set of URL patterns. 
+ @handle_urls("site2.example.com", instead_of=DefaultProductPage) + class Site2ProductPage(ItemPage[Product]): + # ... -This enables **web-poet** to be used effectively by other frameworks like -`scrapy-poet `_. +This concept is a bit more advanced than the basic ``handle_urls`` usage +("this Page Object can return ``MyItem`` on example.com website"). + +A common use case is a "generic", or a "template" spider, which uses some +default implementation of the extraction, and allows to replace it +("override") on specific websites or URL patterns. + +This default page extraction (``DefaultProductPage`` in the example) can be based on +semantic markup, Machine Learning, heuristics, or just be empty. Page Objects which +can be used instead of the default (``Site1ProductPage``, ``Site2ProductPage``) +are commonly written using XPath or CSS selectors, with website-specific rules. + +Libraries like scrapy-poet_ allow to create such "generic" spiders by +using the information declared via ``handle_urls(..., instead_of=...)``. Example Use Case ---------------- @@ -27,7 +127,7 @@ going to crawl beforehand. However, we could at least create a generic Page Object to support parsing of some fields in well-known locations of product information like ``<title>``. This enables our broadcrawler to at least parse some useful information. Let's -call such Page Object to be ``GenericProductPage``. +call such a Page Object to be ``GenericProductPage``. Assuming that one of our project requirements is to fully support parsing of the `top 3 eCommerce websites`, then we'd need to create a Page Object for each one @@ -50,6 +150,28 @@ Let's see this in action by declaring the Overrides in the Page Objects below. Creating Overrides ------------------ +To simplify the code examples in the next few subsections, let's assume that +these item classes have been predefined: + +.. 
code-block:: python + + import attrs + + + @attrs.define + class Product: + product_title: str + regular_price: float + + + @attrs.define + class SimilarProduct: + product_title: str + regular_price: float + +Page Object +~~~~~~~~~~~ + Let's take a look at how the following code is structured: .. code-block:: python @@ -58,84 +180,199 @@ Let's take a look at how the following code is structured: class GenericProductPage(WebPage): - def to_item(self): - return {"product-title": self.css("title::text").get()} + def to_item(self) -> Product: + return Product(product_title=self.css("title::text").get()) - @handle_urls("example.com", overrides=GenericProductPage) + @handle_urls("some.example", instead_of=GenericProductPage) class ExampleProductPage(WebPage): - def to_item(self): - ... # more specific parsing + ... # more specific parsing - @handle_urls("anotherexample.com", overrides=GenericProductPage, exclude="/digital-goods/") + @handle_urls("another.example", instead_of=GenericProductPage, exclude="/digital-goods/") class AnotherExampleProductPage(WebPage): - def to_item(self): - ... # more specific parsing + ... # more specific parsing - @handle_urls(["dualexample.com/shop/?product=*", "dualexample.net/store/?pid=*"], overrides=GenericProductPage) + @handle_urls(["dual.example/shop/?product=*", "uk.dual.example/store/?pid=*"], instead_of=GenericProductPage) class DualExampleProductPage(WebPage): - def to_item(self): - ... # more specific parsing + ... # more specific parsing The code above declares that: - - For sites that match the ``example.com`` pattern, ``ExampleProductPage`` + - The Page Objects return ``Product`` and ``SimilarProduct`` item classes. + Returning item classes is a preferred approach as explained in the + :ref:`web-poet-fields` section. + - For sites that match the ``some.example`` pattern, ``ExampleProductPage`` would be used instead of ``GenericProductPage``. 
- The same is true for ``DualExampleProductPage`` where it is used instead of ``GenericProductPage`` for two URL patterns which works as something like: - - :sub:`(match) https://www.dualexample.com/shop/electronics/?product=123` - - :sub:`(match) https://www.dualexample.com/shop/books/paperback/?product=849` - - :sub:`(NO match) https://www.dualexample.com/on-sale/books/?product=923` - - :sub:`(match) https://www.dualexample.net/store/kitchen/?pid=776` - - :sub:`(match) https://www.dualexample.net/store/?pid=892` - - :sub:`(NO match) https://www.dualexample.net/new-offers/fitness/?pid=892` + - :sub:`(match) https://www.dual.example/shop/electronics/?product=123` + - :sub:`(match) https://www.dual.example/shop/books/paperback/?product=849` + - :sub:`(NO match) https://www.dual.example/on-sale/books/?product=923` + - :sub:`(match) https://www.uk.dual.example/store/kitchen/?pid=776` + - :sub:`(match) https://www.uk.dual.example/store/?pid=892` + - :sub:`(NO match) https://www.uk.dual.example/new-offers/fitness/?pid=892` - - On the other hand, ``AnotherExampleProductPage`` is only used instead of - ``GenericProductPage`` when we're handling pages from ``anotherexample.com`` - that doesn't contain ``/digital-goods/`` in its URL path. + - On the other hand, ``AnotherExampleProductPage`` is used instead of + ``GenericProductPage`` when we're handling pages that match the + ``another.example`` URL Pattern, which doesn't contain + ``/digital-goods/`` in its URL path. .. tip:: - The URL patterns declared in the ``@handle_urls`` annotation can still be + The URL patterns declared in the ``@handle_urls`` decorator can still be further customized. You can read some of the specific parameters in the :ref:`API section <api-overrides>` of :func:`web_poet.handle_urls`. +.. _item-class-example: + +Item Class +~~~~~~~~~~ + +An alternative approach for the Page Object Overrides example above is to specify +the returned item class. 
For example, we could change the previous example into +the following: + + +.. code-block:: python + + from web_poet import handle_urls, WebPage + + + class GenericProductPage(WebPage[Product]): + def to_item(self) -> Product: + return Product(product_title=self.css("title::text").get()) + + + @handle_urls("some.example") + class ExampleProductPage(WebPage[Product]): + ... # more specific parsing + + + @handle_urls("another.example", exclude="/digital-goods/") + class AnotherExampleProductPage(WebPage[Product]): + ... # more specific parsing + + + @handle_urls(["dual.example/shop/?product=*", "uk.dual.example/store/?pid=*"]) + class DualExampleProductPage(WebPage[Product]): + ... # more specific parsing + +Let's break this example down: + + - The URL patterns are exactly the same as with the previous code example. + - The ``@handle_urls`` decorator determines the item class to return (i.e. + ``Product``) from the decorated Page Object. + - The ``instead_of`` parameter can be omitted in lieu of the derived Item + Class from the Page Object which becomes the ``to_return`` attribute in + :class:`~.ApplyRule` instances. This means that: + + - If a ``Product`` item class is requested for URLs matching with the + "some.example" pattern, then the ``Product`` item class would come from + the ``to_item()`` method of ``ExampleProductPage``. + - Similarly, if a page with a URL matches with "another.example" without + the "/digital-goods/" path, then the ``Product`` item class comes from + the ``AnotherExampleProductPage`` Page Object. + - However, if a ``Product`` item class is requested matching with the URL + pattern of "dual.example/shop/?product=*", a ``SimilarProduct`` + item class is returned by the ``DualExampleProductPage``'s ``to_item()`` + method instead. + +Specifying the item class that a Page Object returns makes it possible for +web-poet frameworks to make Page Object usage transparent to end users. 
+ +For example, a web-poet framework could implement a function like: + +.. code-block:: python + + item = get_item(url, item_class=Product) + +Here there is no reference to the Page Object being used underneath, you only +need to indicate the desired item class, and the web-poet framework +automatically determines the Page Object to use based on the specified URL and +the specified item class. + +Note, however, that web-poet frameworks are encouraged to also allow getting a +Page Object instead of an item class instance, for scenarios where end users +wish access to Page Object attributes and methods. + + +.. _combination: + +Combination +~~~~~~~~~~~ + +Of course, you can use the combination of both which enables you to specify in +either contexts of Page Objects and item classes. + +.. code-block:: python + + from web_poet import handle_urls, WebPage + + + class GenericProductPage(WebPage[Product]): + def to_item(self) -> Product: + return Product(product_title=self.css("title::text").get()) + + + @handle_urls("some.example", instead_of=GenericProductPage) + class ExampleProductPage(WebPage[Product]): + ... # more specific parsing + + + @handle_urls("another.example", instead_of=GenericProductPage, exclude="/digital-goods/") + class AnotherExampleProductPage(WebPage[Product]): + ... # more specific parsing + + + @handle_urls(["dual.example/shop/?product=*", "uk.dual.example/store/?pid=*"], instead_of=GenericProductPage) + class DualExampleProductPage(WebPage[SimilarProduct]): + ... # more specific parsing + +See the next :ref:`retrieving-overrides` section to observe what are the actual +:class:`~.ApplyRule` that were created by the ``@handle_urls`` decorators. + + +.. _retrieving-overrides: Retrieving all available Overrides ---------------------------------- -The :meth:`~.PageObjectRegistry.get_overrides` method from the ``web_poet.default_registry`` -allows retrieval of all :class:`~.OverrideRule` in the given registry. 
-Following from our example above, using it would be: +The :meth:`~.PageObjectRegistry.get_rules` method from the ``web_poet.default_registry`` +allows retrieval of all :class:`~.ApplyRule` in the given registry. +Following from our example above in the :ref:`combination` section, using it +would be: .. code-block:: python from web_poet import default_registry - # Retrieves all OverrideRules that were registered in the registry - rules = default_registry.get_overrides() + # Retrieves all ApplyRules that were registered in the registry + rules = default_registry.get_rules() - print(len(rules)) # 3 - print(rules[0]) # OverrideRule(for_patterns=Patterns(include=['example.com'], exclude=[], priority=500), use=<class 'my_project.page_objects.ExampleProductPage'>, instead_of=<class 'my_project.page_objects.GenericProductPage'>, meta={}) + for r in rules: + print(r) + # ApplyRule(for_patterns=Patterns(include=('some.example',), exclude=(), priority=500), use=<class 'ExampleProductPage'>, instead_of=<class 'GenericProductPage'>, to_return=<class 'Product'>, meta={}) + # ApplyRule(for_patterns=Patterns(include=('another.example',), exclude=('/digital-goods/',), priority=500), use=<class 'AnotherExampleProductPage'>, instead_of=<class 'GenericProductPage'>, to_return=<class 'Product'>, meta={}) + # ApplyRule(for_patterns=Patterns(include=('dual.example/shop/?product=*', 'uk.dual.example/store/?pid=*'), exclude=(), priority=500), use=<class 'DualExampleProductPage'>, instead_of=<class 'GenericProductPage'>, to_return=<class 'SimilarProduct'>, meta={}) Remember that using ``@handle_urls`` to annotate the Page Objects would result -in the :class:`~.OverrideRule` to be written into ``web_poet.default_registry``. +in the :class:`~.ApplyRule` to be written into ``web_poet.default_registry``. .. 
warning:: - :meth:`~.PageObjectRegistry.get_overrides` relies on the fact that all essential + :meth:`~.PageObjectRegistry.get_rules` relies on the fact that all essential packages/modules which contains the :func:`web_poet.handle_urls` - annotations are properly loaded `(i.e imported)`. + decorators are properly loaded `(i.e imported)`. Thus, for cases like importing and using Page Objects from other external packages, - the ``@handle_urls`` annotations from these external sources must be read and + the ``@handle_urls`` decorators from these external sources must be read and processed properly. This ensures that the external Page Objects have all of their - :class:`~.OverrideRule` present. + :class:`~.ApplyRule` present. This can be done via the function named :func:`~.web_poet.overrides.consume_modules`. Here's an example: @@ -145,7 +382,7 @@ in the :class:`~.OverrideRule` to be written into ``web_poet.default_registry``. from web_poet import default_registry, consume_modules consume_modules("external_package_A.po", "another_ext_package.lib") - rules = default_registry.get_overrides() + rules = default_registry.get_rules() The next section explores this caveat further. @@ -154,12 +391,12 @@ Using Overrides from External Packages -------------------------------------- Developers have the option to import existing Page Objects alongside the -:class:`~.OverrideRule` attached to them. This section aims to showcase different +:class:`~.ApplyRule` attached to them. This section aims to showcase different scenarios that come up when using multiple Page Object Projects. -.. _`intro-rule-all`: +.. 
_intro-rule-all: -Using all available OverrideRules from multiple Page Object Projects +Using all available ApplyRules from multiple Page Object Projects ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Let's suppose we have the following use case before us: @@ -174,9 +411,9 @@ Let's suppose we have the following use case before us: - Thus, you'd want to use the already available packages above and perhaps improve on them or create new Page Objects for new websites. -Remember that all of the :class:`~.OverrideRule` are declared by annotating +Remember that all of the :class:`~.ApplyRule` are declared by annotating Page Objects using the :func:`web_poet.handle_urls` via ``@handle_urls``. Thus, -they can easily be accessed using the :meth:`~.PageObjectRegistry.get_overrides` +they can easily be accessed using the :meth:`~.PageObjectRegistry.get_rules` of ``web_poet.default_registry``. This can be done something like: @@ -186,21 +423,21 @@ This can be done something like: from web_poet import default_registry, consume_modules # ❌ Remember that this wouldn't retrieve any rules at all since the - # annotations are NOT properly imported. - rules = default_registry.get_overrides() + # ``@handle_urls`` decorators are NOT properly loaded. + rules = default_registry.get_rules() print(rules) # [] # ✅ Instead, you need to run the following so that all of the Page # Objects in the external packages are recursively imported. consume_modules("ecommerce_page_objects", "gadget_sites_page_objects") - rules = default_registry.get_overrides() + rules = default_registry.get_rules() # The collected rules would then be as follows: print(rules) - # 1. OverrideRule(for_patterns=Patterns(include=['site_1.com'], exclude=[], priority=500), use=<class 'ecommerce_page_objects.site_1.EcomSite1'>, instead_of=<class 'ecommerce_page_objects.EcomGenericPage'>, meta={}) - # 2. 
OverrideRule(for_patterns=Patterns(include=['site_2.com'], exclude=[], priority=500), use=<class 'ecommerce_page_objects.site_2.EcomSite2'>, instead_of=<class 'ecommerce_page_objects.EcomGenericPage'>, meta={}) - # 3. OverrideRule(for_patterns=Patterns(include=['site_2.com'], exclude=[], priority=500), use=<class 'gadget_sites_page_objects.site_2.GadgetSite2'>, instead_of=<class 'gadget_sites_page_objects.GadgetGenericPage'>, meta={}) - # 4. OverrideRule(for_patterns=Patterns(include=['site_3.com'], exclude=[], priority=500), use=<class 'gadget_sites_page_objects.site_3.GadgetSite3'>, instead_of=<class 'gadget_sites_page_objects.GadgetGenericPage'>, meta={}) + # 1. ApplyRule(for_patterns=Patterns(include=['site_1.example'], exclude=[], priority=500), use=<class 'ecommerce_page_objects.site_1.EcomSite1'>, instead_of=<class 'ecommerce_page_objects.EcomGenericPage'>, to_return=None, meta={}) + # 2. ApplyRule(for_patterns=Patterns(include=['site_2.example'], exclude=[], priority=500), use=<class 'ecommerce_page_objects.site_2.EcomSite2'>, instead_of=<class 'ecommerce_page_objects.EcomGenericPage'>, to_return=None, meta={}) + # 3. ApplyRule(for_patterns=Patterns(include=['site_2.example'], exclude=[], priority=500), use=<class 'gadget_sites_page_objects.site_2.GadgetSite2'>, instead_of=<class 'gadget_sites_page_objects.GadgetGenericPage'>, to_return=None, meta={}) + # 4. ApplyRule(for_patterns=Patterns(include=['site_3.example'], exclude=[], priority=500), use=<class 'gadget_sites_page_objects.site_3.GadgetSite3'>, instead_of=<class 'gadget_sites_page_objects.GadgetGenericPage'>, to_return=None, meta={}) .. note:: @@ -209,18 +446,18 @@ This can be done something like: runtime duration. Calling :func:`~.web_poet.overrides.consume_modules` again makes no difference unless a new set of modules are provided. -.. _`intro-rule-subset`: +.. 
_intro-rule-subset: -Using only a subset of the available OverrideRules -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Using only a subset of the available ApplyRules +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Suppose that the use case from the previous section has changed wherein a -subset of :class:`~.OverrideRule` would be used. This could be achieved by -using the :meth:`~.PageObjectRegistry.search_overrides` method which allows for +subset of :class:`~.ApplyRule` would be used. This could be achieved by +using the :meth:`~.PageObjectRegistry.search_rules` method which allows for convenient selection of a subset of rules from a given registry. Here's an example of how you could manually select the rules using the -:meth:`~.PageObjectRegistry.search_overrides` method instead: +:meth:`~.PageObjectRegistry.search_rules` method instead: .. code-block:: python @@ -229,50 +466,50 @@ Here's an example of how you could manually select the rules using the consume_modules("ecommerce_page_objects", "gadget_sites_page_objects") - ecom_rules = default_registry.search_overrides(instead_of=ecommerce_page_objects.EcomGenericPage) + ecom_rules = default_registry.search_rules(instead_of=ecommerce_page_objects.EcomGenericPage) print(ecom_rules) - # OverrideRule(for_patterns=Patterns(include=['site_1.com'], exclude=[], priority=500), use=<class 'ecommerce_page_objects.site_1.EcomSite1'>, instead_of=<class 'ecommerce_page_objects.EcomGenericPage'>, meta={}) - # OverrideRule(for_patterns=Patterns(include=['site_2.com'], exclude=[], priority=500), use=<class 'ecommerce_page_objects.site_2.EcomSite2'>, instead_of=<class 'ecommerce_page_objects.EcomGenericPage'>, meta={}) + # ApplyRule(for_patterns=Patterns(include=['site_1.example'], exclude=[], priority=500), use=<class 'ecommerce_page_objects.site_1.EcomSite1'>, instead_of=<class 'ecommerce_page_objects.EcomGenericPage'>, to_return=None, meta={}) + # ApplyRule(for_patterns=Patterns(include=['site_2.example'], exclude=[], 
priority=500), use=<class 'ecommerce_page_objects.site_2.EcomSite2'>, instead_of=<class 'ecommerce_page_objects.EcomGenericPage'>, to_return=None, meta={}) - gadget_rules = default_registry.search_overrides(use=gadget_sites_page_objects.site_3.GadgetSite3) + gadget_rules = default_registry.search_rules(use=gadget_sites_page_objects.site_3.GadgetSite3) print(gadget_rules) - # OverrideRule(for_patterns=Patterns(include=['site_3.com'], exclude=[], priority=500), use=<class 'gadget_sites_page_objects.site_3.GadgetSite3'>, instead_of=<class 'gadget_sites_page_objects.GadgetGenericPage'>, meta={}) + # ApplyRule(for_patterns=Patterns(include=['site_3.example'], exclude=[], priority=500), use=<class 'gadget_sites_page_objects.site_3.GadgetSite3'>, instead_of=<class 'gadget_sites_page_objects.GadgetGenericPage'>, to_return=None, meta={}) rules = ecom_rules + gadget_rules print(rules) - # OverrideRule(for_patterns=Patterns(include=['site_1.com'], exclude=[], priority=500), use=<class 'ecommerce_page_objects.site_1.EcomSite1'>, instead_of=<class 'ecommerce_page_objects.EcomGenericPage'>, meta={}) - # OverrideRule(for_patterns=Patterns(include=['site_2.com'], exclude=[], priority=500), use=<class 'ecommerce_page_objects.site_2.EcomSite2'>, instead_of=<class 'ecommerce_page_objects.EcomGenericPage'>, meta={}) - # OverrideRule(for_patterns=Patterns(include=['site_3.com'], exclude=[], priority=500), use=<class 'gadget_sites_page_objects.site_3.GadgetSite3'>, instead_of=<class 'gadget_sites_page_objects.GadgetGenericPage'>, meta={}) + # ApplyRule(for_patterns=Patterns(include=['site_1.example'], exclude=[], priority=500), use=<class 'ecommerce_page_objects.site_1.EcomSite1'>, instead_of=<class 'ecommerce_page_objects.EcomGenericPage'>, to_return=None, meta={}) + # ApplyRule(for_patterns=Patterns(include=['site_2.example'], exclude=[], priority=500), use=<class 'ecommerce_page_objects.site_2.EcomSite2'>, instead_of=<class 'ecommerce_page_objects.EcomGenericPage'>, to_return=None, 
meta={}) + # ApplyRule(for_patterns=Patterns(include=['site_3.example'], exclude=[], priority=500), use=<class 'gadget_sites_page_objects.site_3.GadgetSite3'>, instead_of=<class 'gadget_sites_page_objects.GadgetGenericPage'>, to_return=None, meta={}) -As you can see, using the :meth:`~.PageObjectRegistry.search_overrides` method allows you to -conveniently select for :class:`~.OverrideRule` which conform to a specific criteria. This -allows you to conveniently drill down to which :class:`~.OverrideRule` you're interested in +As you can see, using the :meth:`~.PageObjectRegistry.search_rules` method allows you to +conveniently select for :class:`~.ApplyRule` which conform to a specific criteria. This +allows you to conveniently drill down to which :class:`~.ApplyRule` you're interested in using. -.. _`overrides-custom-registry`: +.. _overrides-custom-registry: After gathering all the pre-selected rules, we can then store it in a new instance of :class:`~.PageObjectRegistry` in order to separate it from the ``default_registry`` -which contains all of the rules. We can use the :meth:`~.PageObjectRegistry.from_override_rules` +which contains all of the rules. We can use the :meth:`~.PageObjectRegistry.from_apply_rules` for this: .. code-block:: python from web_poet import PageObjectRegistry - my_new_registry = PageObjectRegistry.from_override_rules(rules) + my_new_registry = PageObjectRegistry.from_apply_rules(rules) -.. _`intro-improve-po`: +.. _intro-improve-po: Improving on external Page Objects ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -There would be cases wherein you're using Page Objects with :class:`~.OverrideRule` +There would be cases wherein you're using Page Objects with :class:`~.ApplyRule` from external packages only to find out that a few of them lacks some of the fields or features that you need. 
-Let's suppose that we wanted to use `all` of the :class:`~.OverrideRule` similar +Let's suppose that we wanted to use `all` of the :class:`~.ApplyRule` similar to this section: :ref:`intro-rule-all`. However, the ``EcomSite1`` Page Object needs to properly handle some edge cases where some fields are not being extracted properly. One way to fix this is to subclass the said Page Object and improve its @@ -285,32 +522,32 @@ have the first approach as an example: import ecommerce_page_objects, gadget_sites_page_objects consume_modules("ecommerce_page_objects", "gadget_sites_page_objects") - rules = default_registry.get_overrides() + rules = default_registry.get_rules() # The collected rules would then be as follows: print(rules) - # 1. OverrideRule(for_patterns=Patterns(include=['site_1.com'], exclude=[], priority=500), use=<class 'ecommerce_page_objects.site_1.EcomSite1'>, instead_of=<class 'ecommerce_page_objects.EcomGenericPage'>, meta={}) - # 2. OverrideRule(for_patterns=Patterns(include=['site_2.com'], exclude=[], priority=500), use=<class 'ecommerce_page_objects.site_2.EcomSite2'>, instead_of=<class 'ecommerce_page_objects.EcomGenericPage'>, meta={}) - # 3. OverrideRule(for_patterns=Patterns(include=['site_2.com'], exclude=[], priority=500), use=<class 'gadget_sites_page_objects.site_2.GadgetSite2'>, instead_of=<class 'gadget_sites_page_objects.GadgetGenericPage'>, meta={}) - # 4. OverrideRule(for_patterns=Patterns(include=['site_3.com'], exclude=[], priority=500), use=<class 'gadget_sites_page_objects.site_3.GadgetSite3'>, instead_of=<class 'gadget_sites_page_objects.GadgetGenericPage'>, meta={}) + # 1. ApplyRule(for_patterns=Patterns(include=['site_1.example'], exclude=[], priority=500), use=<class 'ecommerce_page_objects.site_1.EcomSite1'>, instead_of=<class 'ecommerce_page_objects.EcomGenericPage'>, to_return=None, meta={}) + # 2. 
ApplyRule(for_patterns=Patterns(include=['site_2.example'], exclude=[], priority=500), use=<class 'ecommerce_page_objects.site_2.EcomSite2'>, instead_of=<class 'ecommerce_page_objects.EcomGenericPage'>, to_return=None, meta={}) + # 3. ApplyRule(for_patterns=Patterns(include=['site_2.example'], exclude=[], priority=500), use=<class 'gadget_sites_page_objects.site_2.GadgetSite2'>, instead_of=<class 'gadget_sites_page_objects.GadgetGenericPage'>, to_return=None, meta={}) + # 4. ApplyRule(for_patterns=Patterns(include=['site_3.example'], exclude=[], priority=500), use=<class 'gadget_sites_page_objects.site_3.GadgetSite3'>, instead_of=<class 'gadget_sites_page_objects.GadgetGenericPage'>, to_return=None, meta={}) - @handle_urls("site_1.com", overrides=ecommerce_page_objects.EcomGenericPage, priority=1000) + @handle_urls("site_1.example", instead_of=ecommerce_page_objects.EcomGenericPage, priority=1000) class ImprovedEcomSite1(ecommerce_page_objects.site_1.EcomSite1): def to_item(self): ... # call super().to_item() and improve on the item's shortcomings - rules = default_registry.get_overrides() + rules = default_registry.get_rules() print(rules) - # 1. OverrideRule(for_patterns=Patterns(include=['site_1.com'], exclude=[], priority=500), use=<class 'ecommerce_page_objects.site_1.EcomSite1'>, instead_of=<class 'ecommerce_page_objects.EcomGenericPage'>, meta={}) - # 2. OverrideRule(for_patterns=Patterns(include=['site_2.com'], exclude=[], priority=500), use=<class 'ecommerce_page_objects.site_2.EcomSite2'>, instead_of=<class 'ecommerce_page_objects.EcomGenericPage'>, meta={}) - # 3. OverrideRule(for_patterns=Patterns(include=['site_2.com'], exclude=[], priority=500), use=<class 'gadget_sites_page_objects.site_2.GadgetSite2'>, instead_of=<class 'gadget_sites_page_objects.GadgetGenericPage'>, meta={}) - # 4. 
OverrideRule(for_patterns=Patterns(include=['site_3.com'], exclude=[], priority=500), use=<class 'gadget_sites_page_objects.site_3.GadgetSite3'>, instead_of=<class 'gadget_sites_page_objects.GadgetGenericPage'>, meta={}) - # 5. OverrideRule(for_patterns=Patterns(include=['site_1.com'], exclude=[], priority=1000), use=<class 'my_project.ImprovedEcomSite1'>, instead_of=<class 'ecommerce_page_objects.EcomGenericPage'>, meta={}) + # 1. ApplyRule(for_patterns=Patterns(include=['site_1.example'], exclude=[], priority=500), use=<class 'ecommerce_page_objects.site_1.EcomSite1'>, instead_of=<class 'ecommerce_page_objects.EcomGenericPage'>, to_return=None, meta={}) + # 2. ApplyRule(for_patterns=Patterns(include=['site_2.example'], exclude=[], priority=500), use=<class 'ecommerce_page_objects.site_2.EcomSite2'>, instead_of=<class 'ecommerce_page_objects.EcomGenericPage'>, to_return=None, meta={}) + # 3. ApplyRule(for_patterns=Patterns(include=['site_2.example'], exclude=[], priority=500), use=<class 'gadget_sites_page_objects.site_2.GadgetSite2'>, instead_of=<class 'gadget_sites_page_objects.GadgetGenericPage'>, to_return=None, meta={}) + # 4. ApplyRule(for_patterns=Patterns(include=['site_3.example'], exclude=[], priority=500), use=<class 'gadget_sites_page_objects.site_3.GadgetSite3'>, instead_of=<class 'gadget_sites_page_objects.GadgetGenericPage'>, to_return=None, meta={}) + # 5. ApplyRule(for_patterns=Patterns(include=['site_1.example'], exclude=[], priority=1000), use=<class 'my_project.ImprovedEcomSite1'>, instead_of=<class 'ecommerce_page_objects.EcomGenericPage'>, to_return=None, meta={}) -Notice that we're adding a new :class:`~.OverrideRule` for the same URL pattern -for ``site_1.com``. +Notice that we're adding a new :class:`~.ApplyRule` for the same URL pattern +for ``site_1.example``. 
-When the time comes that a Page Object needs to be selected when parsing ``site_1.com`` +When the time comes that a Page Object needs to be selected when parsing ``site_1.example`` and it needs to replace ``ecommerce_page_objects.EcomGenericPage``, rules **#1** and **#5** will be the choices. However, since we've assigned a much **higher priority** for the new rule in **#5** than the default ``500`` value, rule **#5** will be @@ -324,7 +561,7 @@ Handling conflicts from using Multiple External Packages -------------------------------------------------------- You might've observed from the previous section that retrieving the list of all -:class:`~.OverrideRule` from two different external packages may result in a +:class:`~.ApplyRule` from two different external packages may result in a conflict. We can take a look at the rules for **#2** and **#3** when we were importing all @@ -332,39 +569,39 @@ available rules: .. code-block:: python - # 2. OverrideRule(for_patterns=Patterns(include=['site_2.com'], exclude=[], priority=500), use=<class 'ecommerce_page_objects.site_2.EcomSite2'>, instead_of=<class 'ecommerce_page_objects.EcomGenericPage'>, meta={}) - # 3. OverrideRule(for_patterns=Patterns(include=['site_2.com'], exclude=[], priority=500), use=<class 'gadget_sites_page_objects.site_2.GadgetSite2'>, instead_of=<class 'gadget_sites_page_objects.GadgetGenericPage'>, meta={}) + # 2. ApplyRule(for_patterns=Patterns(include=['site_2.example'], exclude=[], priority=500), use=<class 'ecommerce_page_objects.site_2.EcomSite2'>, instead_of=<class 'ecommerce_page_objects.EcomGenericPage'>, to_return=None, meta={}) + # 3. 
ApplyRule(for_patterns=Patterns(include=['site_2.example'], exclude=[], priority=500), use=<class 'gadget_sites_page_objects.site_2.GadgetSite2'>, instead_of=<class 'gadget_sites_page_objects.GadgetGenericPage'>, to_return=None, meta={}) However, it's technically **NOT** a `conflict`, **yet**, since: - - ``ecommerce_page_objects.site_2.EcomSite2`` would only be used in **site_2.com** + - ``ecommerce_page_objects.site_2.EcomSite2`` would only be used in **site_2.example** if ``ecommerce_page_objects.EcomGenericPage`` is to be replaced. - The same case with ``gadget_sites_page_objects.site_2.GadgetSite2`` wherein - it's only going to be utilized for **site_2.com** if the following is to be + it's only going to be utilized for **site_2.example** if the following is to be replaced: ``gadget_sites_page_objects.GadgetGenericPage``. -It would be only become a conflict if both rules for **site_2.com** `intend to +It would be only become a conflict if both rules for **site_2.example** `intend to replace the` **same** `Page Object`. -However, let's suppose that there are some :class:`~.OverrideRule` which actually +However, let's suppose that there are some :class:`~.ApplyRule` which actually result in a conflict. To give an example, let's suppose that rules **#2** and **#3** `intends to replace the` **same** `Page Object`. It would look something like: .. code-block:: python - # 2. OverrideRule(for_patterns=Patterns(include=['site_2.com'], exclude=[], priority=500), use=<class 'ecommerce_page_objects.site_2.EcomSite2'>, instead_of=<class 'common_items.ProductGenericPage'>, meta={}) - # 3. OverrideRule(for_patterns=Patterns(include=['site_2.com'], exclude=[], priority=500), use=<class 'gadget_sites_page_objects.site_2.GadgetSite2'>, instead_of=<class 'common_items.ProductGenericPage'>, meta={}) + # 2. 
ApplyRule(for_patterns=Patterns(include=['site_2.example'], exclude=[], priority=500), use=<class 'ecommerce_page_objects.site_2.EcomSite2'>, instead_of=<class 'common_items.ProductGenericPage'>, to_return=None, meta={}) + # 3. ApplyRule(for_patterns=Patterns(include=['site_2.example'], exclude=[], priority=500), use=<class 'gadget_sites_page_objects.site_2.GadgetSite2'>, instead_of=<class 'common_items.ProductGenericPage'>, to_return=None, meta={}) Notice that the ``instead_of`` param are the same and only the ``use`` param remained different. There are two main ways we recommend in solving this. -.. _`priority-resolution`: +.. _priority-resolution: **1. Priority Resolution** -If you notice, the ``for_patterns`` attribute of :class:`~.OverrideRule` is an +If you notice, the ``for_patterns`` attribute of :class:`~.ApplyRule` is an instance of `url_matcher.Patterns <https://url-matcher.readthedocs.io/en/stable/api_reference.html#module-url-matcher>`_. This instance also has a ``priority`` param where a higher value will be chosen @@ -378,18 +615,18 @@ in times of conflict. Unfortunately, updating the ``priority`` value directly isn't possible as the :class:`url_matcher.Patterns` is a **frozen** `dataclass`. The same is true for -:class:`~.OverrideRule`. This is made by design so that they are hashable and could +:class:`~.ApplyRule`. This is made by design so that they are hashable and could be deduplicated immediately without consequences of them changing in value. The only way that the ``priority`` value can be changed is by creating a new -:class:`~.OverrideRule` with a different ``priority`` value (`higher if it needs +:class:`~.ApplyRule` with a different ``priority`` value (`higher if it needs more priority`). You don't necessarily need to `delete` the **old** -:class:`~.OverrideRule` since they will be resolved via ``priority`` anyways. +:class:`~.ApplyRule` since they will be resolved via ``priority`` anyways. 
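The frozen-and-hashable behaviour described above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not web-poet's implementation: ``SketchRule`` and ``pick_rule`` are hypothetical names, the priority is stored directly on the rule rather than inside ``url_matcher.Patterns``, and URL matching is reduced to an exact string comparison.

```python
from dataclasses import dataclass, replace
from typing import Iterable, Optional


# Hypothetical stand-in for a rule; this is NOT web_poet.ApplyRule.
@dataclass(frozen=True)
class SketchRule:
    pattern: str
    use: str
    instead_of: str
    priority: int = 500


def pick_rule(
    rules: Iterable[SketchRule], pattern: str, instead_of: str
) -> Optional[SketchRule]:
    """Return the matching rule with the highest priority, if any."""
    candidates = [
        r for r in rules if r.pattern == pattern and r.instead_of == instead_of
    ]
    return max(candidates, key=lambda r: r.priority, default=None)


old = SketchRule("site_2.example", use="EcomSite2", instead_of="ProductGenericPage")

# A frozen instance rejects in-place updates of its fields.
try:
    old.priority = 1000  # type: ignore[misc]
except AttributeError:  # FrozenInstanceError is a subclass of AttributeError
    mutated = False
else:
    mutated = True

# So a new rule is created instead; the old rule can stay where it is.
new = replace(old, use="EcomSite2Copy", priority=1000)

# Frozen dataclasses are hashable, so exact duplicates collapse in a set.
rules = {old, old, new}

chosen = pick_rule(rules, "site_2.example", "ProductGenericPage")
```

Because the old rule is left in place and simply loses the priority comparison, nothing has to be deleted from the rule set to change which Page Object is selected.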
-Creating a new :class:`~.OverrideRule` with a higher priority could be as easy as: +Creating a new :class:`~.ApplyRule` with a higher priority could be as easy as: 1. Subclassing the Page Object in question. - 2. Create a new :func:`web_poet.handle_urls` annotation with the same URL + 2. Declare a new :func:`web_poet.handle_urls` decorator with the same URL pattern and Page Object to override but with a much higher priority. Here's an example: @@ -399,20 +636,20 @@ Here's an example: from web_poet import default_registry, consume_modules, handle_urls import ecommerce_page_objects, gadget_sites_page_objects, common_items - @handle_urls("site_2.com", overrides=common_items.ProductGenericPage, priority=1000) + @handle_urls("site_2.example", instead_of=common_items.ProductGenericPage, priority=1000) class EcomSite2Copy(ecommerce_page_objects.site_1.EcomSite1): def to_item(self): return super().to_item() Now, the conflicting **#2** and **#3** rules would never be selected because of -the new :class:`~.OverrideRule` having a much higher priority (see rule **#4**): +the new :class:`~.ApplyRule` having a much higher priority (see rule **#4**): .. code-block:: python - # 2. OverrideRule(for_patterns=Patterns(include=['site_2.com'], exclude=[], priority=500), use=<class 'ecommerce_page_objects.site_2.EcomSite2'>, instead_of=<class 'common_items.ProductGenericPage'>, meta={}) - # 3. OverrideRule(for_patterns=Patterns(include=['site_2.com'], exclude=[], priority=500), use=<class 'gadget_sites_page_objects.site_2.GadgetSite2'>, instead_of=<class 'common_items.ProductGenericPage'>, meta={}) + # 2. ApplyRule(for_patterns=Patterns(include=['site_2.example'], exclude=[], priority=500), use=<class 'ecommerce_page_objects.site_2.EcomSite2'>, instead_of=<class 'common_items.ProductGenericPage'>, to_return=None, meta={}) + # 3. 
ApplyRule(for_patterns=Patterns(include=['site_2.example'], exclude=[], priority=500), use=<class 'gadget_sites_page_objects.site_2.GadgetSite2'>, instead_of=<class 'common_items.ProductGenericPage'>, to_return=None, meta={}) - # 4. OverrideRule(for_patterns=Patterns(include=['site_2.com'], exclude=[], priority=1000), use=<class 'my_project.EcomSite2Copy'>, instead_of=<class 'common_items.ProductGenericPage'>, meta={}) + # 4. ApplyRule(for_patterns=Patterns(include=['site_2.example'], exclude=[], priority=1000), use=<class 'my_project.EcomSite2Copy'>, instead_of=<class 'common_items.ProductGenericPage'>, to_return=None, meta={}) A similar idea was also discussed in the :ref:`intro-improve-po` section. @@ -420,7 +657,7 @@ A similar idea was also discussed in the :ref:`intro-improve-po` section. **2. Specifically Selecting the Rules** When the last resort of ``priority``-resolution doesn't work, then you could always -specifically select the list of :class:`~.OverrideRule` you want to use. +specifically select the list of :class:`~.ApplyRule` you want to use. We **recommend** in creating an **inclusion**-list rather than an **exclusion**-list since the latter is quite brittle. For instance, an external package you're using @@ -429,8 +666,8 @@ were recently added. This could lead to a `silent-error` of receiving a differen set of rules than expected. This **inclusion**-list approach can be done by importing the Page Objects directly -and creating instances of :class:`~.OverrideRule` from it. You could also import -all of the available :class:`~.OverrideRule` using :meth:`~.PageObjectRegistry.get_overrides` +and creating instances of :class:`~.ApplyRule` from it. You could also import +all of the available :class:`~.ApplyRule` using :meth:`~.PageObjectRegistry.get_rules` to sift through the list of available rules and manually selecting the rules you need. 
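To make the brittleness of an exclusion-list concrete, here is a small sketch using plain strings in place of rules. All names in it (``available_v1``, ``select_included``, the ``pkg_*`` identifiers) are hypothetical and only stand in for Page Objects from external packages; no web-poet API is used.

```python
# Hypothetical rule identifiers standing in for Page Object classes.
available_v1 = ["pkg_a.PageObject1", "pkg_b.PageObject2", "pkg_c.PageObject3"]

# A later release of an external package adds a rule nobody has reviewed yet.
available_v2 = available_v1 + ["pkg_c.PageObject4"]

wanted = {"pkg_a.PageObject1", "pkg_b.PageObject2"}  # inclusion-list
unwanted = {"pkg_c.PageObject3"}  # exclusion-list


def select_included(available, include):
    """Keep only the rules that were explicitly asked for."""
    return [name for name in available if name in include]


def select_excluded(available, exclude):
    """Keep every rule except the ones explicitly rejected."""
    return [name for name in available if name not in exclude]


included_v1 = select_included(available_v1, wanted)
included_v2 = select_included(available_v2, wanted)
excluded_v1 = select_excluded(available_v1, unwanted)
excluded_v2 = select_excluded(available_v2, unwanted)
```

The inclusion-list yields the same selection before and after the upgrade, while the exclusion-list silently picks up the newly added rule.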
Most of the time, the needed rules are the ones which uses the Page Objects we're @@ -445,14 +682,14 @@ easily find the Page Object's rule using its `key`. Here's an example: consume_modules("package_A", "package_B", "package_C") rules = [ - default_registry[package_A.PageObject1], # OverrideRule(for_patterns=Patterns(include=['site_A.com'], exclude=[], priority=500), use=<class 'package_A.PageObject1'>, instead_of=<class 'GenericPage'>, meta={}) - default_registry[package_B.PageObject2], # OverrideRule(for_patterns=Patterns(include=['site_B.com'], exclude=[], priority=500), use=<class 'package_B.PageObject2'>, instead_of=<class 'GenericPage'>, meta={}) - default_registry[package_C.PageObject3], # OverrideRule(for_patterns=Patterns(include=['site_C.com'], exclude=[], priority=500), use=<class 'package_C.PageObject3'>, instead_of=<class 'GenericPage'>, meta={}) + default_registry[package_A.PageObject1], # ApplyRule(for_patterns=Patterns(include=['site_A.example'], exclude=[], priority=500), use=<class 'package_A.PageObject1'>, instead_of=<class 'GenericPage'>, to_return=None, meta={}) + default_registry[package_B.PageObject2], # ApplyRule(for_patterns=Patterns(include=['site_B.example'], exclude=[], priority=500), use=<class 'package_B.PageObject2'>, instead_of=<class 'GenericPage'>, to_return=None, meta={}) + default_registry[package_C.PageObject3], # ApplyRule(for_patterns=Patterns(include=['site_C.example'], exclude=[], priority=500), use=<class 'package_C.PageObject3'>, instead_of=<class 'GenericPage'>, to_return=None, meta={}) ] -Another approach would be using the :meth:`~.PageObjectRegistry.search_overrides` +Another approach would be using the :meth:`~.PageObjectRegistry.search_rules` functionality as described from this tutorial section: :ref:`intro-rule-subset`. 
-The :meth:`~.PageObjectRegistry.search_overrides` is quite useful in cases wherein +The :meth:`~.PageObjectRegistry.search_rules` is quite useful in cases wherein the **POP** contains a lot of rules as it presents a utility for programmatically searching for them. @@ -466,19 +703,19 @@ Here's an example: consume_modules("package_A", "package_B", "package_C") - rule_from_A = default_registry.search_overrides(use=package_A.PageObject1) + rule_from_A = default_registry.search_rules(use=package_A.PageObject1) print(rule_from_A) - # [OverrideRule(for_patterns=Patterns(include=['site_A.com'], exclude=[], priority=500), use=<class 'package_A.PageObject1'>, instead_of=<class 'GenericPage'>, meta={})] + # [ApplyRule(for_patterns=Patterns(include=['site_A.example'], exclude=[], priority=500), use=<class 'package_A.PageObject1'>, instead_of=<class 'GenericPage'>, to_return=None, meta={})] - rule_from_B = default_registry.search_overrides(instead_of=GenericProductPage) + rule_from_B = default_registry.search_rules(instead_of=GenericProductPage) print(rule_from_B) # [] - rule_from_C = default_registry.search_overrides(for_patterns=Patterns(include=["site_C.com"])) + rule_from_C = default_registry.search_rules(for_patterns=Patterns(include=["site_C.example"])) print(rule_from_C) # [ - # OverrideRule(for_patterns=Patterns(include=['site_C.com'], exclude=[], priority=500), use=<class 'package_C.PageObject3'>, instead_of=<class 'GenericPage'>, meta={}), - # OverrideRule(for_patterns=Patterns(include=['site_C.com'], exclude=[], priority=1000), use=<class 'package_C.PageObject3_improved'>, instead_of=<class 'GenericPage'>, meta={}) + # ApplyRule(for_patterns=Patterns(include=['site_C.example'], exclude=[], priority=500), use=<class 'package_C.PageObject3'>, instead_of=<class 'GenericPage'>, to_return=None, meta={}), + # ApplyRule(for_patterns=Patterns(include=['site_C.example'], exclude=[], priority=1000), use=<class 'package_C.PageObject3_improved'>, instead_of=<class 
'GenericPage'>, to_return=None, meta={}) # ] rules = rule_from_A + rule_from_B + rule_from_C diff --git a/docs/intro/tutorial.rst b/docs/intro/tutorial.rst index ab30bbbb..2bf72ec4 100644 --- a/docs/intro/tutorial.rst +++ b/docs/intro/tutorial.rst @@ -1,4 +1,4 @@ -.. _`intro-tutorial`: +.. _intro-tutorial: ===================== web-poet on a surface diff --git a/docs/license.rst b/docs/license.rst index e6a41ca8..e647e180 100644 --- a/docs/license.rst +++ b/docs/license.rst @@ -1,4 +1,4 @@ -.. _`license`: +.. _license: ======= License diff --git a/docs/requirements.txt b/docs/requirements.txt index 5c5d4d38..0ff12812 100644 --- a/docs/requirements.txt +++ b/docs/requirements.txt @@ -1,2 +1,2 @@ -Sphinx==5.0.1 +Sphinx==5.3.0 sphinx-rtd-theme==1.0.0 diff --git a/pyproject.toml b/pyproject.toml index 892b5aab..efea5ff0 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -6,3 +6,9 @@ multi_line_output = 3 show_error_codes = true ignore_missing_imports = true no_warn_no_return = true + +[[tool.mypy.overrides]] +module = "tests.po_lib_to_return.*" +# Ignore mypy errors since the Page Objects contain arbitrary reference values +# used for assertions which have varying types. This upsets mypy. +ignore_errors = true diff --git a/tests/po_lib/__init__.py b/tests/po_lib/__init__.py index b5545b60..91ae0036 100644 --- a/tests/po_lib/__init__.py +++ b/tests/po_lib/__init__.py @@ -12,8 +12,9 @@ class POBase(ItemPage): - expected_overrides: Type[ItemPage] + expected_instead_of: Type[ItemPage] expected_patterns: Patterns + expected_to_return: Any = None expected_meta: Dict[str, Any] @@ -25,19 +26,22 @@ class POTopLevelOverriden2(ItemPage): ... -# This first annotation is ignored. A single annotation per registry is allowed -@handle_urls("example.com", overrides=POTopLevelOverriden1) +# This first decorator is ignored. A single ``ApplyRule`` with the same Page +# Object to be used per registry is allowed. 
+@handle_urls("example.com", instead_of=POTopLevelOverriden1) @handle_urls( - "example.com", overrides=POTopLevelOverriden1, exclude="/*.jpg|", priority=300 + "example.com", instead_of=POTopLevelOverriden1, exclude="/*.jpg|", priority=300 ) class POTopLevel1(POBase): - expected_overrides = POTopLevelOverriden1 + expected_instead_of = POTopLevelOverriden1 expected_patterns = Patterns(["example.com"], ["/*.jpg|"], priority=300) + expected_to_return = None expected_meta = {} # type: ignore -@handle_urls("example.com", overrides=POTopLevelOverriden2) +@handle_urls("example.com", instead_of=POTopLevelOverriden2) class POTopLevel2(POBase): - expected_overrides = POTopLevelOverriden2 + expected_instead_of = POTopLevelOverriden2 expected_patterns = Patterns(["example.com"]) + expected_to_return = None expected_meta = {} # type: ignore diff --git a/tests/po_lib/a_module.py b/tests/po_lib/a_module.py index c3e7810d..941f48f6 100644 --- a/tests/po_lib/a_module.py +++ b/tests/po_lib/a_module.py @@ -8,8 +8,9 @@ class POModuleOverriden(ItemPage): ... 
-@handle_urls("example.com", overrides=POModuleOverriden, extra_arg="foo") +@handle_urls("example.com", instead_of=POModuleOverriden, extra_arg="foo") class POModule(POBase): - expected_overrides = POModuleOverriden + expected_instead_of = POModuleOverriden expected_patterns = Patterns(["example.com"]) + expected_to_return = None expected_meta = {"extra_arg": "foo"} # type: ignore diff --git a/tests/po_lib/nested_package/__init__.py b/tests/po_lib/nested_package/__init__.py index 64be2384..63547ea4 100644 --- a/tests/po_lib/nested_package/__init__.py +++ b/tests/po_lib/nested_package/__init__.py @@ -11,9 +11,10 @@ class PONestedPkgOverriden(ItemPage): @handle_urls( include=["example.com", "example.org"], exclude=["/*.jpg|"], - overrides=PONestedPkgOverriden, + instead_of=PONestedPkgOverriden, ) class PONestedPkg(POBase): - expected_overrides = PONestedPkgOverriden + expected_instead_of = PONestedPkgOverriden expected_patterns = Patterns(["example.com", "example.org"], ["/*.jpg|"]) + expected_to_return = None expected_meta = {} # type: ignore diff --git a/tests/po_lib/nested_package/a_nested_module.py b/tests/po_lib/nested_package/a_nested_module.py index fc2e837e..5d44b39c 100644 --- a/tests/po_lib/nested_package/a_nested_module.py +++ b/tests/po_lib/nested_package/a_nested_module.py @@ -11,11 +11,12 @@ class PONestedModuleOverriden(ItemPage): @handle_urls( include=["example.com", "example.org"], exclude=["/*.jpg|"], - overrides=PONestedModuleOverriden, + instead_of=PONestedModuleOverriden, ) class PONestedModule(POBase): - expected_overrides = PONestedModuleOverriden + expected_instead_of = PONestedModuleOverriden expected_patterns = Patterns( include=["example.com", "example.org"], exclude=["/*.jpg|"] ) + expected_to_return = None expected_meta = {} # type: ignore diff --git a/tests/po_lib_sub/__init__.py b/tests/po_lib_sub/__init__.py index 442836c3..50ffd99b 100644 --- a/tests/po_lib_sub/__init__.py +++ b/tests/po_lib_sub/__init__.py @@ -9,7 +9,7 @@ class 
POBase(ItemPage): - expected_overrides: Type[ItemPage] + expected_instead_of: Type[ItemPage] expected_patterns: Patterns expected_meta: Dict[str, Any] @@ -18,8 +18,9 @@ class POLibSubOverriden(ItemPage): ... -@handle_urls("sub_example.com", overrides=POLibSubOverriden) +@handle_urls("sub.example", instead_of=POLibSubOverriden) class POLibSub(POBase): - expected_overrides = POLibSubOverriden - expected_patterns = Patterns(["sub_example.com"]) + expected_instead_of = POLibSubOverriden + expected_patterns = Patterns(["sub.example"]) + expected_to_return = None expected_meta = {} # type: ignore diff --git a/tests/po_lib_to_return/__init__.py b/tests/po_lib_to_return/__init__.py new file mode 100644 index 00000000..935e974e --- /dev/null +++ b/tests/po_lib_to_return/__init__.py @@ -0,0 +1,196 @@ +import attrs +from url_matcher import Patterns + +from web_poet import Injectable, ItemPage, Returns, field, handle_urls, item_from_fields + + +@attrs.define +class Product: + name: str + price: float + + +@attrs.define +class ProductSeparate: + name: str + price: float + + +@attrs.define +class ProductSimilar: + name: str + price: float + + +@attrs.define +class ProductMoreFields(Product): + brand: str + + +@attrs.define +class ProductFewerFields: + name: str + + +@handle_urls("example.com") +class SomePage(ItemPage): + """A PO which is only marked by the URL pattern.""" + + expected_instead_of = None + expected_patterns = Patterns(["example.com"]) + expected_to_return = None + expected_meta = {} + + @field + def name(self) -> str: + return "some name" + + +@handle_urls("example.com") +class ProductPage(ItemPage[Product]): + """A base PO to populate the Product item's fields.""" + + expected_instead_of = None + expected_patterns = Patterns(["example.com"]) + expected_to_return = Product + expected_meta = {} + + @field + def name(self) -> str: + return "name" + + @field + def price(self) -> float: + return 12.99 + + +@handle_urls("example.com", instead_of=ProductPage) +class 
ImprovedProductPage(ProductPage): + """A custom PO inheriting from a base PO which alters some field values.""" + + expected_instead_of = ProductPage + expected_patterns = Patterns(["example.com"]) + expected_to_return = Product + expected_meta = {} + + @field + def name(self) -> str: + return "improved name" + + +@handle_urls("example.com", instead_of=ProductPage) +class SeparateProductPage(ItemPage[ProductSeparate]): + """Same case as with ``ImprovedProductPage`` but it doesn't inherit from + ``ProductPage``. + """ + + expected_instead_of = ProductPage + expected_patterns = Patterns(["example.com"]) + expected_to_return = ProductSeparate + expected_meta = {} + + @field + def name(self) -> str: + return "separate name" + + +@handle_urls("example.com", instead_of=ProductPage) +class SimilarProductPage(ProductPage, Returns[ProductSimilar]): + """A custom PO inheriting from a base PO returning the same fields but in + a different item class. + """ + + expected_instead_of = ProductPage + expected_patterns = Patterns(["example.com"]) + expected_to_return = ProductSimilar + expected_meta = {} + + +@handle_urls("example.com", instead_of=ProductPage) +class MoreProductPage(ProductPage, Returns[ProductMoreFields]): + """A custom PO inheriting from a base PO returning more items using a + different item class. + """ + + expected_instead_of = ProductPage + expected_patterns = Patterns(["example.com"]) + expected_to_return = ProductMoreFields + expected_meta = {} + + @field + def brand(self) -> str: + return "brand" + + +@handle_urls("example.com", instead_of=ProductPage) +class LessProductPage( + ProductPage, Returns[ProductFewerFields], skip_nonitem_fields=True +): + """A custom PO inheriting from a base PO returning less items using a + different item class. 
+ """ + + expected_instead_of = ProductPage + expected_patterns = Patterns(["example.com"]) + expected_to_return = ProductFewerFields + expected_meta = {} + + @field + def brand(self) -> str: + return "brand" + + +@handle_urls("example.com", instead_of=ProductPage, to_return=ProductSimilar) +class CustomProductPage(ProductPage, Returns[Product]): + """A custom PO inheriting from a base PO returning the same fields but in + a different item class. + + This PO is the same with ``SimilarProductPage`` but passes a ``to_return`` + in the ``@handle_urls`` decorator. + + This tests the case that the type passed via the ``to_return`` parameter + from ``@handle_urls`` takes priority. + """ + + expected_instead_of = ProductPage + expected_patterns = Patterns(["example.com"]) + expected_to_return = ProductSimilar + expected_meta = {} + + +@handle_urls("example.com", instead_of=ProductPage, to_return=ProductSimilar) +class CustomProductPageNoReturns(ProductPage): + """Same case as with ``CustomProductPage`` but doesn't inherit from + ``Returns[Product]``. + """ + + expected_instead_of = ProductPage + expected_patterns = Patterns(["example.com"]) + expected_to_return = ProductSimilar + expected_meta = {} + + +@handle_urls("example.com", to_return=Product) +class CustomProductPageDataTypeOnly(Injectable): + """A PO that doesn't inherit from ``ItemPage`` and ``WebPage`` which means + it doesn't inherit from the ``Returns`` class. + + This tests the case that the ``to_return`` parameter in ``@handle_urls`` + should properly use it in the rules. 
+ """ + + expected_instead_of = None + expected_patterns = Patterns(["example.com"]) + expected_to_return = Product + expected_meta = {} + + @field + def name(self) -> str: + return "name" + + @field + def price(self) -> float: + return 12.99 + + async def to_item(self) -> Product: + return await item_from_fields(self, item_cls=Product) diff --git a/tests/test_fields.py b/tests/test_fields.py index 86e75c47..731152cc 100644 --- a/tests/test_fields.py +++ b/tests/test_fields.py @@ -4,6 +4,20 @@ import attrs import pytest +from tests.po_lib_to_return import ( + CustomProductPage, + CustomProductPageDataTypeOnly, + CustomProductPageNoReturns, + ImprovedProductPage, + LessProductPage, + MoreProductPage, + Product, + ProductFewerFields, + ProductMoreFields, + ProductPage, + ProductSimilar, + SimilarProductPage, +) from web_poet import ( HttpResponse, Injectable, @@ -370,6 +384,46 @@ def field_foo_cached(self): assert page.field_foo_cached == "foo" +@pytest.mark.asyncio +async def test_field_with_handle_urls() -> None: + + page = ProductPage() + assert page.name == "name" + assert page.price == 12.99 + assert await page.to_item() == Product(name="name", price=12.99) + + page = ImprovedProductPage() + assert page.name == "improved name" + assert page.price == 12.99 + assert await page.to_item() == Product(name="improved name", price=12.99) + + page = SimilarProductPage() + assert page.name == "name" + assert page.price == 12.99 + assert await page.to_item() == ProductSimilar(name="name", price=12.99) + + page = MoreProductPage() + assert page.name == "name" + assert page.price == 12.99 + assert page.brand == "brand" + assert await page.to_item() == ProductMoreFields( + name="name", price=12.99, brand="brand" + ) + + page = LessProductPage() + assert page.name == "name" + assert await page.to_item() == ProductFewerFields(name="name") + + for page in [ # type: ignore[assignment] + CustomProductPage(), + CustomProductPageNoReturns(), + CustomProductPageDataTypeOnly(), + ]: + 
assert page.name == "name" + assert page.price == 12.99 + assert await page.to_item() == Product(name="name", price=12.99) + + def test_field_processors_sync() -> None: def proc1(s): return s + "x" diff --git a/tests/test_overrides.py b/tests/test_overrides.py deleted file mode 100644 index 2a9d09f5..00000000 --- a/tests/test_overrides.py +++ /dev/null @@ -1,104 +0,0 @@ -import pytest -from url_matcher import Patterns - -from tests.po_lib import POTopLevel1, POTopLevel2, POTopLevelOverriden2 -from tests.po_lib.a_module import POModule, POModuleOverriden -from tests.po_lib.nested_package import PONestedPkg -from tests.po_lib.nested_package.a_nested_module import PONestedModule -from tests.po_lib_sub import POLibSub -from web_poet import OverrideRule, PageObjectRegistry, consume_modules, default_registry - -POS = {POTopLevel1, POTopLevel2, POModule, PONestedPkg, PONestedModule} - - -def test_override_rule_uniqueness() -> None: - """The same instance of an OverrideRule with the same attribute values should - have the same hash identity. - """ - - patterns = Patterns(include=["example.com"], exclude=["example.com/blog"]) - - rule1 = OverrideRule( - for_patterns=patterns, - use=POTopLevel1, - instead_of=POTopLevelOverriden2, - meta={"key_1": 1}, - ) - rule2 = OverrideRule( - for_patterns=patterns, - use=POTopLevel1, - instead_of=POTopLevelOverriden2, - meta={"key_2": 2}, - ) - - assert hash(rule1) == hash(rule2) - - -def test_list_page_objects_all() -> None: - rules = default_registry.get_overrides() - page_objects = {po.use for po in rules} - - # Note that the 'tests_extra.po_lib_sub_not_imported.POLibSubNotImported' - # Page Object is not included here since it was never imported anywhere in - # our test package. It would only be included if we run any of the following - # below. (Note that they should run before `get_overrides` is called.) 
- # - from tests_extra import po_lib_sub_not_imported - # - import tests_extra.po_lib_sub_not_imported - # - web_poet.consume_modules("tests_extra") - # Merely having `import tests_extra` won't work since the subpackages and - # modules needs to be traversed and imported as well. - assert all(["po_lib_sub_not_imported" not in po.__module__ for po in page_objects]) - - # Ensure that ALL Override Rules are returned as long as the given - # registry's @handle_urls annotation was used. - assert page_objects == POS.union({POLibSub}) - for rule in rules: - # We're ignoring the types below since mypy expects ``Type[ItemPage]`` - # which doesn't contain the ``expected_*`` fields in our tests. - assert rule.instead_of == rule.use.expected_overrides, rule.use # type: ignore[attr-defined] - assert rule.for_patterns == rule.use.expected_patterns, rule.use # type: ignore[attr-defined] - assert rule.meta == rule.use.expected_meta, rule.use # type: ignore[attr-defined] - - -def test_consume_module_not_existing() -> None: - with pytest.raises(ImportError): - consume_modules("this_does_not_exist") - - -def test_list_page_objects_all_consume() -> None: - """A test similar to the one above but calls ``consume_modules()`` to properly - load the @handle_urls annotations from other modules/packages. 
- """ - consume_modules("tests_extra") - rules = default_registry.get_overrides() - page_objects = {po.use for po in rules} - assert any(["po_lib_sub_not_imported" in po.__module__ for po in page_objects]) - - -def test_registry_search_overrides() -> None: - rules = default_registry.search_overrides(use=POTopLevel2) - assert len(rules) == 1 - assert rules[0].use == POTopLevel2 - - rules = default_registry.search_overrides(instead_of=POTopLevelOverriden2) - assert len(rules) == 1 - assert rules[0].instead_of == POTopLevelOverriden2 - - # Such rules doesn't exist - rules = default_registry.search_overrides(use=POModuleOverriden) - assert len(rules) == 0 - - -def test_from_override_rules() -> None: - rules = [ - OverrideRule( - for_patterns=Patterns(include=["sample.com"]), - use=POTopLevel1, - instead_of=POTopLevelOverriden2, - ) - ] - - registry = PageObjectRegistry.from_override_rules(rules) - - assert registry.get_overrides() == rules - assert default_registry.get_overrides() != rules diff --git a/tests/test_rules.py b/tests/test_rules.py new file mode 100644 index 00000000..cdebe78b --- /dev/null +++ b/tests/test_rules.py @@ -0,0 +1,366 @@ +import attrs +import pytest +from url_matcher import Patterns + +from tests.po_lib import ( + POTopLevel1, + POTopLevel2, + POTopLevelOverriden1, + POTopLevelOverriden2, +) +from tests.po_lib.a_module import POModule, POModuleOverriden +from tests.po_lib.nested_package import PONestedPkg +from tests.po_lib.nested_package.a_nested_module import PONestedModule +from tests.po_lib_sub import POLibSub +from tests.po_lib_to_return import ( + CustomProductPage, + CustomProductPageDataTypeOnly, + CustomProductPageNoReturns, + ImprovedProductPage, + LessProductPage, + MoreProductPage, + Product, + ProductPage, + ProductSimilar, + SeparateProductPage, + SimilarProductPage, + SomePage, +) +from web_poet import ( + ApplyRule, + OverrideRule, + PageObjectRegistry, + consume_modules, + default_registry, + handle_urls, +) + +POS = { + 
CustomProductPage, + CustomProductPageNoReturns, + CustomProductPageDataTypeOnly, + ImprovedProductPage, + LessProductPage, + MoreProductPage, + POTopLevel1, + POTopLevel2, + POModule, + PONestedPkg, + PONestedModule, + ProductPage, + SeparateProductPage, + SimilarProductPage, + SomePage, +} + + +def test_apply_rule_uniqueness() -> None: + """The same instance of an ApplyRule with the same attribute values should + have the same hash identity. + """ + + patterns = Patterns(include=["example.com"], exclude=["example.com/blog"]) + patterns_b = Patterns(include=["example.com/b"]) + + rule1 = ApplyRule( + for_patterns=patterns, + use=POTopLevel1, + instead_of=POTopLevelOverriden1, + meta={"key_1": 1}, + ) + rule2 = ApplyRule( + for_patterns=patterns, + use=POTopLevel1, + instead_of=POTopLevelOverriden1, + meta={"key_2": 2}, + ) + # The ``meta`` parameter is ignored in the hash. + assert hash(rule1) == hash(rule2) + + params = [ + { + "for_patterns": patterns, + "use": POTopLevel1, + "instead_of": POTopLevelOverriden1, + "to_return": Product, + }, + { + "for_patterns": patterns_b, + "use": POTopLevel2, + "instead_of": POTopLevelOverriden2, + "to_return": ProductSimilar, + }, + ] + + for change in params[0].keys(): + # Changing any one of the params should result in a hash mismatch + rule1 = ApplyRule(**params[0]) + kwargs = params[0].copy() + kwargs.update({change: params[1][change]}) + rule2 = ApplyRule(**kwargs) + assert hash(rule1) != hash(rule2) + + +def test_apply_rule_immutability() -> None: + patterns = Patterns(include=["example.com"], exclude=["example.com/blog"]) + + rule = ApplyRule( + for_patterns=patterns, + use=POTopLevel1, + instead_of=POTopLevelOverriden1, + ) + + with pytest.raises(attrs.exceptions.FrozenInstanceError): + rule.for_patterns = Patterns(include=["example.com/"]) # type: ignore[misc] + + with pytest.raises(attrs.exceptions.FrozenInstanceError): + rule.use = POTopLevel2 # type: ignore[misc] + + with 
pytest.raises(attrs.exceptions.FrozenInstanceError): + rule.instead_of = POTopLevelOverriden2 # type: ignore[misc] + + +def test_apply_rule_converter_on_pattern() -> None: + # Passing strings should auto-convert them into Patterns + rule = ApplyRule("example.com", use=POTopLevel1, instead_of=POTopLevelOverriden2) + assert rule.for_patterns == Patterns( + include=("example.com",), exclude=(), priority=500 + ) + + # Passing Patterns should still work + rule = ApplyRule( + for_patterns=Patterns(["example.com"]), + use=POTopLevel1, + instead_of=POTopLevelOverriden2, + ) + assert rule.for_patterns == Patterns( + include=("example.com",), exclude=(), priority=500 + ) + + +def test_apply_rule_kwargs_only() -> None: + + params = { + "use": POTopLevel1, + "instead_of": POTopLevelOverriden2, + "to_return": Product, + "meta": {"key_2": 2}, + } + remove = set() + + for param_name in params: + remove.add(param_name) + with pytest.raises(TypeError): + ApplyRule( + "example.com", + *[params[r] for r in remove], + **{k: v for k, v in params.items() if k not in remove} # type: ignore[arg-type] + ) + + +def test_list_page_objects_all() -> None: + rules = default_registry.get_rules() + page_objects = {po.use for po in rules} + + # Note that the 'tests_extra.po_lib_sub_not_imported.POLibSubNotImported' + # Page Object is not included here since it was never imported anywhere in + # our test package. It would only be included if we ran any of the + # following. (Note that they should run before `get_rules` is called.) + # - from tests_extra import po_lib_sub_not_imported + # - import tests_extra.po_lib_sub_not_imported + # - web_poet.consume_modules("tests_extra") + # Merely having `import tests_extra` won't work since the subpackages and + # modules need to be traversed and imported as well.
+ assert all(["po_lib_sub_not_imported" not in po.__module__ for po in page_objects]) + + # Ensure that ALL rules are returned as long as the given + # registry's @handle_urls decorator was used. + assert page_objects == POS.union({POLibSub}) + for rule in rules: + # We're ignoring the types below since mypy expects ``Type[ItemPage]`` + # which doesn't contain the ``expected_*`` fields in our tests. + assert rule.instead_of == rule.use.expected_instead_of, rule.use # type: ignore[attr-defined] + assert rule.for_patterns == rule.use.expected_patterns, rule.use # type: ignore[attr-defined] + assert rule.to_return == rule.use.expected_to_return, rule.use # type: ignore[attr-defined] + assert rule.meta == rule.use.expected_meta, rule.use # type: ignore[attr-defined] + + +def test_registry_get_overrides_deprecation() -> None: + msg = "The 'get_overrides' method is deprecated. Use 'get_rules' instead." + with pytest.warns(DeprecationWarning, match=msg): + rules = default_registry.get_overrides() + + # It should still work as usual + assert len(rules) == len(default_registry.get_rules()) + + # but ``.get_overrides()`` should return ``ApplyRule`` instances and + # not the old ``OverrideRule``. + assert all(isinstance(r, ApplyRule) for r in rules) + + +def test_consume_module_not_existing() -> None: + with pytest.raises(ImportError): + consume_modules("this_does_not_exist") + + +def test_list_page_objects_all_consume() -> None: + """A test similar to the one above but calls ``consume_modules()`` to properly + load the ``@handle_urls`` decorators from other modules/packages.
+ """ + consume_modules("tests_extra") + rules = default_registry.get_rules() + page_objects = {po.use for po in rules} + assert any(["po_lib_sub_not_imported" in po.__module__ for po in page_objects]) + + +def test_registry_search_rules() -> None: + # param: use + rules = default_registry.search_rules(use=POTopLevel2) + assert len(rules) == 1 + assert rules[0].use == POTopLevel2 + + # param: instead_of + rules = default_registry.search_rules(instead_of=POTopLevelOverriden2) + assert len(rules) == 1 + assert rules[0].instead_of == POTopLevelOverriden2 + + # param: to_return + rules = default_registry.search_rules(to_return=Product) + assert rules == [ + ApplyRule("example.com", use=ProductPage, to_return=Product), + ApplyRule( + "example.com", + use=ImprovedProductPage, + instead_of=ProductPage, + to_return=Product, + ), + ApplyRule( + "example.com", + # mypy complains here since it's expecting a container class when + # declared, i.e., ``ItemPage[SomeItem]`` + use=CustomProductPageDataTypeOnly, # type: ignore[arg-type] + to_return=Product, + ), + ] + + # params: to_return and use + rules = default_registry.search_rules(to_return=Product, use=ImprovedProductPage) + assert len(rules) == 1 + assert rules[0].to_return == Product + assert rules[0].use == ImprovedProductPage + + # No such rule exists + rules = default_registry.search_rules(use=POModuleOverriden) + assert len(rules) == 0 + + +def test_registry_search_overrides_deprecation() -> None: + msg = "The 'search_overrides' method is deprecated. Use 'search_rules' instead." + with pytest.warns(DeprecationWarning, match=msg): + rules = default_registry.search_overrides(use=POTopLevel2) + + # It should still work as usual + assert len(rules) == 1 + assert rules[0].use == POTopLevel2 + + # The rules from ``.search_overrides()`` should be ``ApplyRule`` and + # not the old ``OverrideRule``.
+ assert isinstance(rules[0], ApplyRule) + + +def test_from_apply_rules() -> None: + rules = [ + ApplyRule( + for_patterns=Patterns(include=["sample.com"]), + use=POTopLevel1, + instead_of=POTopLevelOverriden2, + ) + ] + + registry = PageObjectRegistry.from_apply_rules(rules) + + assert registry.get_rules() == rules + assert default_registry.get_rules() != rules + + +def test_from_override_rules_deprecation_using_ApplyRule() -> None: + rules = [ + ApplyRule( + for_patterns=Patterns(include=["sample.com"]), + use=POTopLevel1, + instead_of=POTopLevelOverriden2, + ) + ] + + msg = ( + "The 'from_override_rules' method is deprecated. " + "Use 'from_apply_rules' instead." + ) + with pytest.warns(DeprecationWarning, match=msg): + registry = PageObjectRegistry.from_override_rules(rules) + + assert registry.get_rules() == rules + assert default_registry.get_rules() != rules + + +def test_from_override_rules_deprecation_using_OverrideRule() -> None: + rules = [ + OverrideRule( + for_patterns=Patterns(include=["sample.com"]), + use=POTopLevel1, + instead_of=POTopLevelOverriden2, + ) + ] + + msg = ( + "The 'from_override_rules' method is deprecated. " + "Use 'from_apply_rules' instead." + ) + with pytest.warns(DeprecationWarning, match=msg): + registry = PageObjectRegistry.from_override_rules(rules) + + assert registry.get_rules() == rules + assert default_registry.get_rules() != rules + + +def test_handle_urls_deprecation() -> None: + before_count = len(default_registry.get_rules()) + + msg = ( + "The 'overrides' parameter in @handle_urls is deprecated. Use the " + "'instead_of' parameter." + ) + with pytest.warns(DeprecationWarning, match=msg): + + @handle_urls("example.com", overrides=CustomProductPage) + class PageWithDeprecatedOverrides: + ... + + # Despite the deprecation, it should still properly add the rule in the + # registry. 
+ after_count = len(default_registry.get_rules()) + assert after_count == before_count + 1 + + # The added rule should have its deprecated 'overrides' parameter converted + # into the new 'instead_of' parameter. + rules = default_registry.search_rules( + instead_of=CustomProductPage, use=PageWithDeprecatedOverrides + ) + assert rules == [ + ApplyRule( + "example.com", + instead_of=CustomProductPage, + # mypy complains here since it's expecting a container class when + # declared, i.e, ``ItemPage[SomeItem]`` + use=PageWithDeprecatedOverrides, # type: ignore[arg-type] + ) + ] + + +def test_override_rule_deprecation() -> None: + msg = ( + "web_poet.rules.OverrideRule is deprecated, " + "instantiate web_poet.rules.ApplyRule instead." + ) + with pytest.warns(DeprecationWarning, match=msg): + OverrideRule(for_patterns=None, use=None) diff --git a/tests_extra/po_lib_sub_not_imported/__init__.py b/tests_extra/po_lib_sub_not_imported/__init__.py index a3c6f9d9..0327f4ef 100644 --- a/tests_extra/po_lib_sub_not_imported/__init__.py +++ b/tests_extra/po_lib_sub_not_imported/__init__.py @@ -12,7 +12,7 @@ class POBase: - expected_overrides: Type[ItemPage] + expected_instead_of: Type[ItemPage] expected_patterns: Patterns expected_meta: Dict[str, Any] @@ -21,8 +21,9 @@ class POLibSubOverridenNotImported: ... 
-@handle_urls("sub_example_not_imported.com", overrides=POLibSubOverridenNotImported) +@handle_urls("sub_not_imported.example", instead_of=POLibSubOverridenNotImported) class POLibSubNotImported(POBase): - expected_overrides = POLibSubOverridenNotImported - expected_patterns = Patterns(["sub_example_not_imported.com"]) + expected_instead_of = POLibSubOverridenNotImported + expected_patterns = Patterns(["sub_not_imported.example"]) + expected_to_return = None expected_meta = {} # type: ignore diff --git a/web_poet/__init__.py b/web_poet/__init__.py index 0d7fb36e..dbb3afc2 100644 --- a/web_poet/__init__.py +++ b/web_poet/__init__.py @@ -1,5 +1,4 @@ from .fields import field, item_from_fields, item_from_fields_sync -from .overrides import OverrideRule, PageObjectRegistry, consume_modules from .page_inputs import ( BrowserHtml, HttpClient, @@ -15,6 +14,7 @@ ) from .pages import Injectable, ItemPage, ItemWebPage, Returns, WebPage from .requests import request_downloader_var +from .rules import ApplyRule, OverrideRule, PageObjectRegistry, consume_modules from .utils import cached_method default_registry = PageObjectRegistry() diff --git a/web_poet/_typing.py b/web_poet/_typing.py index a5ec9fed..9f2578c3 100644 --- a/web_poet/_typing.py +++ b/web_poet/_typing.py @@ -18,7 +18,14 @@ def is_generic_alias(obj) -> bool: def get_generic_parameter(cls): - for base in cls.__orig_bases__: + for base in getattr(cls, "__orig_bases__", []): if is_generic_alias(base): args = _get_args(base) return args[0] + + +def get_item_cls(cls, default=None): + param = get_generic_parameter(cls) + if param is None or isinstance(param, typing.TypeVar): # class is not parametrized + return default + return param diff --git a/web_poet/overrides.py b/web_poet/overrides.py index 3727d719..f80db428 100644 --- a/web_poet/overrides.py +++ b/web_poet/overrides.py @@ -1,276 +1,6 @@ -from __future__ import annotations # https://www.python.org/dev/peps/pep-0563/ - -import importlib -import importlib.util 
-import pkgutil import warnings -from collections import deque -from dataclasses import dataclass, field -from operator import attrgetter -from typing import Any, Dict, Iterable, List, Optional, Type, TypeVar, Union - -from url_matcher import Patterns - -from web_poet.pages import ItemPage -from web_poet.utils import as_list - -Strings = Union[str, Iterable[str]] - -PageObjectRegistryTV = TypeVar("PageObjectRegistryTV", bound="PageObjectRegistry") - - -@dataclass(frozen=True) -class OverrideRule: - """A single override rule that specifies when a Page Object should be used - in lieu of another. - - This is instantiated when using the :func:`web_poet.handle_urls` decorator. - It's also being returned as a ``List[OverrideRule]`` when calling the - ``web_poet.default_registry``'s :meth:`~.PageObjectRegistry.get_overrides` - method. - - You can access any of its attributes: - - * ``for_patterns`` - contains the list of URL patterns associated with - this rule. You can read the API documentation of the `url-matcher - <https://url-matcher.readthedocs.io/>`_ package for more information - about the patterns. - * ``use`` - The Page Object that will be **used**. - * ``instead_of`` - The Page Object that will be **replaced**. - * ``meta`` - Any other information you may want to store. This doesn't - do anything for now but may be useful for future API updates. - - .. tip:: - - The :class:`~.OverrideRule` is also hashable. This makes it easy to store - unique rules and identify any duplicates. - """ - - for_patterns: Patterns - use: Type[ItemPage] - instead_of: Type[ItemPage] - meta: Dict[str, Any] = field(default_factory=dict) - - def __hash__(self): - return hash((self.for_patterns, self.use, self.instead_of)) - - -class PageObjectRegistry(dict): - """This contains the mapping rules that associates the Page Objects available - for a given URL matching rule. 
- - Note that it's simply a ``dict`` subclass with added functionalities on - storing, retrieving, and searching for the :class:`~.OverrideRule` instances. - The **value** represents the :class:`~.OverrideRule` instance from which the - Page Object in the **key** is allowed to be used. Since it's essentially a - ``dict``, you can use any ``dict`` operations with it. - - ``web-poet`` already provides a default Registry named ``default_registry`` - for convenience. It can be directly accessed via: - - .. code-block:: python - - from web_poet import handle_urls, default_registry, WebPage - - @handle_urls("example.com", overrides=ProductPageObject) - class ExampleComProductPage(WebPage): - ... - - override_rules = default_registry.get_overrides() - - Notice that the ``@handle_urls`` that we're using is a part of the - ``default_registry``. This provides a shorter and quicker way to interact - with the built-in default :class:`~.PageObjectRegistry` instead of writing - the longer ``@default_registry.handle_urls``. - - .. note:: - - It is encouraged to simply use and import the already existing registry - via ``from web_poet import default_registry`` instead of creating your - own :class:`~.PageObjectRegistry` instance. Using multiple registries - would be unwieldy in most cases. - - However, it might be applicable in certain scenarios like storing custom - rules to separate it from the ``default_registry``. This :ref:`example - <overrides-custom-registry>` from the tutorial section may provide some - context. - """ - - @classmethod - def from_override_rules( - cls: Type[PageObjectRegistryTV], rules: List[OverrideRule] - ) -> PageObjectRegistryTV: - """An alternative constructor for creating a :class:`~.PageObjectRegistry` - instance by accepting a list of :class:`~.OverrideRule`. - - This is useful in cases wherein you need to store some selected rules - from multiple external packages. 
- """ - return cls({rule.use: rule for rule in rules}) - - def handle_urls( - self, - include: Strings, - *, - overrides: Type[ItemPage], - exclude: Optional[Strings] = None, - priority: int = 500, - **kwargs, - ): - """ - Class decorator that indicates that the decorated Page Object should be - used instead of the overridden one for a particular set the URLs. - - The Page Object that is **overridden** is declared using the ``overrides`` - parameter. - - The **override** mechanism only works on certain URLs that match the - ``include`` and ``exclude`` parameters. See the documentation of the - `url-matcher <https://url-matcher.readthedocs.io/>`_ package for more - information about them. - - Any extra parameters are stored as meta information that can be later used. - - :param include: The URLs that should be handled by the decorated Page Object. - :param overrides: The Page Object that should be `replaced`. - :param exclude: The URLs over which the override should **not** happen. - :param priority: The resolution priority in case of `conflicting` rules. - A conflict happens when the ``include``, ``override``, and ``exclude`` - parameters are the same. If so, the `highest priority` will be - chosen. - """ - - def wrapper(cls): - rule = OverrideRule( - for_patterns=Patterns( - include=as_list(include), - exclude=as_list(exclude), - priority=priority, - ), - use=cls, - instead_of=overrides, - meta=kwargs, - ) - # If it was already defined, we don't want to override it - if cls not in self: - self[cls] = rule - else: - warnings.warn( - f"Multiple @handle_urls annotations with the same 'overrides' " - f"are ignored in the same Registry. The following rule is " - f"ignored:\n{rule}", - stacklevel=2, - ) - - return cls - - return wrapper - - def get_overrides(self) -> List[OverrideRule]: - """Returns all of the :class:`~.OverrideRule` that were declared using - the ``@handle_urls`` annotation. - - .. 
warning:: - - Remember to consider calling :func:`~.web_poet.overrides.consume_modules` - beforehand to recursively import all submodules which contains the - ``@handle_urls`` annotations from external Page Objects. - """ - return list(self.values()) - - def search_overrides(self, **kwargs) -> List[OverrideRule]: - """Returns any :class:`OverrideRule` that has any of its attributes - match the rules inside the registry. - - Sample usage: - - .. code-block:: python - - rules = registry.search_overrides(use=ProductPO, instead_of=GenericPO) - print(len(rules)) # 1 - - """ - - # Short-circuit operation if "use" is the only search param used, since - # we know that it's being used as the dict key. - if {"use"} == kwargs.keys(): - rule = self.get(kwargs["use"]) - if rule: - return [rule] - return [] - - getter = attrgetter(*kwargs.keys()) - - def matcher(rule: OverrideRule): - attribs = getter(rule) - if not isinstance(attribs, tuple): - attribs = (attribs,) - if attribs == tuple(kwargs.values()): - return True - return False - - results = [] - for rule in self.get_overrides(): - if matcher(rule): - results.append(rule) - return results - - -def _walk_module(module: str) -> Iterable: - """Return all modules from a module recursively. - - Note that this will import all the modules and submodules. It returns the - provided module as well. - """ - - def onerror(err): - raise err # pragma: no cover - - spec = importlib.util.find_spec(module) - if not spec: - raise ImportError(f"Module {module} not found") - mod = importlib.import_module(spec.name) - yield mod - if spec.submodule_search_locations: - for info in pkgutil.walk_packages( - spec.submodule_search_locations, f"{spec.name}.", onerror - ): - mod = importlib.import_module(info.name) - yield mod - - -def consume_modules(*modules: str) -> None: - """This recursively imports all packages/modules so that the ``@handle_urls`` - annotation are properly discovered and imported. - - Let's take a look at an example: - - .. 
code-block:: python - - # FILE: my_page_obj_project/load_rules.py - - from web_poet import default_registry, consume_modules - - consume_modules("other_external_pkg.po", "another_pkg.lib") - rules = default_registry.get_overrides() - - For this case, the :class:`~.OverrideRule` are coming from: - - - ``my_page_obj_project`` `(since it's the same module as the file above)` - - ``other_external_pkg.po`` - - ``another_pkg.lib`` - - any other modules that was imported in the same process inside the - packages/modules above. - - If the ``default_registry`` had other ``@handle_urls`` annotations outside - of the packages/modules listed above, then the corresponding - :class:`~.OverrideRule` won't be returned. Unless, they were recursively - imported in some way similar to :func:`~.web_poet.overrides.consume_modules`. - """ - for module in modules: - gen = _walk_module(module) +from web_poet.rules import * # noqa: F401, F403 - # Inspired by itertools recipe: https://docs.python.org/3/library/itertools.html - # Using a deque() results in a tiny bit performance improvement that list(). - deque(gen, maxlen=0) +msg = "The 'web_poet.overrides' module has been moved into 'web_poet.rules'." 
+warnings.warn(msg, DeprecationWarning, stacklevel=2) diff --git a/web_poet/pages.py b/web_poet/pages.py index 5d268759..77b39af3 100644 --- a/web_poet/pages.py +++ b/web_poet/pages.py @@ -3,7 +3,7 @@ import attr -from web_poet._typing import get_generic_parameter +from web_poet._typing import get_item_cls from web_poet.fields import FieldsMixin, item_from_fields from web_poet.mixins import ResponseShortcutsMixin from web_poet.page_inputs import HttpResponse @@ -41,16 +41,13 @@ def is_injectable(cls: typing.Any) -> bool: class Returns(typing.Generic[ItemT]): - """Inherit from this generic mixin to change the item type used by + """Inherit from this generic mixin to change the item class used by :class:`~.ItemPage`""" @property def item_cls(self) -> typing.Type[ItemT]: """Item class""" - param = get_generic_parameter(self.__class__) - if isinstance(param, typing.TypeVar): # class is not parametrized - return dict # type: ignore[return-value] - return param + return get_item_cls(self.__class__, default=dict) class ItemPage(Injectable, Returns[ItemT]): diff --git a/web_poet/rules.py b/web_poet/rules.py new file mode 100644 index 00000000..b1fd5fbc --- /dev/null +++ b/web_poet/rules.py @@ -0,0 +1,360 @@ +from __future__ import annotations # https://www.python.org/dev/peps/pep-0563/ + +import importlib +import importlib.util +import pkgutil +import warnings +from collections import deque +from operator import attrgetter +from typing import Any, Dict, Iterable, List, Optional, Type, TypeVar, Union + +import attrs +from url_matcher import Patterns + +from web_poet._typing import get_item_cls +from web_poet.pages import ItemPage +from web_poet.utils import _create_deprecated_class, as_list, str_to_pattern + +Strings = Union[str, Iterable[str]] + +PageObjectRegistryTV = TypeVar("PageObjectRegistryTV", bound="PageObjectRegistry") + + +@attrs.define(frozen=True) +class ApplyRule: + """A rule that primarily applies Page Object and Item overrides for a given + URL pattern. 
+ + This is instantiated when using the :func:`web_poet.handle_urls` decorator. + It's also returned as a ``List[ApplyRule]`` when calling the + ``web_poet.default_registry``'s :meth:`~.PageObjectRegistry.get_rules` + method. + + You can access any of its attributes: + + * ``for_patterns`` - Contains the list of URL patterns associated with + this rule. You can read the API documentation of the `url-matcher + <https://url-matcher.readthedocs.io/>`_ package for more information + about the patterns. + * ``use`` - The Page Object that will be **used** in cases where the URL + pattern represented by the ``for_patterns`` attribute is matched. + * ``instead_of`` - *(optional)* The Page Object that will be **replaced** + with the Page Object specified via the ``use`` parameter. + * ``to_return`` - *(optional)* The item class that the Page Object specified + in ``use`` is capable of returning. + * ``meta`` - *(optional)* Any other information you may want to store. + This doesn't do anything for now but may be useful for future API updates. + + The main functionality of this class lies in the ``instead_of`` and ``to_return`` + parameters. Should both of these be omitted, then :class:`~.ApplyRule` simply + tags which URL patterns the given Page Object defined in ``use`` is expected + to be used on. + + When ``to_return`` is not None (e.g. ``to_return=MyItem``), + the Page Object in ``use`` is declared as capable of returning a certain + item class (i.e. ``MyItem``). + + When ``instead_of`` is not None (e.g. ``instead_of=ReplacedPageObject``), + the rule adds an expectation that the ``ReplacedPageObject`` wouldn't + be used for the URLs matching ``for_patterns``, since the Page Object + in ``use`` will replace it. + + If there are multiple rules which match a certain URL, the rule + to apply is picked based on the priorities set in ``for_patterns``.
+ + More information regarding its usage in :ref:`intro-overrides`. + + .. tip:: + + The :class:`~.ApplyRule` is also hashable. This makes it easy to store + unique rules and identify any duplicates. + """ + + for_patterns: Patterns = attrs.field(converter=str_to_pattern) + use: Type[ItemPage] = attrs.field(kw_only=True) + instead_of: Optional[Type[ItemPage]] = attrs.field(default=None, kw_only=True) + to_return: Optional[Type[Any]] = attrs.field(default=None, kw_only=True) + meta: Dict[str, Any] = attrs.field(factory=dict, kw_only=True) + + def __hash__(self): + return hash((self.for_patterns, self.use, self.instead_of, self.to_return)) + + +class PageObjectRegistry(dict): + """This contains the :class:`~.ApplyRule` that associates the Page Objects + alongside its Items for a given URL matching rule. + + PageObjectRegistry is a ``dict`` subclass with added functionalities on + storing, retrieving, and searching for the :class:`~.ApplyRule` instances. + The **value** represents the :class:`~.ApplyRule` instance from which the + Page Object in the **key** is allowed to be used. Since it's essentially a + ``dict``, you can use any ``dict`` operations with it. + + ``web-poet`` already provides a default Registry named ``default_registry`` + for convenience. It can be directly accessed via: + + .. code-block:: python + + from web_poet import handle_urls, default_registry, WebPage + + @handle_urls("example.com", instead_of=ProductPageObject) + class ExampleComProductPage(WebPage[ProductItem]): + ... + + override_rules = default_registry.get_rules() + + Notice that the ``@handle_urls`` decorator that we're using is a part of the + ``default_registry``. This provides a shorter and quicker way to interact + with the built-in default :class:`~.PageObjectRegistry` instead of writing + the longer ``@default_registry.handle_urls``. + + .. 
note:: + + It is encouraged to simply use and import the already existing registry + via ``from web_poet import default_registry`` instead of creating your + own :class:`~.PageObjectRegistry` instance. Using multiple registries + would be unwieldy in most cases. + + However, it might be applicable in certain scenarios like storing custom + rules to separate them from the ``default_registry``. This :ref:`example + <overrides-custom-registry>` from the tutorial section may provide some + context. + """ + + @classmethod + def from_apply_rules( + cls: Type[PageObjectRegistryTV], rules: List[ApplyRule] + ) -> PageObjectRegistryTV: + """An alternative constructor for creating a :class:`~.PageObjectRegistry` + instance by accepting a list of :class:`~.ApplyRule`. + + This is useful in cases wherein you need to store some selected rules + from multiple external packages. See this :ref:`example + <overrides-custom-registry>`. + """ + return cls({rule.use: rule for rule in rules}) + + @classmethod + def from_override_rules( + cls: Type[PageObjectRegistryTV], rules: List[ApplyRule] + ) -> PageObjectRegistryTV: + """Deprecated. Use :meth:`~.PageObjectRegistry.from_apply_rules` instead.""" + msg = ( + "The 'from_override_rules' method is deprecated. " + "Use 'from_apply_rules' instead." + ) + warnings.warn(msg, DeprecationWarning, stacklevel=2) + return cls.from_apply_rules(rules) + + def handle_urls( + self, + include: Strings, + *, + overrides: Optional[Type[ItemPage]] = None, + instead_of: Optional[Type[ItemPage]] = None, + to_return: Optional[Type] = None, + exclude: Optional[Strings] = None, + priority: int = 500, + **kwargs, + ): + """ + Class decorator that indicates that the decorated Page Object should work + for the given URL patterns. + + The URL patterns are matched using the ``include`` and ``exclude`` + parameters while ``priority`` breaks any ties. 
See the documentation + of the `url-matcher <https://url-matcher.readthedocs.io/>`_ package for + more information about them. + + This decorator is able to derive the item class returned by the Page + Object (see the :ref:`item-class-example` section for some examples). This is + important since it marks what type of item the Page Object is capable of + returning for the given URL patterns. For certain advanced cases, you can + pass a ``to_return`` parameter which replaces any derived values (though + this isn't generally recommended). + + Passing another Page Object into the ``instead_of`` parameter indicates + that the decorated Page Object will be used instead of it for the given + set of URL patterns. This is the concept of **overrides** (see the + :ref:`intro-overrides` section for more info). + + Any extra parameters are stored as meta information that can be used later. + + :param include: The URLs that should be handled by the decorated Page Object. + :param instead_of: The Page Object that should be `replaced`. + :param to_return: The item class holding the data returned by the Page Object. + This could be omitted as it could be derived from the ``Returns[ItemClass]`` + or ``ItemPage[ItemClass]`` declaration of the Page Object. See the + :ref:`item-classes` section; a code example is in the :ref:`combination` subsection. + :param exclude: The URLs over which the override should **not** happen. + :param priority: The resolution priority in case of `conflicting` rules. + A conflict happens when the ``include``, ``instead_of``, and ``exclude`` + parameters are the same. If so, the rule with the `highest priority` is + chosen. + """ + + def wrapper(cls): + + if overrides is not None: + msg = ( + "The 'overrides' parameter in @handle_urls is deprecated. " + "Use the 'instead_of' parameter instead. If both 'instead_of' " + "and 'overrides' are provided, the latter is ignored." 
+ ) + warnings.warn(msg, DeprecationWarning, stacklevel=2) + + rule = ApplyRule( + for_patterns=Patterns( + include=as_list(include), + exclude=as_list(exclude), + priority=priority, + ), + use=cls, + instead_of=instead_of or overrides, + to_return=to_return or get_item_cls(cls), + meta=kwargs, + ) + # If it was already defined, we don't want to override it + if cls not in self: + self[cls] = rule + else: + warnings.warn( + f"Multiple @handle_urls decorators for the same Page Object " + f"are ignored in the same Registry. The following rule is " + f"ignored:\n{rule}", + stacklevel=2, + ) + + return cls + + return wrapper + + def get_rules(self) -> List[ApplyRule]: + """Return all the :class:`~.ApplyRule` instances that were declared using + the ``@handle_urls`` decorator. + + .. note:: + + Consider calling :func:`~.web_poet.rules.consume_modules` + beforehand to recursively import all submodules that contain the + ``@handle_urls`` decorators from external Page Objects. + """ + return list(self.values()) + + def get_overrides(self) -> List[ApplyRule]: + """Deprecated, use :meth:`~.PageObjectRegistry.get_rules` instead.""" + msg = "The 'get_overrides' method is deprecated. Use 'get_rules' instead." + warnings.warn(msg, DeprecationWarning, stacklevel=2) + return self.get_rules() + + def search_rules(self, **kwargs) -> List[ApplyRule]: + """Return any :class:`~.ApplyRule` from the registry that matches all + of the provided attributes. + + Sample usage: + + .. code-block:: python + + rules = registry.search_rules(use=ProductPO, instead_of=GenericPO) + print(len(rules)) # 1 + print(rules[0].use) # ProductPO + print(rules[0].instead_of) # GenericPO + + """ + + # Short-circuit operation if "use" is the only search param used, since + # we know that it's being used as the dict key. 
+ if {"use"} == kwargs.keys(): + rule = self.get(kwargs["use"]) + if rule: + return [rule] + return [] + + getter = attrgetter(*kwargs.keys()) + + def matcher(rule: ApplyRule): + attribs = getter(rule) + if not isinstance(attribs, tuple): + attribs = (attribs,) + if attribs == tuple(kwargs.values()): + return True + return False + + results = [] + for rule in self.get_rules(): + if matcher(rule): + results.append(rule) + return results + + def search_overrides(self, **kwargs) -> List[ApplyRule]: + """Deprecated, use :meth:`~.PageObjectRegistry.search_rules` instead.""" + msg = ( + "The 'search_overrides' method is deprecated. " + "Use 'search_rules' instead." + ) + warnings.warn(msg, DeprecationWarning, stacklevel=2) + return self.search_rules(**kwargs) + + +def _walk_module(module: str) -> Iterable: + """Return all modules from a module recursively. + + Note that this will import all the modules and submodules. It returns the + provided module as well. + """ + + def onerror(err): + raise err # pragma: no cover + + spec = importlib.util.find_spec(module) + if not spec: + raise ImportError(f"Module {module} not found") + mod = importlib.import_module(spec.name) + yield mod + if spec.submodule_search_locations: + for info in pkgutil.walk_packages( + spec.submodule_search_locations, f"{spec.name}.", onerror + ): + mod = importlib.import_module(info.name) + yield mod + + +def consume_modules(*modules: str) -> None: + """This recursively imports all packages/modules so that the ``@handle_urls`` + decorators are properly discovered and imported. + + Let's take a look at an example: + + .. 
code-block:: python + + # FILE: my_page_obj_project/load_rules.py + + from web_poet import default_registry, consume_modules + + consume_modules("other_external_pkg.po", "another_pkg.lib") + rules = default_registry.get_rules() + + In this case, the :class:`~.ApplyRule` instances come from: + + - ``my_page_obj_project`` `(since it's the same module as the file above)` + - ``other_external_pkg.po`` + - ``another_pkg.lib`` + - any other modules that were imported in the same process inside the + packages/modules above. + + If the ``default_registry`` had other ``@handle_urls`` decorators outside of + the packages/modules listed above, then the corresponding :class:`~.ApplyRule` + won't be returned unless they were recursively imported in some way similar + to :func:`~.web_poet.rules.consume_modules`. + """ + + for module in modules: + gen = _walk_module(module) + + # Inspired by an itertools recipe: https://docs.python.org/3/library/itertools.html + # Using a deque() gives a slight performance improvement over list(). + deque(gen, maxlen=0) + + +OverrideRule = _create_deprecated_class("OverrideRule", ApplyRule, warn_once=False) diff --git a/web_poet/utils.py b/web_poet/utils.py index 00fe8166..14b5ce05 100644 --- a/web_poet/utils.py +++ b/web_poet/utils.py @@ -3,10 +3,11 @@ from collections.abc import Iterable from functools import lru_cache, wraps from types import MethodType -from typing import Any, List, Optional +from typing import Any, List, Optional, Union from warnings import warn from async_lru import alru_cache +from url_matcher import Patterns def _clspath(cls, forced=None): @@ -230,3 +231,9 @@ async def ensure_awaitable(obj): if inspect.isawaitable(obj): return await obj return obj + + +def str_to_pattern(url_pattern: Union[str, Patterns]) -> Patterns: + if isinstance(url_pattern, Patterns): + return url_pattern + return Patterns([url_pattern])
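The ``str_to_pattern`` helper above is what lets ``ApplyRule`` accept a plain string for ``for_patterns``. The following is a minimal, self-contained sketch of that behavior. The ``Patterns`` class here is a hypothetical stand-in, not the real ``url_matcher.Patterns`` (whose constructor and field types may differ), so the sketch runs without third-party packages; it also wraps the string in a tuple rather than the list used in the diff.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Patterns:
    """Hypothetical stand-in for url_matcher.Patterns (illustration only)."""

    include: Tuple[str, ...]
    exclude: Tuple[str, ...] = ()
    priority: int = 500


def str_to_pattern(url_pattern: Union[str, Patterns]) -> Patterns:
    # A Patterns instance is handed back unchanged.
    if isinstance(url_pattern, Patterns):
        return url_pattern
    # A bare string becomes a single-entry include pattern.
    return Patterns((url_pattern,))


# A plain string is wrapped, picking up the default exclude and priority.
from_string = str_to_pattern("example.com")

# An explicit Patterns object passes through as the same object.
explicit = Patterns(("example.com",), exclude=("example.com/blog",), priority=600)
passthrough = str_to_pattern(explicit)
```

Because the converter is idempotent, it is safe to apply to every value assigned to ``for_patterns``, whether the caller passed a string or an already-built pattern object.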