Merge 774619c into 399371f
BurnzZ authored Jan 30, 2023
2 parents 399371f + 774619c commit 1caa0de
Showing 32 changed files with 2,126 additions and 405 deletions.
122 changes: 122 additions & 0 deletions CHANGELOG.rst
@@ -2,6 +2,128 @@
Changelog
=========

TBR
---

* Added support for item classes which are used as dependencies in page objects
and spider callbacks. The following is now possible:

.. code-block:: python

    import attrs
    import scrapy
    from web_poet import WebPage, handle_urls, field
    from scrapy_poet import DummyResponse


    @attrs.define
    class Image:
        url: str


    @handle_urls("example.com")
    class ProductImagePage(WebPage[Image]):
        @field
        def url(self) -> str:
            return self.css("#product img ::attr(href)").get("")


    @attrs.define
    class Product:
        name: str
        image: Image


    @handle_urls("example.com")
    @attrs.define
    class ProductPage(WebPage[Product]):
        # ✨ NEW: Notice that the page object can ask for items as dependencies.
        # An instance of ``Image`` is injected behind the scenes by calling the
        # ``.to_item()`` method of ``ProductImagePage``.
        image_item: Image

        @field
        def name(self) -> str:
            return self.css("h1.name ::text").get("")

        @field
        def image(self) -> Image:
            return self.image_item


    class MySpider(scrapy.Spider):
        name = "myspider"

        def start_requests(self):
            yield scrapy.Request(
                "https://example.com/products/some-product", self.parse
            )

        # ✨ NEW: Notice that we're directly using the item here and not the
        # page object.
        def parse(self, response: DummyResponse, item: Product):
            return item

In line with this, the following new features were added:

* Added a new :class:`scrapy_poet.page_input_providers.ItemProvider` which
makes the usage above possible.

* An item class is now supported by :func:`scrapy_poet.callback_for`
alongside the usual page objects. This means that it won't raise a
:class:`TypeError` anymore when not passing a subclass of
:class:`web_poet.pages.ItemPage`.
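
A minimal sketch of this, reusing the ``Product`` item class from the
example above (the spider name and URL are illustrative):

.. code-block:: python

    import scrapy
    from scrapy_poet import callback_for


    class ProductSpider(scrapy.Spider):
        name = "product_spider"
        start_urls = ["https://example.com/products/some-product"]

        # ``callback_for`` now accepts an item class such as ``Product``,
        # not just ``web_poet.pages.ItemPage`` subclasses.
        parse = callback_for(Product)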

* New exception: :class:`scrapy_poet.injection_errors.ProviderDependencyDeadlockError`.
This is raised when it's not possible to create the dependencies due to
a deadlock in their sub-dependencies, e.g. due to a circular dependency
between page objects.
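
For illustration, a hypothetical pair of page objects whose items depend on
each other, which would trigger this error (all names here are made up):

.. code-block:: python

    import attrs
    from web_poet import ItemPage, handle_urls


    @attrs.define
    class AItem:
        name: str


    @attrs.define
    class BItem:
        name: str


    @handle_urls("example.com")
    @attrs.define
    class APage(ItemPage[AItem]):
        b: BItem  # APage needs BPage's item...


    @handle_urls("example.com")
    @attrs.define
    class BPage(ItemPage[BItem]):
        a: AItem  # ...while BPage needs APage's item: a circular dependency.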

* Moved some of the utility functions from the test module into
``scrapy_poet.utils.testing``.

* Documentation improvements.

* Deprecations:

* The ``SCRAPY_POET_OVERRIDES`` setting has been replaced by
``SCRAPY_POET_RULES``.
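
Migration is a rename; for example, in ``settings.py`` (a sketch, assuming
the rules come from web-poet's default registry as shown in the tutorial):

.. code-block:: python

    from web_poet import default_registry

    # Previously: SCRAPY_POET_OVERRIDES = default_registry.get_rules()
    SCRAPY_POET_RULES = default_registry.get_rules()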

* Backward incompatible changes:

* Overriding the default registry used via ``SCRAPY_POET_OVERRIDES_REGISTRY``
is not possible anymore.

* The following type aliases have been removed:

* ``scrapy_poet.overrides.RuleAsTuple``
* ``scrapy_poet.overrides.RuleFromUser``

* The :class:`scrapy_poet.page_input_providers.PageObjectInputProvider` base
class has these changes:

* It now accepts an instance of :class:`scrapy_poet.injection.Injector`
  in its constructor instead of :class:`scrapy.crawler.Crawler`. You can
  still access the :class:`scrapy.crawler.Crawler` via the
  ``Injector.crawler`` attribute, as sketched below.

* :meth:`scrapy_poet.page_input_providers.PageObjectInputProvider.is_provided`
is now an instance method instead of a class method.
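
A sketch of adapting a custom provider to the constructor change
(``MyProvider`` is hypothetical, and it assumes the base class keeps a
reference to the injector):

.. code-block:: python

    from scrapy_poet.page_input_providers import PageObjectInputProvider


    class MyProvider(PageObjectInputProvider):
        def __init__(self, injector):
            super().__init__(injector)
            # The crawler is still reachable through the injector.
            self.crawler = injector.crawler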

* The :class:`scrapy_poet.injection.Injector`'s attribute and constructor
parameter called ``overrides_registry`` is now simply called ``registry``.

* The ``scrapy_poet.overrides`` module which contained ``OverridesRegistryBase``
and ``OverridesRegistry`` has now been removed. Instead, scrapy-poet directly
uses :class:`web_poet.rules.RulesRegistry`.

Everything should work pretty much the same, except that
:meth:`web_poet.rules.RulesRegistry.overrides_for` now accepts :class:`str`,
:class:`web_poet.page_inputs.http.RequestUrl`, or
:class:`web_poet.page_inputs.http.ResponseUrl` instead of
:class:`scrapy.http.Request`, as sketched after this list.

* This also means that the registry doesn't accept tuples as rules anymore.
Only :class:`web_poet.rules.ApplyRule` instances are allowed. The same goes
for ``SCRAPY_POET_RULES`` (and the deprecated ``SCRAPY_POET_OVERRIDES``).
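
A sketch of the new ``overrides_for`` call (the URL is illustrative):

.. code-block:: python

    from web_poet import default_registry

    # Accepts a str, RequestUrl, or ResponseUrl instead of scrapy.http.Request.
    overrides = default_registry.overrides_for("https://example.com/products/1")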


0.8.0 (2023-01-24)
------------------

8 changes: 1 addition & 7 deletions docs/api_reference.rst
@@ -14,7 +14,7 @@ API
Injection Middleware
====================

-.. automodule:: scrapy_poet.middleware
+.. automodule:: scrapy_poet.downloadermiddlewares
:members:

Page Input Providers
@@ -43,9 +43,3 @@ Injection errors

.. automodule:: scrapy_poet.injection_errors
:members:

-Overrides
-=========
-
-.. automodule:: scrapy_poet.overrides
-   :members:
2 changes: 1 addition & 1 deletion docs/index.rst
@@ -43,7 +43,7 @@ To get started, see :ref:`intro-install` and :ref:`intro-tutorial`.
:caption: Advanced
:maxdepth: 1

-overrides
+rules-from-web-poet
providers
testing

50 changes: 16 additions & 34 deletions docs/intro/basic-tutorial.rst
@@ -414,17 +414,17 @@ The spider won't work anymore after the change. The reason is that it
is using the new base Page Objects and they are empty.
Let's fix it by instructing ``scrapy-poet`` to use the Books To Scrape (BTS)
Page Objects for URLs belonging to the domain ``toscrape.com``. This must
-be done by configuring ``SCRAPY_POET_OVERRIDES`` into ``settings.py``:
+be done by configuring ``SCRAPY_POET_RULES`` in ``settings.py``:

.. code-block:: python

-   "SCRAPY_POET_OVERRIDES": [
+   "SCRAPY_POET_RULES": [
        ("toscrape.com", BTSBookListPage, BookListPage),
        ("toscrape.com", BTSBookPage, BookPage)
    ]
The spider is back to life!
-``SCRAPY_POET_OVERRIDES`` contain rules that overrides the Page Objects
+``SCRAPY_POET_RULES`` contains rules that override the Page Objects
used for a particular domain. In this particular case, Page Objects
``BTSBookListPage`` and ``BTSBookPage`` will be used instead of
``BookListPage`` and ``BookPage`` for any request whose domain is
Expand Down Expand Up @@ -465,16 +465,18 @@ to implement new ones:
The last step is configuring the overrides so that these new Page Objects
are used for the domain
-``bookpage.com``. This is how ``SCRAPY_POET_OVERRIDES`` should look like into
+``bookpage.com``. This is how ``SCRAPY_POET_RULES`` should look in
``settings.py``:

.. code-block:: python
"SCRAPY_POET_OVERRIDES": [
("toscrape.com", BTSBookListPage, BookListPage),
("toscrape.com", BTSBookPage, BookPage),
("bookpage.com", BPBookListPage, BookListPage),
("bookpage.com", BPBookPage, BookPage)
from web_poet import ApplyRule
"SCRAPY_POET_RULES": [
ApplyRule("toscrape.com", use=BTSBookListPage, instead_of=BookListPage),
ApplyRule("toscrape.com", use=BTSBookPage, instead_of=BookPage),
ApplyRule("bookpage.com", use=BPBookListPage, instead_of=BookListPage),
ApplyRule("bookpage.com", use=BPBookPage, instead_of=BookPage)
]
The spider is now ready to extract books from both sites 😀.

@@ -490,27 +492,6 @@ for a particular domain, but more complex URL patterns are also possible.
Expand All @@ -490,27 +492,6 @@ for a particular domain, but more complex URL patterns are also possible.
For example, the pattern ``books.toscrape.com/catalogue/category/``
is accepted and it would restrict the override only to category pages.
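
A sketch of that restriction as an :py:class:`web_poet.rules.ApplyRule`
with such a pattern (page object names as in this tutorial):

.. code-block:: python

    from web_poet import ApplyRule

    SCRAPY_POET_RULES = [
        # Only category listing pages get the BTS list page object.
        ApplyRule(
            "books.toscrape.com/catalogue/category/",
            use=BTSBookListPage,
            instead_of=BookListPage,
        ),
    ]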

-It is even possible to configure more complex patterns by using the
-:py:class:`web_poet.rules.ApplyRule` class instead of a triplet in
-the configuration. Another way of declaring the earlier config
-for ``SCRAPY_POET_OVERRIDES`` would be the following:
-
-.. code-block:: python
-
-    from url_matcher import Patterns
-    from web_poet import ApplyRule
-
-    SCRAPY_POET_OVERRIDES = [
-        ApplyRule(for_patterns=Patterns(["toscrape.com"]), use=BTSBookListPage, instead_of=BookListPage),
-        ApplyRule(for_patterns=Patterns(["toscrape.com"]), use=BTSBookPage, instead_of=BookPage),
-        ApplyRule(for_patterns=Patterns(["bookpage.com"]), use=BPBookListPage, instead_of=BookListPage),
-        ApplyRule(for_patterns=Patterns(["bookpage.com"]), use=BPBookPage, instead_of=BookPage),
-    ]
-
-As you can see, this could get verbose. The earlier tuple config simply offers
-a shortcut to be more concise.

.. note::

Also see the `url-matcher <https://url-matcher.readthedocs.io/en/stable/>`_
Expand All @@ -530,11 +511,11 @@ and store the :py:class:`web_poet.rules.ApplyRule` for you. All of the
# rules from other packages. Otherwise, it can be omitted.
# More info about this caveat on web-poet docs.
consume_modules("external_package_A", "another_ext_package.lib")
-SCRAPY_POET_OVERRIDES = default_registry.get_rules()
+SCRAPY_POET_RULES = default_registry.get_rules()
For more info on this, you can refer to these docs:

-* ``scrapy-poet``'s :ref:`overrides` Tutorial section.
+* ``scrapy-poet``'s :ref:`rules-from-web-poet` Tutorial section.
* External `web-poet`_ docs.

* Specifically, the :external:ref:`rules-intro` Tutorial section.
@@ -545,7 +526,8 @@ Next steps
Now that you know how ``scrapy-poet`` is supposed to work, what about trying to
apply it to an existing or new Scrapy project?

-Also, please check the :ref:`overrides` and :ref:`providers` sections as well as
-refer to spiders in the "example" folder: https://github.com/scrapinghub/scrapy-poet/tree/master/example/example/spiders
+Also, please check the :ref:`rules-from-web-poet` and :ref:`providers` sections
+as well as refer to spiders in the "example" folder:
+https://github.com/scrapinghub/scrapy-poet/tree/master/example/example/spiders

.. _Scrapy Tutorial: https://docs.scrapy.org/en/latest/intro/tutorial.html