Merge 774619c into 399371f
BurnzZ authored Jan 30, 2023
2 parents 399371f + 774619c commit 1caa0de
Showing 32 changed files with 2,126 additions and 405 deletions.
122 changes: 122 additions & 0 deletions CHANGELOG.rst
@@ -2,6 +2,128 @@
Changelog
=========

TBR
---

* Added support for item classes which are used as dependencies in page objects
and spider callbacks. The following is now possible:

.. code-block:: python

    import attrs
    import scrapy
    from web_poet import WebPage, handle_urls, field
    from scrapy_poet import DummyResponse


    @attrs.define
    class Image:
        url: str


    @handle_urls("example.com")
    class ProductImagePage(WebPage[Image]):
        @field
        def url(self) -> str:
            return self.css("#product img ::attr(href)").get("")


    @attrs.define
    class Product:
        name: str
        image: Image


    @handle_urls("example.com")
    @attrs.define
    class ProductPage(WebPage[Product]):
        # ✨ NEW: Notice that the page object can ask for items as dependencies.
        # An instance of ``Image`` is injected behind the scenes by calling the
        # ``.to_item()`` method of ``ProductImagePage``.
        image_item: Image

        @field
        def name(self) -> str:
            return self.css("h1.name ::text").get("")

        @field
        def image(self) -> Image:
            return self.image_item


    class MySpider(scrapy.Spider):
        name = "myspider"

        def start_requests(self):
            yield scrapy.Request(
                "https://example.com/products/some-product", self.parse
            )

        # ✨ NEW: Notice that we're directly using the item here and not the
        # page object.
        def parse(self, response: DummyResponse, item: Product):
            return item

In line with this, the following new features were added:

* Added a new :class:`scrapy_poet.page_input_providers.ItemProvider` which
makes the usage above possible.

* An item class is now supported by :func:`scrapy_poet.callback_for`
alongside the usual page objects. This means that it won't raise a
:class:`TypeError` anymore when not passing a subclass of
:class:`web_poet.pages.ItemPage`.
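
A minimal sketch of this, reusing the ``Product`` item class from the
example above (the spider name and URL are illustrative):

.. code-block:: python

    import scrapy
    from scrapy_poet import callback_for


    class ProductSpider(scrapy.Spider):
        name = "product_spider"
        start_urls = ["https://example.com/products/some-product"]

        # ``callback_for`` now accepts an item class such as ``Product``,
        # not just ``web_poet.pages.ItemPage`` subclasses.
        parse = callback_for(Product)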

* New exception: :class:`scrapy_poet.injection_errors.ProviderDependencyDeadlockError`.
This is raised when it's not possible to create the dependencies due to
a deadlock in their sub-dependencies, e.g. due to a circular dependency
between page objects.
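
For illustration, a hypothetical pair of page objects whose items depend on
each other, which would trigger this error (all names here are made up):

.. code-block:: python

    import attrs
    from web_poet import ItemPage, handle_urls


    @attrs.define
    class AItem:
        name: str


    @attrs.define
    class BItem:
        name: str


    @handle_urls("example.com")
    @attrs.define
    class APage(ItemPage[AItem]):
        b: BItem  # APage needs BPage's item...


    @handle_urls("example.com")
    @attrs.define
    class BPage(ItemPage[BItem]):
        a: AItem  # ...while BPage needs APage's item: a circular dependency.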

* Moved some of the utility functions from the test module into
``scrapy_poet.utils.testing``.

* Documentation improvements.

* Deprecations:

* The ``SCRAPY_POET_OVERRIDES`` setting has been replaced by
``SCRAPY_POET_RULES``.
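
Migration is a rename; for example, in ``settings.py`` (a sketch, assuming
the rules come from web-poet's default registry as shown in the tutorial):

.. code-block:: python

    from web_poet import default_registry

    # Previously: SCRAPY_POET_OVERRIDES = default_registry.get_rules()
    SCRAPY_POET_RULES = default_registry.get_rules()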

* Backward incompatible changes:

* Overriding the default registry used via ``SCRAPY_POET_OVERRIDES_REGISTRY``
is not possible anymore.

* The following type aliases have been removed:

* ``scrapy_poet.overrides.RuleAsTuple``
* ``scrapy_poet.overrides.RuleFromUser``

* The :class:`scrapy_poet.page_input_providers.PageObjectInputProvider` base
class has these changes:

* It now accepts an instance of :class:`scrapy_poet.injection.Injector`
  in its constructor instead of :class:`scrapy.crawler.Crawler`. You can
  still access the :class:`scrapy.crawler.Crawler` via the
  ``Injector.crawler`` attribute, as sketched below.

* :meth:`scrapy_poet.page_input_providers.PageObjectInputProvider.is_provided`
is now an instance method instead of a class method.
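
A sketch of adapting a custom provider to the constructor change
(``MyProvider`` is hypothetical, and it assumes the base class keeps a
reference to the injector):

.. code-block:: python

    from scrapy_poet.page_input_providers import PageObjectInputProvider


    class MyProvider(PageObjectInputProvider):
        def __init__(self, injector):
            super().__init__(injector)
            # The crawler is still reachable through the injector.
            self.crawler = injector.crawler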

* The :class:`scrapy_poet.injection.Injector`'s attribute and constructor
parameter called ``overrides_registry`` is now simply called ``registry``.

* The ``scrapy_poet.overrides`` module which contained ``OverridesRegistryBase``
and ``OverridesRegistry`` has now been removed. Instead, scrapy-poet directly
uses :class:`web_poet.rules.RulesRegistry`.

Everything should work pretty much the same, except that
:meth:`web_poet.rules.RulesRegistry.overrides_for` now accepts :class:`str`,
:class:`web_poet.page_inputs.http.RequestUrl`, or
:class:`web_poet.page_inputs.http.ResponseUrl` instead of
:class:`scrapy.http.Request`, as sketched after this list.

* This also means that the registry doesn't accept tuples as rules anymore.
Only :class:`web_poet.rules.ApplyRule` instances are allowed. The same goes
for ``SCRAPY_POET_RULES`` (and the deprecated ``SCRAPY_POET_OVERRIDES``).
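
A sketch of the new ``overrides_for`` call (the URL is illustrative):

.. code-block:: python

    from web_poet import default_registry

    # Accepts a str, RequestUrl, or ResponseUrl instead of scrapy.http.Request.
    overrides = default_registry.overrides_for("https://example.com/products/1")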


0.8.0 (2023-01-24)
------------------

8 changes: 1 addition & 7 deletions docs/api_reference.rst
@@ -14,7 +14,7 @@ API
Injection Middleware
====================

-.. automodule:: scrapy_poet.middleware
+.. automodule:: scrapy_poet.downloadermiddlewares
:members:

Page Input Providers
@@ -43,9 +43,3 @@ Injection errors

.. automodule:: scrapy_poet.injection_errors
:members:

-Overrides
-=========
-
-.. automodule:: scrapy_poet.overrides
-   :members:
2 changes: 1 addition & 1 deletion docs/index.rst
@@ -43,7 +43,7 @@ To get started, see :ref:`intro-install` and :ref:`intro-tutorial`.
:caption: Advanced
:maxdepth: 1

-overrides
+rules-from-web-poet
providers
testing

50 changes: 16 additions & 34 deletions docs/intro/basic-tutorial.rst
@@ -414,17 +414,17 @@ The spider won't work anymore after the change. The reason is that it
is using the new base Page Objects and they are empty.
Let's fix it by instructing ``scrapy-poet`` to use the Books To Scrape (BTS)
Page Objects for URLs belonging to the domain ``toscrape.com``. This must
-be done by configuring ``SCRAPY_POET_OVERRIDES`` into ``settings.py``:
+be done by configuring ``SCRAPY_POET_RULES`` in ``settings.py``:

.. code-block:: python

-   "SCRAPY_POET_OVERRIDES": [
+   "SCRAPY_POET_RULES": [
        ("toscrape.com", BTSBookListPage, BookListPage),
        ("toscrape.com", BTSBookPage, BookPage)
    ]
The spider is back to life!
-``SCRAPY_POET_OVERRIDES`` contain rules that overrides the Page Objects
+``SCRAPY_POET_RULES`` contains rules that override the Page Objects
used for a particular domain. In this particular case, Page Objects
``BTSBookListPage`` and ``BTSBookPage`` will be used instead of
``BookListPage`` and ``BookPage`` for any request whose domain is
Expand Down Expand Up @@ -465,16 +465,18 @@ to implement new ones:
The last step is configuring the overrides so that these new Page Objects
are used for the domain
-``bookpage.com``. This is how ``SCRAPY_POET_OVERRIDES`` should look like into
+``bookpage.com``. This is how ``SCRAPY_POET_RULES`` should look in
``settings.py``:

.. code-block:: python
"SCRAPY_POET_OVERRIDES": [
("toscrape.com", BTSBookListPage, BookListPage),
("toscrape.com", BTSBookPage, BookPage),
("bookpage.com", BPBookListPage, BookListPage),
("bookpage.com", BPBookPage, BookPage)
from web_poet import ApplyRule
"SCRAPY_POET_RULES": [
ApplyRule("toscrape.com", use=BTSBookListPage, instead_of=BookListPage),
ApplyRule("toscrape.com", use=BTSBookPage, instead_of=BookPage),
ApplyRule("bookpage.com", use=BPBookListPage, instead_of=BookListPage),
ApplyRule("bookpage.com", use=BPBookPage, instead_of=BookPage)
]
The spider is now ready to extract books from both sites 😀.

@@ -490,27 +492,6 @@ for a particular domain, but more complex URL patterns are also possible.
Expand All @@ -490,27 +492,6 @@ for a particular domain, but more complex URL patterns are also possible.
For example, the pattern ``books.toscrape.com/catalogue/category/``
is accepted and it would restrict the override only to category pages.
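
A sketch of that restriction as an :py:class:`web_poet.rules.ApplyRule`
with such a pattern (page object names as in this tutorial):

.. code-block:: python

    from web_poet import ApplyRule

    SCRAPY_POET_RULES = [
        # Only category listing pages get the BTS list page object.
        ApplyRule(
            "books.toscrape.com/catalogue/category/",
            use=BTSBookListPage,
            instead_of=BookListPage,
        ),
    ]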

-It is even possible to configure more complex patterns by using the
-:py:class:`web_poet.rules.ApplyRule` class instead of a triplet in
-the configuration. Another way of declaring the earlier config
-for ``SCRAPY_POET_OVERRIDES`` would be the following:
-
-.. code-block:: python
-
-    from url_matcher import Patterns
-    from web_poet import ApplyRule
-
-    SCRAPY_POET_OVERRIDES = [
-        ApplyRule(for_patterns=Patterns(["toscrape.com"]), use=BTSBookListPage, instead_of=BookListPage),
-        ApplyRule(for_patterns=Patterns(["toscrape.com"]), use=BTSBookPage, instead_of=BookPage),
-        ApplyRule(for_patterns=Patterns(["bookpage.com"]), use=BPBookListPage, instead_of=BookListPage),
-        ApplyRule(for_patterns=Patterns(["bookpage.com"]), use=BPBookPage, instead_of=BookPage),
-    ]
-
-As you can see, this could get verbose. The earlier tuple config simply offers
-a shortcut to be more concise.

.. note::

Also see the `url-matcher <https://url-matcher.readthedocs.io/en/stable/>`_
Expand All @@ -530,11 +511,11 @@ and store the :py:class:`web_poet.rules.ApplyRule` for you. All of the
# rules from other packages. Otherwise, it can be omitted.
# More info about this caveat on web-poet docs.
consume_modules("external_package_A", "another_ext_package.lib")
-SCRAPY_POET_OVERRIDES = default_registry.get_rules()
+SCRAPY_POET_RULES = default_registry.get_rules()
For more info on this, you can refer to these docs:

-* ``scrapy-poet``'s :ref:`overrides` Tutorial section.
+* ``scrapy-poet``'s :ref:`rules-from-web-poet` Tutorial section.
* External `web-poet`_ docs.

* Specifically, the :external:ref:`rules-intro` Tutorial section.
@@ -545,7 +526,8 @@ Next steps
Now that you know how ``scrapy-poet`` is supposed to work, what about trying to
apply it to an existing or new Scrapy project?

-Also, please check the :ref:`overrides` and :ref:`providers` sections as well as
-refer to spiders in the "example" folder: https://github.com/scrapinghub/scrapy-poet/tree/master/example/example/spiders
+Also, please check the :ref:`rules-from-web-poet` and :ref:`providers` sections
+as well as refer to spiders in the "example" folder:
+https://github.com/scrapinghub/scrapy-poet/tree/master/example/example/spiders

.. _Scrapy Tutorial: https://docs.scrapy.org/en/latest/intro/tutorial.html