Supporting to_return in web-poet rules #88

Merged on Jan 30, 2023 (80 commits)
a638aec
initial integration of to_return from web_poet
BurnzZ Oct 12, 2022
ee30808
fix tests regarding expectations for param in rule
BurnzZ Oct 13, 2022
0452173
warn the user when the same URL pattern is present in the rule
BurnzZ Oct 13, 2022
e51a63d
add test case for when 'instead_of' and 'to_return' are both present
BurnzZ Oct 19, 2022
6c55de0
simplify tests and assert injected dependencies in the callback
BurnzZ Oct 31, 2022
3117530
add test case focusing on URL presence in the rules
BurnzZ Nov 1, 2022
3a69c83
properly test UndeclaredProvidedTypeError
BurnzZ Nov 1, 2022
a38cb06
refactor solution to resolve item dependencies using providers
BurnzZ Nov 3, 2022
4134457
fix typing for callback_for()
BurnzZ Nov 3, 2022
213549a
Merge branch 'master' of ssh://github.com/scrapinghub/scrapy-poet int…
BurnzZ Nov 22, 2022
9a00b63
move test utilies into scrapy_poet/utils/
BurnzZ Nov 23, 2022
49136cb
create recursive dependency resolution
BurnzZ Nov 24, 2022
a2260d7
add more test cases
BurnzZ Nov 29, 2022
9816f42
update ItemProvider to dynamically handle its dependency signature
BurnzZ Nov 30, 2022
86b7a97
code cleanup and fix some tests
BurnzZ Nov 30, 2022
7b8c7f2
Merge branch 'master' of ssh://github.com/scrapinghub/scrapy-poet int…
BurnzZ Nov 30, 2022
20f51a6
detect and raise errors on deadlocks
BurnzZ Nov 30, 2022
4b60fa9
fix failing injector test
BurnzZ Nov 30, 2022
caa1be6
ensure that provider dependencies are cached
BurnzZ Nov 30, 2022
ae05e90
modify deadlock detection to a simple try-except
BurnzZ Dec 1, 2022
d6a33a4
fix failing test_injection.py tests
BurnzZ Dec 1, 2022
a4cff73
ensure that .to_item() methods are only called once
BurnzZ Dec 1, 2022
6bc839f
add a test with a deeper dependency tree
BurnzZ Dec 1, 2022
4aedf16
test duplicate dependencies
BurnzZ Dec 1, 2022
56028d7
fix missing tests and imports
BurnzZ Dec 1, 2022
41ff13e
deprecate passing tuples in SCRAPY_POET_OVERRIDES and the Registry wi…
BurnzZ Dec 2, 2022
2ec6414
refactor Injector to simplify recursive dependency resolution of items
BurnzZ Dec 5, 2022
f3fb32d
polish code and tests
BurnzZ Dec 6, 2022
544236f
fix failing mypy and polish code
BurnzZ Dec 6, 2022
29f40ab
update CHANGELOG with new item class support
BurnzZ Dec 6, 2022
66f0c90
fix typo in CHANGELOG
BurnzZ Dec 6, 2022
2697ab0
improve test_web_poet_rules.py
BurnzZ Dec 6, 2022
35b0c8d
polishing comments and typing
BurnzZ Dec 9, 2022
d2beaf8
mention backward incompatible changes in CHANGELOG
BurnzZ Dec 12, 2022
d046903
Merge branch 'master' of ssh://github.com/scrapinghub/scrapy-poet int…
BurnzZ Dec 16, 2022
8f1450a
deprecate some settings, modules, and parameters to be overrides-agno…
BurnzZ Dec 16, 2022
6f0d36e
update documentation in line with the new Item Return functionality
BurnzZ Dec 16, 2022
77cf77c
update tutorial with more explanation on how Item Return works
BurnzZ Dec 16, 2022
efbdb66
update CHANGELOG to mention other backward incompatible changes
BurnzZ Dec 21, 2022
9b4cd48
add and improve docstrings, typing, and warning msgs
BurnzZ Dec 21, 2022
5d2f0f9
move some functions to new scrapy_poet.utils.testing module
BurnzZ Dec 21, 2022
58577a8
Merge branch 'master' of ssh://github.com/scrapinghub/scrapy-poet int…
BurnzZ Dec 21, 2022
afc04e9
Apply improvements from code review
BurnzZ Dec 21, 2022
4141239
prioritize newer settings than deprecated ones
BurnzZ Dec 21, 2022
dae69d8
simplify to_return doc example
BurnzZ Dec 22, 2022
ccfa9ea
fix and improve docs
BurnzZ Dec 23, 2022
e9bb33d
use DummyResponse on some examples
BurnzZ Dec 23, 2022
3667cc3
remove obsolete test
BurnzZ Dec 23, 2022
22c959d
Polish CHANGELOG from review
BurnzZ Jan 3, 2023
545e8f1
Merge branch 'master' of ssh://github.com/scrapinghub/scrapy-poet int…
BurnzZ Jan 3, 2023
83e0e84
fix missing imports in tests
BurnzZ Jan 3, 2023
47f213c
rename 'item type' → 'item class'
BurnzZ Jan 3, 2023
914a334
Merge branch 'master' of ssh://github.com/scrapinghub/scrapy-poet int…
BurnzZ Jan 4, 2023
6af1061
Fix conflicts; Merge branch 'new-web-poet' of ssh://github.com/scrapi…
BurnzZ Jan 4, 2023
190e3a6
use web-poet's _create_deprecated_class
BurnzZ Jan 6, 2023
2611199
remove incorrect line in CHANGELOG
BurnzZ Jan 6, 2023
7bd6783
remove scrapy-poet registry in lieu of web-poet's registry
BurnzZ Jan 10, 2023
3c6fdae
avoid using RulesRegistry.search() since it's slow
BurnzZ Jan 10, 2023
ef01f11
add test to check higher priority of PO subclass
BurnzZ Jan 10, 2023
f41b5c2
Merge pull request #103 from scrapinghub/to-return-override-docs
BurnzZ Jan 10, 2023
c658317
use RulesRegistry.search() again after optimizing it
BurnzZ Jan 10, 2023
3e852d7
fix doc grammar
BurnzZ Jan 11, 2023
4d25d8c
mark tests as xfail if it raises UndeclaredProvidedTypeError
BurnzZ Jan 13, 2023
e184c6f
better tests for clashing rules due to independent page objects with …
BurnzZ Jan 13, 2023
bf9b7bf
fix misleading class names
BurnzZ Jan 13, 2023
33a0391
add more tests on deadlock detection
BurnzZ Jan 13, 2023
141c495
use new web-poet==0.7.0
BurnzZ Jan 18, 2023
3d464e6
Merge branch 'master' of ssh://github.com/scrapinghub/scrapy-poet int…
BurnzZ Jan 18, 2023
8c410fe
fixed merge conflicts in CHANGELOG
BurnzZ Jan 18, 2023
00d5dd6
improve docs on settings
BurnzZ Jan 19, 2023
fd31c93
Merge branch 'master' into new-web-poet
BurnzZ Jan 19, 2023
199c46b
fix conflict in code
BurnzZ Jan 19, 2023
7c1f5f1
add test for checking deprecated SCRAPY_POET_OVERRIDES
BurnzZ Jan 19, 2023
44c6e60
add test when requesting an item but no page object
BurnzZ Jan 19, 2023
4791576
issue a warning when can't provide a page object or item for a given URL
BurnzZ Jan 19, 2023
e3b7a8e
remove support for custom registry via SCRAPY_POET_OVERRIDES_REGISTRY
BurnzZ Jan 19, 2023
0915b00
re-organize CHANGELOG
BurnzZ Jan 19, 2023
a46b1e2
fix some docs and comments for clarity
BurnzZ Jan 30, 2023
774619c
Merge branch 'master' of ssh://github.com/scrapinghub/scrapy-poet int…
BurnzZ Jan 30, 2023
140239a
bump tool versions to fix CI failure
kmike Jan 30, 2023
4 changes: 2 additions & 2 deletions .pre-commit-config.yaml
@@ -3,12 +3,12 @@ repos:
- id: black
language_version: python3
repo: https://github.com/ambv/black
rev: 22.3.0
rev: 22.12.0
- hooks:
- id: isort
language_version: python3
repo: https://github.com/PyCQA/isort
rev: 5.10.1
rev: 5.11.5
- hooks:
- id: flake8
language_version: python3
122 changes: 122 additions & 0 deletions CHANGELOG.rst
@@ -2,6 +2,128 @@
Changelog
=========

TBR
---

* Added support for item classes which are used as dependencies in page objects
and spider callbacks. The following is now possible:

.. code-block:: python

    import attrs
    import scrapy
    from web_poet import WebPage, handle_urls, field
    from scrapy_poet import DummyResponse

    @attrs.define
    class Image:
        url: str

    @handle_urls("example.com")
    class ProductImagePage(WebPage[Image]):
        @field
        def url(self) -> str:
            return self.css("#product img ::attr(href)").get("")

    @attrs.define
    class Product:
        name: str
        image: Image

    @handle_urls("example.com")
    @attrs.define
    class ProductPage(WebPage[Product]):
        # ✨ NEW: Notice that the page object can ask for items as dependencies.
        # An instance of ``Image`` is injected behind the scenes by calling the
        # ``.to_item()`` method of ``ProductImagePage``.
        image_item: Image

        @field
        def name(self) -> str:
            return self.css("h1.name ::text").get("")

        @field
        def image(self) -> Image:
            return self.image_item

    class MySpider(scrapy.Spider):
        name = "myspider"

        def start_requests(self):
            yield scrapy.Request(
                "https://example.com/products/some-product", self.parse
            )

        # ✨ NEW: Notice that we're directly using the item here and not the
        # page object.
        def parse(self, response: DummyResponse, item: Product):
            return item


In line with this, the following new features were added:

* Added a new :class:`scrapy_poet.page_input_providers.ItemProvider` which
makes the usage above possible.

* An item class is now supported by :func:`scrapy_poet.callback_for`
alongside the usual page objects. This means that it won't raise a
:class:`TypeError` anymore when not passing a subclass of
:class:`web_poet.pages.ItemPage`.

* New exception: :class:`scrapy_poet.injection_errors.ProviderDependencyDeadlockError`.
This is raised when it's not possible to create the dependencies due to
a deadlock in their sub-dependencies, e.g. due to a circular dependency
between page objects.

* Moved some of the utility functions from the test module into
``scrapy_poet.utils.testing``.

* Documentation improvements.

* Deprecations:

* The ``SCRAPY_POET_OVERRIDES`` setting has been replaced by
``SCRAPY_POET_RULES``.

* Backward incompatible changes:

* Overriding the default registry used via ``SCRAPY_POET_OVERRIDES_REGISTRY``
is not possible anymore.

* The following type aliases have been removed:

* ``scrapy_poet.overrides.RuleAsTuple``
* ``scrapy_poet.overrides.RuleFromUser``

* The :class:`scrapy_poet.page_input_providers.PageObjectInputProvider` base
class has these changes:

* It now accepts an instance of :class:`scrapy_poet.injection.Injector`
in its constructor instead of :class:`scrapy.crawler.Crawler`. You can
still access the :class:`scrapy.crawler.Crawler` via the
``Injector.crawler`` attribute.

* :meth:`scrapy_poet.page_input_providers.PageObjectInputProvider.is_provided`
is now an instance method instead of a class method.

* The :class:`scrapy_poet.injection.Injector`'s attribute and constructor
parameter called ``overrides_registry`` is now simply called ``registry``.

* The ``scrapy_poet.overrides`` module which contained ``OverridesRegistryBase``
and ``OverridesRegistry`` has now been removed. Instead, scrapy-poet directly
uses :class:`web_poet.rules.RulesRegistry`.

Everything should be pretty much the same, except that
:meth:`web_poet.rules.RulesRegistry.overrides_for` now accepts :class:`str`,
:class:`web_poet.page_inputs.http.RequestUrl`, or
:class:`web_poet.page_inputs.http.ResponseUrl` instead of
:class:`scrapy.http.Request`.

* This also means that the registry doesn't accept tuples as rules anymore.
Only :class:`web_poet.rules.ApplyRule` instances are allowed. The same goes
for ``SCRAPY_POET_RULES`` (and the deprecated ``SCRAPY_POET_OVERRIDES``).
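
The ``ProviderDependencyDeadlockError`` entry above describes circular dependencies between page objects. The following is a minimal, self-contained sketch of how such a cycle can be detected while resolving dependencies recursively, caching each result so that it is built only once. It is an illustration under stated assumptions, not scrapy-poet's implementation: ``resolve``, ``DependencyDeadlockError`` and the ``specs`` mapping are hypothetical names introduced here.

```python
from typing import Any, Callable, Dict, Optional, Tuple


class DependencyDeadlockError(Exception):
    """Hypothetical stand-in for ProviderDependencyDeadlockError."""


# Hypothetical spec format: name -> (dependency names, factory function).
Spec = Tuple[Tuple[str, ...], Callable[..., Any]]


def resolve(
    name: str,
    specs: Dict[str, Spec],
    cache: Optional[Dict[str, Any]] = None,
    in_progress: Tuple[str, ...] = (),
) -> Any:
    """Build ``name`` after recursively building its dependencies."""
    if cache is None:
        cache = {}
    if name in cache:
        # Already built: reuse it so each factory runs only once.
        return cache[name]
    if name in in_progress:
        # We came back to something that is still being built: a cycle.
        cycle = " -> ".join(in_progress + (name,))
        raise DependencyDeadlockError(f"circular dependency: {cycle}")
    dep_names, factory = specs[name]
    deps = [resolve(d, specs, cache, in_progress + (name,)) for d in dep_names]
    cache[name] = factory(*deps)
    return cache[name]


# An acyclic graph resolves normally: Product needs Image.
specs = {
    "Image": ((), lambda: {"url": "https://example.com/image.png"}),
    "Product": (("Image",), lambda image: {"name": "Chair", "image": image}),
}
product = resolve("Product", specs)

# Two page objects that need each other cannot be built.
cyclic_specs = {
    "PageA": (("PageB",), lambda b: ("A", b)),
    "PageB": (("PageA",), lambda a: ("B", a)),
}
try:
    resolve("PageA", cyclic_specs)
    deadlock_message = ""
except DependencyDeadlockError as exc:
    deadlock_message = str(exc)
```

Tracking the names that are still being built makes the error message show the full cycle, which is usually the quickest way to find the page object that should drop a dependency.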


0.8.0 (2023-01-24)
------------------

8 changes: 1 addition & 7 deletions docs/api_reference.rst
@@ -14,7 +14,7 @@ API
Injection Middleware
====================

.. automodule:: scrapy_poet.middleware
.. automodule:: scrapy_poet.downloadermiddlewares
:members:

Page Input Providers
@@ -43,9 +43,3 @@ Injection errors

.. automodule:: scrapy_poet.injection_errors
:members:

Overrides
=========

.. automodule:: scrapy_poet.overrides
:members:
2 changes: 1 addition & 1 deletion docs/index.rst
@@ -43,7 +43,7 @@ To get started, see :ref:`intro-install` and :ref:`intro-tutorial`.
:caption: Advanced
:maxdepth: 1

overrides
rules-from-web-poet
providers
testing

50 changes: 16 additions & 34 deletions docs/intro/basic-tutorial.rst
@@ -414,17 +414,17 @@ The spider won't work anymore after the change. The reason is that it
is using the new base Page Objects and they are empty.
Let's fix it by instructing ``scrapy-poet`` to use the Books To Scrape (BTS)
Page Objects for URLs belonging to the domain ``toscrape.com``. This must
be done by configuring ``SCRAPY_POET_OVERRIDES`` into ``settings.py``:
be done by configuring ``SCRAPY_POET_RULES`` in ``settings.py``:

.. code-block:: python

"SCRAPY_POET_OVERRIDES": [
"SCRAPY_POET_RULES": [
ApplyRule("toscrape.com", use=BTSBookListPage, instead_of=BookListPage),
ApplyRule("toscrape.com", use=BTSBookPage, instead_of=BookPage)
]

The spider is back to life!
``SCRAPY_POET_OVERRIDES`` contain rules that overrides the Page Objects
``SCRAPY_POET_RULES`` contains rules that override the Page Objects
used for a particular domain. In this particular case, Page Objects
``BTSBookListPage`` and ``BTSBookPage`` will be used instead of
``BookListPage`` and ``BookPage`` for any request whose domain is
@@ -465,16 +465,18 @@ to implement new ones:

The last step is configuring the overrides so that these new Page Objects
are used for the domain
``bookpage.com``. This is how ``SCRAPY_POET_OVERRIDES`` should look like into
``bookpage.com``. This is how ``SCRAPY_POET_RULES`` should look in
``settings.py``:

.. code-block:: python

"SCRAPY_POET_OVERRIDES": [
("toscrape.com", BTSBookListPage, BookListPage),
("toscrape.com", BTSBookPage, BookPage),
("bookpage.com", BPBookListPage, BookListPage),
("bookpage.com", BPBookPage, BookPage)
from web_poet import ApplyRule

"SCRAPY_POET_RULES": [
ApplyRule("toscrape.com", use=BTSBookListPage, instead_of=BookListPage),
ApplyRule("toscrape.com", use=BTSBookPage, instead_of=BookPage),
ApplyRule("bookpage.com", use=BPBookListPage, instead_of=BookListPage),
ApplyRule("bookpage.com", use=BPBookPage, instead_of=BookPage)
]

The spider is now ready to extract books from both sites 😀.
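
Projects that still carry the old tuple form can convert it mechanically, since each tuple is ordered as ``(pattern, use, instead_of)``. The sketch below assumes that ordering and passes each element to ``ApplyRule`` with the same keywords shown in the settings above. The helper ``tuples_to_rules`` and the four placeholder page classes are hypothetical, and the ``except ImportError`` branch only defines a small stand-in so that the snippet also runs where web-poet is not installed.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

try:
    from web_poet import ApplyRule
except ImportError:
    # Minimal stand-in so the sketch runs without web-poet installed.
    # It is NOT the real class; it only mirrors the arguments used here.
    @dataclass(frozen=True)
    class ApplyRule:  # type: ignore[no-redef]
        for_patterns: str
        use: type
        instead_of: Optional[type] = None


# Hypothetical placeholders for the tutorial's page objects.
class BookListPage:
    pass


class BTSBookListPage(BookListPage):
    pass


class BookPage:
    pass


class BTSBookPage(BookPage):
    pass


# Legacy rule shape: (URL pattern, page object to use, page object replaced).
LegacyRule = Tuple[str, type, type]


def tuples_to_rules(legacy: List[LegacyRule]) -> List[ApplyRule]:
    """Convert (pattern, use, instead_of) tuples into ApplyRule instances."""
    return [
        ApplyRule(pattern, use=use, instead_of=instead_of)
        for pattern, use, instead_of in legacy
    ]


legacy_overrides = [
    ("toscrape.com", BTSBookListPage, BookListPage),
    ("toscrape.com", BTSBookPage, BookPage),
]

SCRAPY_POET_RULES = tuples_to_rules(legacy_overrides)
```

A one-off conversion like this is mainly useful during migration; once the rules are written as ``ApplyRule`` instances directly, the helper can be deleted.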
@@ -490,27 +492,6 @@ for a particular domain, but more complex URL patterns are also possible.
For example, the pattern ``books.toscrape.com/catalogue/category/``
is accepted and it would restrict the override only to category pages.

It is even possible to configure more complex patterns by using the
:py:class:`web_poet.rules.ApplyRule` class instead of a triplet in
the configuration. Another way of declaring the earlier config
for ``SCRAPY_POET_OVERRIDES`` would be the following:

.. code-block:: python

from url_matcher import Patterns
from web_poet import ApplyRule


SCRAPY_POET_OVERRIDES = [
ApplyRule(for_patterns=Patterns(["toscrape.com"]), use=BTSBookListPage, instead_of=BookListPage),
ApplyRule(for_patterns=Patterns(["toscrape.com"]), use=BTSBookPage, instead_of=BookPage),
ApplyRule(for_patterns=Patterns(["bookpage.com"]), use=BPBookListPage, instead_of=BookListPage),
ApplyRule(for_patterns=Patterns(["bookpage.com"]), use=BPBookPage, instead_of=BookPage),
]

As you can see, this could get verbose. The earlier tuple config simply offers
a shortcut to be more concise.

.. note::

Also see the `url-matcher <https://url-matcher.readthedocs.io/en/stable/>`_
@@ -530,11 +511,11 @@ and store the :py:class:`web_poet.rules.ApplyRule` for you. All of the
# rules from other packages. Otherwise, it can be omitted.
# More info about this caveat on web-poet docs.
consume_modules("external_package_A", "another_ext_package.lib")
SCRAPY_POET_OVERRIDES = default_registry.get_rules()
SCRAPY_POET_RULES = default_registry.get_rules()

For more info on this, you can refer to these docs:

* ``scrapy-poet``'s :ref:`overrides` Tutorial section.
* ``scrapy-poet``'s :ref:`rules-from-web-poet` Tutorial section.
* External `web-poet`_ docs.

* Specifically, the :external:ref:`rules-intro` Tutorial section.
@@ -545,7 +526,8 @@ Next steps
Now that you know how ``scrapy-poet`` is supposed to work, what about trying to
apply it to an existing or new Scrapy project?

Also, please check the :ref:`overrides` and :ref:`providers` sections as well as
refer to spiders in the "example" folder: https://github.com/scrapinghub/scrapy-poet/tree/master/example/example/spiders
Also, please check the :ref:`rules-from-web-poet` and :ref:`providers` sections
as well as refer to spiders in the "example" folder:
https://github.com/scrapinghub/scrapy-poet/tree/master/example/example/spiders

.. _Scrapy Tutorial: https://docs.scrapy.org/en/latest/intro/tutorial.html