@handle_urls() with item return type #84

Merged
merged 48 commits into from
Oct 27, 2022
Changes from 37 commits
Commits
aa8698c
update handle_urls() and OverrideRule to accept data types
BurnzZ Sep 28, 2022
cc5ec17
update flake8 to ignore D102 in tests/po_lib_data_type
BurnzZ Sep 28, 2022
c135b2d
update mypy config to ignore tests.po_lib_data_type
BurnzZ Sep 28, 2022
7837b3c
rename 'data_type' into 'to_return' in handle_urls() and OverrideRule
BurnzZ Oct 3, 2022
eb7e1b2
rename 'overrides' into 'instead_of' in @handle_urls
BurnzZ Oct 3, 2022
904dd8c
rename and deprecate OverrideRule into ApplyRule
BurnzZ Oct 3, 2022
03f5656
fix tests
BurnzZ Oct 3, 2022
3ed8415
update CHANGELOG with regards to ApplyRule and @handle_urls changes
BurnzZ Oct 3, 2022
3a5499c
create from_apply_rules method in PageObjectRegistry; deprecate from_…
BurnzZ Oct 3, 2022
01112e7
rename 'web_poet.overrides' into 'web_poet.rules'
BurnzZ Oct 3, 2022
b1c7e14
rename PageObjectRegistry's methods: get_overrides → get_rules, searc…
BurnzZ Oct 3, 2022
1afddc9
import * from 'rules' in 'overrides'
BurnzZ Oct 3, 2022
144e39e
prioritize 'to_return' parameter compared to derived item_cls
BurnzZ Oct 3, 2022
fee63a5
fix the deprecated 'overrides' parameter not being used if present
BurnzZ Oct 6, 2022
33a48ac
enable auto-conversion to url_matcher.Patterns on ApplyRules.for_patt…
BurnzZ Oct 6, 2022
9b6f9c4
update all arguments of ApplyRule to be keyword-only except 'for_patt…
BurnzZ Oct 7, 2022
8264565
fix mypy issue in ApplyRule tests
BurnzZ Oct 7, 2022
3287881
improve tests
BurnzZ Oct 14, 2022
c5faf38
update docstrings/tutorials regarding the new 'to_return' parameter
BurnzZ Oct 14, 2022
9729e8f
clean-up CHANGELOG formatting
BurnzZ Oct 14, 2022
8efa813
update CHANGELOG to soften the value of the 'to_return' param
BurnzZ Oct 14, 2022
52a6f47
update override docs to change the tone about the 'to_return' parameter
BurnzZ Oct 14, 2022
2cb518b
Apply naming and grammar suggestions
BurnzZ Oct 14, 2022
88c511d
test improvements
BurnzZ Oct 14, 2022
928188e
remove 'preferred' param of get_item_cls()
BurnzZ Oct 14, 2022
3679b59
update default behavior of @handle_urls to return dict instead of None
BurnzZ Oct 14, 2022
fc0ba50
improve the docstring of handle_urls()
BurnzZ Oct 14, 2022
5967e34
update docs by removing tick mark chars in anchors
BurnzZ Oct 14, 2022
0626e57
rename some *.com URLs into *.example in docs and tests
BurnzZ Oct 17, 2022
5fdf4a1
Merge branch 'master' of ssh://github.com/scrapinghub/web-poet into h…
BurnzZ Oct 17, 2022
de15a86
revert default 'to_return=dict' and use 'None' instead
BurnzZ Oct 18, 2022
f354c4a
improve docs and code comments
BurnzZ Oct 18, 2022
bce97be
improve docstring of 'search_rules()'
BurnzZ Oct 19, 2022
4e00ea8
Improve the docs
BurnzZ Oct 25, 2022
59381a5
add reference link to Page Objects in Overrides tutorial
BurnzZ Oct 25, 2022
076e7bb
remove mention of 'to_return' in @handle_url doc examples
BurnzZ Oct 26, 2022
1419f7a
improve tests
BurnzZ Oct 26, 2022
776cf0d
improve docstrings and warning messages
BurnzZ Oct 26, 2022
42bd123
Fix test case when ensuring that ApplyRule is frozen
BurnzZ Oct 26, 2022
36cd866
update tests to check each param change on hash()
BurnzZ Oct 26, 2022
383b4f7
update 'Item Class' to 'item class'
BurnzZ Oct 26, 2022
4755ce9
add an Overview section to the Overrides docs; rename them to Apply R…
kmike Oct 26, 2022
7240c9e
bump Sphinx version
kmike Oct 26, 2022
85b9b7b
simplify ApplyRule docstring
kmike Oct 26, 2022
ddaed2f
typo fix
kmike Oct 26, 2022
660a8cd
Apply suggestions from code review
kmike Oct 27, 2022
3e23c51
mention str -> Patterns conversion
kmike Oct 27, 2022
2c570e2
Merge pull request #90 from scrapinghub/handle_urls-docs
kmike Oct 27, 2022
2 changes: 2 additions & 0 deletions .flake8
@@ -34,5 +34,7 @@ per-file-ignores =
# imports are there to expose submodule functions so they can be imported
# directly from that module
# F403: Ignore * imports in these files
# D102: Missing docstring in public method
web_poet/__init__.py:F401,F403
web_poet/page_inputs/__init__.py:F401,F403
tests/po_lib_to_return/__init__.py:D102
36 changes: 36 additions & 0 deletions CHANGELOG.rst
@@ -2,6 +2,42 @@
Changelog
=========

TBD
---

* New ``ApplyRule`` class created by the ``@handle_urls`` decorator. This is
  nearly identical to ``OverrideRule`` except:

  * It now accepts a ``to_return`` parameter, which signifies the data
    container class that the Page Object returns.
  * Passing a string to ``for_patterns`` auto-converts it into
    ``url_matcher.Patterns``.
  * All arguments are now keyword-only except for ``for_patterns``.

* Modify the call signature and behavior of ``handle_urls``:

  * New ``instead_of`` parameter which does the same thing as ``overrides``.
  * The old ``overrides`` parameter is no longer required, as it is
    deprecated.
  * It sets a ``to_return`` parameter when creating ``ApplyRule`` based on the
    declared item class in subclasses of ``web_poet.ItemPage``. It's also
    possible to pass a ``to_return`` parameter for more advanced use cases.

* Documentation, test, and warning message improvements.
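
The ``to_return`` behavior listed above (derive the item class from the ``ItemPage`` parametrization, but let an explicit ``to_return`` take priority) can be illustrated with the standard library alone. This is only a sketch under stated assumptions: ``ItemPage`` here is a stand-in class rather than ``web_poet.ItemPage``, and ``declared_item_cls`` and ``resolve_to_return`` are hypothetical helper names that do not come from web-poet, whose actual implementation may differ.

```python
from dataclasses import dataclass
from typing import Generic, Optional, TypeVar, get_args, get_origin

ItemT = TypeVar("ItemT")


class ItemPage(Generic[ItemT]):
    """Stand-in for web_poet.ItemPage, for illustration only."""


@dataclass
class Product:
    name: str


class ProductPage(ItemPage[Product]):
    """A page object that declares Product as its item class."""


class PlainPage(ItemPage):
    """A page object that declares no item class."""


def declared_item_cls(page_cls: type) -> Optional[type]:
    """Return the class used to parametrize ItemPage, or None if there is none."""
    for klass in page_cls.__mro__:
        # __orig_bases__ keeps the subscripted bases, e.g. ItemPage[Product].
        for base in klass.__dict__.get("__orig_bases__", ()):
            if get_origin(base) is ItemPage:
                (arg,) = get_args(base)
                if isinstance(arg, type):
                    return arg
    return None


def resolve_to_return(page_cls: type, to_return: Optional[type] = None) -> Optional[type]:
    """Prefer an explicit to_return; otherwise fall back to the declared item class."""
    if to_return is not None:
        return to_return
    return declared_item_cls(page_cls)
```

With these stand-ins, ``resolve_to_return(ProductPage)`` yields ``Product``, ``resolve_to_return(ProductPage, to_return=dict)`` yields ``dict``, and ``resolve_to_return(PlainPage)`` yields ``None``.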

Deprecations:

* The ``overrides`` parameter of ``@handle_urls`` is now deprecated.
  Use the ``instead_of`` parameter instead.
* The ``OverrideRule`` class is now deprecated. Use ``ApplyRule`` instead.
* The ``from_override_rules`` method of ``PageObjectRegistry`` is now deprecated.
  Use ``from_apply_rules`` instead.
* The ``web_poet.overrides`` module is deprecated. Use ``web_poet.rules`` instead.
* The ``PageObjectRegistry.get_overrides`` method is deprecated.
  Use ``PageObjectRegistry.get_rules`` instead.
* The ``PageObjectRegistry.search_overrides`` method is deprecated.
  Use ``PageObjectRegistry.search_rules`` instead.
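
A deprecated keyword such as ``overrides`` is commonly kept working next to its replacement while emitting a ``DeprecationWarning``. The sketch below shows that general pattern using only the ``warnings`` module. ``resolve_instead_of`` is a hypothetical helper, not part of web-poet, and the choice to let ``instead_of`` win when both are given is an assumption of this sketch.

```python
import warnings
from typing import Optional


def resolve_instead_of(
    instead_of: Optional[type] = None,
    overrides: Optional[type] = None,
) -> Optional[type]:
    """Return the class to replace, honoring the deprecated 'overrides' alias."""
    if overrides is not None:
        warnings.warn(
            "The 'overrides' parameter is deprecated; use 'instead_of'.",
            DeprecationWarning,
            stacklevel=2,
        )
        if instead_of is None:
            # Old-style callers still get the behavior they asked for.
            return overrides
    return instead_of
```

Callers that still pass ``overrides=`` keep working but see the warning, which gives them time to migrate before the parameter is removed.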

0.5.1 (2022-09-23)
------------------

14 changes: 7 additions & 7 deletions docs/advanced/additional-requests.rst
@@ -1,4 +1,4 @@
.. _`advanced-requests`:
.. _advanced-requests:

===================
Additional Requests
@@ -27,7 +27,7 @@ The key words "MUST”, "MUST NOT”, "REQUIRED”, "SHALL”, "SHALL NOT”, "S
"SHOULD NOT”, "RECOMMENDED”, "MAY”, and "OPTIONAL” in this document are to be
interpreted as described in RFC `2119 <https://www.ietf.org/rfc/rfc2119.txt>`_.

.. _`httprequest-example`:
.. _httprequest-example:

HttpRequest
===========
@@ -271,7 +271,7 @@ The key take aways for this example are:
available.


.. _`httpclient`:
.. _httpclient:

HttpClient
==========
@@ -337,7 +337,7 @@ additional requests using the :meth:`~.HttpClient.request`, :meth:`~.HttpClient.
and :meth:`~.HttpClient.post` methods of :class:`~.HttpClient`. These already
define the :class:`~.HttpRequest` and execute it as well.

.. _`httpclient-get-example`:
.. _httpclient-get-example:

A simple ``GET`` request
------------------------
@@ -376,7 +376,7 @@ There are a few things to take note in this example:
* There is no need to create an instance of :class:`~.HttpRequest` when
  :meth:`~.HttpClient.get` is used.

.. _`request-post-example`:
.. _request-post-example:

A ``POST`` request with `header` and `body`
-------------------------------------------
@@ -459,7 +459,7 @@ quick shortcuts for :meth:`~.HttpClient.request`:
Thus, apart from the common ``GET`` and ``POST`` HTTP methods, you can use
:meth:`~.HttpClient.request` for them (`e.g.` ``HEAD``, ``PUT``, ``DELETE``, etc).

.. _`http-batch-request-example`:
.. _http-batch-request-example:

Batch requests
--------------
@@ -567,7 +567,7 @@ The key takeaways for this example are:
first response from a group of requests as early as possible. However, the
order could be shuffled.

.. _`exception-handling`:
.. _exception-handling:

Handling Exceptions in Page Objects
===================================
32 changes: 17 additions & 15 deletions docs/advanced/fields.rst
@@ -179,11 +179,13 @@ It's also possible to implement field cleaning and processing in ``to_item``
but in that case accessing a field directly will return the value without
processing, so it's preferable to use field processors instead.

Item classes
.. _item-classes:

Item Classes
------------

In all previous examples, ``to_item`` methods are returning ``dict``
instances. It is common to use item classes (e.g. dataclasses or
instances. It is common to use Item Classes (e.g. dataclasses or
attrs instances) instead of unstructured dicts to hold the data:

.. code-block:: python
@@ -207,29 +209,29 @@ attrs instances) instead of unstructured dicts to hold the data:
)

:mod:`web_poet.fields` supports it, by allowing to parametrize
:class:`~.ItemPage` with an item class:
:class:`~.ItemPage` with an Item Class:

.. code-block:: python

    @attrs.define
    class ProductPage(ItemPage[Product]):
        # ...

When :class:`~.ItemPage` is parametrized with an item class,
When :class:`~.ItemPage` is parametrized with an Item Class,
its ``to_item()`` method starts to return item instances, instead
of ``dict`` instances. In the example above ``ProductPage.to_item`` method
returns ``Product`` instances.

Defining an Item class may be an overkill if you only have a single Page Object,
but item classes are of a great help when
Defining an Item Class may be an overkill if you only have a single Page Object,
but Item Classes are of a great help when

* you need to extract data in the same format from multiple websites, or
* if you want to define the schema upfront.

Error prevention
~~~~~~~~~~~~~~~~

Item classes play particularly well with the
Item Classes play particularly well with the
:func:`@field <web_poet.fields.field>` decorator, preventing some of the errors,
which may happen if results are plain "dicts".

@@ -254,7 +256,7 @@ Consider the following badly written page object:
        def nane(self):
            return self.response.css(".name").get()

Because the ``Product`` item class is used, a typo ("nane" instead of "name")
Because the ``Product`` Item Class is used, a typo ("nane" instead of "name")
is detected at runtime: the creation of a ``Product`` instance would fail with
a ``TypeError``, because of the unexpected keyword argument "nane".

@@ -263,10 +265,10 @@ detected: the ``price`` argument is required, but there is no extraction method
this attribute, so ``Product.__init__`` will raise another ``TypeError``,
indicating that a required argument is missing.

Without an item class, none of these errors are detected.
Without an Item Class, none of these errors are detected.
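
The two failure modes described above can be reproduced with a plain ``dataclass``, without involving web-poet at all. In this sketch the ``Product`` fields and the ``build_error`` helper are illustrative assumptions; the point is only that constructing an item class rejects a misspelled or missing field, while a ``dict`` accepts anything.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Product:
    name: str
    price: str


def build_error(item_cls: type, extracted: dict) -> Optional[str]:
    """Try to build an item; return the TypeError message, or None on success."""
    try:
        item_cls(**extracted)
    except TypeError as exc:
        return str(exc)
    return None


# A typo in a field name ("nane" instead of "name") is rejected.
typo_error = build_error(Product, {"nane": "Chair", "price": "10.00"})

# A missing required field ("price") is rejected too.
missing_error = build_error(Product, {"name": "Chair"})

# A plain dict accepts the misspelled key silently.
as_dict = dict(nane="Chair")
```

Both ``typo_error`` and ``missing_error`` hold a ``TypeError`` message naming the offending field, whereas ``as_dict`` is created without complaint.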

Changing Item type
~~~~~~~~~~~~~~~~~~
Changing Item Class
~~~~~~~~~~~~~~~~~~~

Let's say there is a Page Object implemented, which outputs some standard
item. Maybe there is a library of such Page Objects available. But for a
@@ -333,7 +335,7 @@ to the item:
# ...

Note how :class:`~.Returns` is used as one of the base classes of
``CustomFooPage``; it allows to change the item type returned by a page object.
``CustomFooPage``; it allows to change the Item Class returned by a page object.

Removing fields (as well as renaming) is a bit more tricky.

@@ -344,7 +346,7 @@ inherit from the "base", "standard" Page Object, there could be a ``@field``
from the base class which is not present in the ``CustomItem``.
It'd be still passed to ``CustomItem.__init__``, causing an exception.

One way to solve it is to make the orignal Page Object a dependency
One way to solve it is to make the original Page Object a dependency
instead of inheriting from it, as explained in the beginning.

Alternatively, you can use ``skip_nonitem_fields=True`` class argument - it tells
@@ -368,13 +370,13 @@ is passed, and ``name`` is the only field ``CustomItem`` supports.

To recap:

* Use ``Returns[NewItemType]`` to change the item type in a subclass.
* Use ``Returns[NewItemType]`` to change the Item Class in a subclass.
* Don't use ``skip_nonitem_fields=True`` when your Page Object corresponds
to an item exactly, or when you're only adding fields. This is a safe
approach, which allows to detect typos in field names, even for optional
fields.
* Use ``skip_nonitem_fields=True`` when it's possible for the Page Object
to contain more ``@fields`` than defined in the item class, e.g. because
to contain more ``@fields`` than defined in the Item Class, e.g. because
Page Object is inherited from some other base Page Object.
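
To make the effect of ``skip_nonitem_fields=True`` concrete, the idea of passing only the fields that the item class defines can be sketched with ``dataclasses.fields``. This is an illustration of the concept, not web-poet's implementation: ``keep_item_fields`` is a hypothetical helper, and the sample field values are assumptions.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict


@dataclass
class CustomItem:
    name: str


def keep_item_fields(item_cls: type, data: Dict[str, Any]) -> Dict[str, Any]:
    """Drop every key that is not a field of the given dataclass."""
    allowed = {f.name for f in fields(item_cls)}
    return {key: value for key, value in data.items() if key in allowed}


# The page object produces more fields than CustomItem supports.
page_fields = {"name": "Chair", "price": "10.00"}

# Only "name" is passed to CustomItem, so construction succeeds.
item = CustomItem(**keep_item_fields(CustomItem, page_fields))
```

The trade-off noted in the recap applies here as well: filtering like this also hides a misspelled field name, which is why it is best reserved for page objects that legitimately carry extra fields.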

Caching
6 changes: 3 additions & 3 deletions docs/api-reference.rst
@@ -1,4 +1,4 @@
.. _`api-reference`:
.. _api-reference:

=============
API Reference
@@ -81,7 +81,7 @@ Exceptions
:show-inheritance:
:members:

.. _`api-overrides`:
.. _api-overrides:

Overrides
=========
@@ -91,7 +91,7 @@ use cases and some examples.

.. autofunction:: web_poet.handle_urls

.. automodule:: web_poet.overrides
.. automodule:: web_poet.rules
:members:
:exclude-members: handle_urls

2 changes: 1 addition & 1 deletion docs/index.rst
@@ -53,7 +53,7 @@ and the motivation behind ``web-poet``, start with :ref:`from-ground-up`.
changelog
license

.. _`web-poet`: https://github.com/scrapinghub/web-poet
.. _web-poet: https://github.com/scrapinghub/web-poet
.. _Scrapy: https://scrapy.org/
.. _scrapy-poet: https://github.com/scrapinghub/scrapy-poet

2 changes: 1 addition & 1 deletion docs/intro/from-ground-up.rst
@@ -1,4 +1,4 @@
.. _`from-ground-up`:
.. _from-ground-up:

===========================
web-poet from the ground up