
Commit

Don't fail if a feed has two entries with the same id
or two parallel updates add the same entry. #335
lemon24 committed May 26, 2024
1 parent 4c37b26 commit 788ddff
Showing 4 changed files with 33 additions and 6 deletions.
7 changes: 7 additions & 0 deletions CHANGES.rst
@@ -11,6 +11,12 @@ Version 3.13

Unreleased

* Fix bug introduced in `version 3.12 <Version 3.12_>`_ causing an assertion error
  when there are multiple entries with the same id in the same feed,
  or when parallel :meth:`~Reader.update_feeds` calls add the same entry.
  The fix restores the pre-3.12 first-entry-wins / last-write-wins behavior.
  Thanks to `Joakim Hellsén`_ for reporting and helping debug this issue.
  (:issue:`335`)
* Make it possible to re-run the :mod:`~reader.plugins.mark_as_read` plugin
  for existing entries.
  Thanks to `Michael Han`_ for the pull request.
@@ -19,6 +25,7 @@ Unreleased
  ``--new-only`` remains available as an alias.
  (:issue:`334`)

.. _Joakim Hellsén: https://github.com/TheLovinator1
.. _Michael Han: https://github.com/Sakunam


14 changes: 11 additions & 3 deletions docs/dev.rst
@@ -713,10 +713,18 @@ OPML support
Thoughts on dynamic lists of feeds: :issue:`165#issuecomment-905893420`.


-entry_dedupe
-~~~~~~~~~~~~
+Duplicate entries
+~~~~~~~~~~~~~~~~~

Duplicate entries are mainly handled by the :mod:`reader.entry_dedupe` plugin.

-* Using MinHash to speed up similarity checks (maybe): https://gist.github.com/lemon24/b9af5ade919713406bda9603847d32e5
+Using MinHash to speed up similarity checks (maybe): https://gist.github.com/lemon24/b9af5ade919713406bda9603847d32e5

However, it is also possible for a feed to have two entries with the same id
– yes, even though in most (if not all) formats,
the id is meant to be *universally unique*!
As of 3.13, we do not support multiple entries with the same id
(the first entry wins); see :issue:`335` for thoughts on this.


REST API
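As a sketch of the first-entry-wins rule described in the docs change above (a hypothetical helper using plain dicts, not reader's actual code or Entry type):

```python
def dedupe_first_wins(entries):
    """Keep only the first entry for each id, dropping later duplicates."""
    seen = set()
    result = []
    for entry in entries:
        if entry["id"] in seen:
            continue  # a later duplicate loses to the first occurrence
        seen.add(entry["id"])
        result.append(entry)
    return result

entries = [
    {"id": "1, 1", "title": "first"},
    {"id": "1, 2", "title": "other"},
    {"id": "1, 1", "title": "duplicate"},
]
print([e["title"] for e in dedupe_first_wins(entries)])  # ['first', 'other']
```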
12 changes: 10 additions & 2 deletions src/reader/_storage/_entries.py
@@ -209,13 +209,23 @@ def _add_or_update_entries(self, intents: Iterable[EntryUpdateIntent]) -> None:
    with self.get_db() as db:
        try:
            for intent in intents:
                # we cannot rely on the EntryForUpdate the updater got,
                # the entry may have been added by a different update since then
                new = not list(
                    db.execute(
                        "SELECT 1 FROM entries WHERE (feed, id) = (?, ?)",
                        intent.entry.resource_id,
                    )
                )
                # if the entry is new, it must have been new for the updater too;
                # there's still a race condition when the entry was not new
                # to the updater, but was deleted before we got here;
                # this can be fixed by not making first_updated{_epoch} modal
                # (return on EntryForUpdate, make required on intent)
                # TODO: round-trip first_updated{_epoch} after #332 is merged
                if new:
                    assert intent.first_updated is not None, intent
                    assert intent.first_updated_epoch is not None, intent
                    self._insert_entry(db, intent)
                else:
                    self._update_entry(db, intent)
@@ -272,8 +282,6 @@ def _insert_entry(self, db: sqlite3.Connection, intent: EntryUpdateIntent) -> None:
    db.execute(query, entry_update_intent_to_dict(intent))

def _update_entry(self, db: sqlite3.Connection, intent: EntryUpdateIntent) -> None:
-    assert intent.first_updated is None, intent.first_updated
-    assert intent.first_updated_epoch is None, intent.first_updated_epoch
    query = """
        UPDATE entries
        SET
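The check-then-write logic in `_add_or_update_entries` above can be sketched against a toy schema (a minimal sketch; the table layout and helper name here are made up, not reader's actual storage):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE entries (feed TEXT, id TEXT, title TEXT, PRIMARY KEY (feed, id))"
)

def add_or_update_entry(db, feed, id, title):
    # re-check existence at write time; the entry may have been
    # added by a different update since the updater last looked
    new = not list(
        db.execute("SELECT 1 FROM entries WHERE (feed, id) = (?, ?)", (feed, id))
    )
    if new:
        db.execute("INSERT INTO entries VALUES (?, ?, ?)", (feed, id, title))
    else:
        db.execute(
            "UPDATE entries SET title = ? WHERE (feed, id) = (?, ?)", (title, feed, id)
        )

add_or_update_entry(db, "feed", "entry", "one")
add_or_update_entry(db, "feed", "entry", "two")  # updates instead of failing
assert db.execute("SELECT title FROM entries").fetchall() == [("two",)]
```

Without the re-check, a second writer that also believed the entry was new would hit the primary key constraint instead of falling back to an update, which is roughly the failure mode the changelog entry describes.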
6 changes: 5 additions & 1 deletion tests/test_reader_update.py
@@ -43,7 +43,8 @@ def parse(url, result):


@pytest.mark.slow
-def test_concurrent_update(monkeypatch, db_path, make_reader):
+@pytest.mark.parametrize('new', [True, False])
+def test_concurrent_update(monkeypatch, db_path, make_reader, new):
    """If a feed is updated in parallel, the last writer wins.
    This is the temporal equivalent of test_duplicate_ids().
@@ -57,6 +58,9 @@ def test_concurrent_update(monkeypatch, db_path, make_reader):
    reader._parser = parser = Parser()

    reader.add_feed(parser.feed(1))
    if not new:
        entry = parser.entry(1, 1, title='zero')
        reader.update_feeds()
    entry = parser.entry(1, 1, title='one')

    in_make_intents = threading.Event()
