Batch wikilinks redis requests on article parsing #1719

uriesk · 2023-01-07T21:13:44Z

Instead of checking one by one whether each wikilink's target is going to be fetched, batch all links of an article together.
Use HEXISTS instead of HGET to check whether a field exists.
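
To illustrate the idea, here is a minimal sketch of the difference between one request per link and one batched request per article. It is not the code from this PR: `HashClient`, `hExistsMany`, `FakeHashClient`, `existsOneByOne`, and `existsBatched` are hypothetical names, and the in-memory client only stands in for Redis so that the round trips can be counted. HEXISTS fits an existence check because it replies with only 1 or 0, whereas HGET returns the stored value.

```typescript
// Hypothetical minimal hash client; these names are NOT mwoffliner's RedisKvs API.
interface HashClient {
  // One HEXISTS command, one round trip.
  hExists(key: string, field: string): Promise<boolean>;
  // Many HEXISTS commands sent together, one round trip.
  hExistsMany(key: string, fields: string[]): Promise<boolean[]>;
}

// In-memory stand-in so the sketch runs without a Redis server.
class FakeHashClient implements HashClient {
  roundTrips = 0;

  constructor(private readonly hashes: Map<string, Set<string>>) {}

  async hExists(key: string, field: string): Promise<boolean> {
    this.roundTrips += 1;
    return this.hashes.get(key)?.has(field) ?? false;
  }

  async hExistsMany(key: string, fields: string[]): Promise<boolean[]> {
    this.roundTrips += 1;
    const hash = this.hashes.get(key);
    return fields.map((field) => hash?.has(field) ?? false);
  }
}

// Old shape: one request per wikilink, duplicates included.
async function existsOneByOne(
  client: HashClient,
  key: string,
  titles: string[],
): Promise<Map<string, boolean>> {
  const result = new Map<string, boolean>();
  for (const title of titles) {
    result.set(title, await client.hExists(key, title));
  }
  return result;
}

// New shape: deduplicate the article's links, then ask once.
async function existsBatched(
  client: HashClient,
  key: string,
  titles: string[],
): Promise<Map<string, boolean>> {
  const unique = Array.from(new Set(titles));
  const flags = await client.hExistsMany(key, unique);
  return new Map(unique.map((title, i): [string, boolean] => [title, flags[i] === true]));
}
```

With a real Redis client, the batched call would typically be sent as a single pipeline or transaction; how the PR does this exactly is not shown in this thread.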

I did some basic benchmarks with smaller wikis, and it is about 5% faster in the article scraping phase. I think only the zimfarm on larger wikis can tell us the actual impact.

closes #1718

codecov · 2023-01-07T21:22:11Z

Codecov Report

Base: 68.96% // Head: 69.52% // Increases project coverage by +0.56% 🎉

Coverage data is based on head (58d5a26) compared to base (e10d600).
Patch coverage: 78.94% of modified lines in pull request are covered.

❗ Current head 58d5a26 differs from pull request most recent head eaac7d1. Consider uploading reports for the commit eaac7d1 to get more accurate results

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1719      +/-   ##
==========================================
+ Coverage   68.96%   69.52%   +0.56%     
==========================================
  Files          26       26              
  Lines        2436     2468      +32     
  Branches      477      483       +6     
==========================================
+ Hits         1680     1716      +36     
+ Misses        591      585       -6     
- Partials      165      167       +2

| Impacted Files | Coverage Δ |
| --- | --- |
| src/util/categories.ts | `5.03% <0.00%> (ø)` |
| src/util/RedisKvs.ts | `77.46% <76.47%> (+3.55%)` ⬆️ |
| src/util/rewriteUrls.ts | `80.40% <80.00%> (+5.84%)` ⬆️ |
| src/util/saveArticles.ts | `82.92% <100.00%> (-0.21%)` ⬇️ |

kelson42 · 2023-01-07T21:31:42Z

@uriesk 5% is actually quite good. Don't forget to remove the draft status when you are ready.

uriesk · 2023-01-07T21:50:25Z

@kelson42 I was hoping for more. It's hard to benchmark because the bandwidth fluctuates by more than those 5%.

I have a question:
When a redirect is found, it links to the original, as it did in the old code.
new:
https://github.com/uriesk/mwoffliner/blob/pipelining/src/util/rewriteUrls.ts#L247
old:
https://github.com/openzim/mwoffliner/blob/main/src/util/rewriteUrls.ts#L24
This is deliberate, right?

Edit:
We can open a separate issue in the future if that behavior isn't right; I am keeping it as it was before.

uriesk · 2023-01-07T23:08:57Z

This looks good now.
I hope we can soon try these changes on a larger wiki with the zimfarm and see whether things have improved :)

kelson42

I have made a superficial review. It seems to work, but I see no speedup at all. That said, the principle is better. As an additional improvement, I think each function would benefit from a short comment explaining what it is for, in particular rewriteUrlNoArticleCheck().

Quite a bit of code has been moved or changed; I just hope that nothing fundamental in the link handling has changed, and that only the architecture has.

If it is OK for @pavel-karatsiuba, I would merge it and give it a try, but I don't really expect much impact anyway.

uriesk · 2023-01-08T14:25:20Z

Thought so.
I am running it in an environment with limited CPU speed but good bandwidth, and I have to time the article downloading phase specifically to see a tiny improvement.

That being said, sending fewer requests to Redis is always a good idea, and it's one potential issue out of the way.

rewriteUrlNoArticleCheck() is the same as rewriteURL was before, with some ifs flattened, mediaDependencies passed in the arguments, and the wikilink parsing at the end moved out.
I can write more elaborate comments when the switch to eslint happens. Given how touchy eslint can be, a lot of refactoring is likely to come with it.
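
The restructuring described in this comment can be pictured roughly as follows. This is only a sketch under assumptions, not the actual contents of rewriteUrls.ts: `Link`, `RewrittenLink`, `ExistsBatch`, `wikilinkTitle`, and `rewriteLinks` are hypothetical names, the `./Title` href convention is assumed, and turning a link to a missing article into plain text is an assumed policy for illustration.

```typescript
interface Link {
  href: string;
  text: string;
}

interface RewrittenLink {
  text: string;
  href: string | null; // null means "render as plain text" (assumed policy)
}

// Hypothetical batched lookup: resolves all titles of one article at once.
type ExistsBatch = (titles: string[]) => Promise<Map<string, boolean>>;

// Assumption: wikilinks look like "./Title"; everything else is left alone.
function wikilinkTitle(href: string): string | null {
  return href.startsWith('./') ? decodeURIComponent(href.slice(2)) : null;
}

async function rewriteLinks(links: Link[], existsBatch: ExistsBatch): Promise<RewrittenLink[]> {
  // Step 1: collect the wikilink titles without touching the store.
  const titles: string[] = [];
  for (const link of links) {
    const title = wikilinkTitle(link.href);
    if (title !== null && titles.indexOf(title) === -1) {
      titles.push(title);
    }
  }

  // Step 2: a single batched existence check for the whole article.
  const exists = titles.length > 0 ? await existsBatch(titles) : new Map<string, boolean>();

  // Step 3: keep non-wikilinks and links to known articles, drop the rest.
  return links.map((link) => {
    const title = wikilinkTitle(link.href);
    const keep = title === null || exists.get(title) === true;
    return { text: link.text, href: keep ? link.href : null };
  });
}
```

The point of the split is that the per-link rewriting no longer needs to query the store itself, so the only store access is the single lookup in step 2.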

kelson42 · 2023-01-08T15:12:34Z

@pavel-karatsiuba is working on #1699, which should be ready very soon.

uriesk marked this pull request as draft January 7, 2023 21:14

uriesk self-assigned this Jan 7, 2023

uriesk force-pushed the pipelining branch 2 times, most recently from 58d5a26 to 65509ad Compare January 7, 2023 21:46

uriesk force-pushed the pipelining branch 2 times, most recently from af568c6 to 922f20c Compare January 7, 2023 22:53

uriesk marked this pull request as ready for review January 7, 2023 23:09

uriesk requested review from kelson42 and pavel-karatsiuba January 7, 2023 23:09

kelson42 approved these changes Jan 8, 2023

kelson42 changed the title ~~batch wikilinks redis requests on article parsing~~ Batch wikilinks redis requests on article parsing Jan 8, 2023

pavel-karatsiuba approved these changes Jan 9, 2023

batch wikilinks redis requests on article parsing together

eaac7d1

kelson42 force-pushed the pipelining branch from 922f20c to eaac7d1 Compare January 9, 2023 21:43

kelson42 merged commit 85763ea into openzim:main Jan 9, 2023

uriesk deleted the pipelining branch January 10, 2023 09:34
