Batch wikilinks redis requests on article parsing #1719

Merged
merged 1 commit into from
Jan 9, 2023

Conversation

uriesk
Collaborator

@uriesk uriesk commented Jan 7, 2023

Instead of checking each wikilink one by one to see whether it is going to be fetched, batch all links of an article into a single request.
Use HEXISTS instead of HGET to check whether a field exists.
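
The idea can be sketched roughly as follows. This is only an illustration, not the code from this PR: `HashClient`, `hexistsMany`, `FakeHashClient` and `checkMirrored` are hypothetical names, and the in-memory fake stands in for a real Redis client that would send the HEXISTS commands as one pipelined batch. HEXISTS replies with 1 or 0, so no stored value has to be transferred, whereas HGET returns the value itself.

```typescript
// Hypothetical stand-in for a Redis client; this is NOT mwoffliner's RedisKvs API.
interface HashClient {
  // Assumed to send one HEXISTS per field in a single round trip
  // and to resolve to 1 (field exists) or 0 (it does not) per field.
  hexistsMany(key: string, fields: string[]): Promise<number[]>;
}

// In-memory fake so the sketch runs without a Redis server.
class FakeHashClient implements HashClient {
  roundTrips = 0;
  private hashes: { [key: string]: { [field: string]: string } };

  constructor(hashes: { [key: string]: { [field: string]: string } }) {
    this.hashes = hashes;
  }

  hexistsMany(key: string, fields: string[]): Promise<number[]> {
    this.roundTrips += 1; // one call stands for one batched round trip
    const hash = this.hashes[key] || {};
    return Promise.resolve(
      fields.map((f) => (Object.prototype.hasOwnProperty.call(hash, f) ? 1 : 0)),
    );
  }
}

// Checks all wikilink targets of one article in a single batch
// instead of issuing one request per link.
async function checkMirrored(
  client: HashClient,
  key: string,
  titles: string[],
): Promise<{ [title: string]: boolean }> {
  // Deduplicate so that each title is only sent once.
  const unique = titles.filter((t, i) => titles.indexOf(t) === i);
  const result: { [title: string]: boolean } = {};
  if (unique.length === 0) {
    return result; // nothing to ask, so no round trip at all
  }
  const flags = await client.hexistsMany(key, unique);
  unique.forEach((t, i) => {
    result[t] = flags[i] === 1;
  });
  return result;
}
```

With this shape, an article containing many links costs one round trip regardless of the number of links, and duplicate links within the article are only checked once.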

I ran some basic benchmarks on smaller wikis, and the article scraping phase is about 5% faster. I think only the zimfarm on larger wikis can tell us the actual impact.

closes #1718

@uriesk uriesk marked this pull request as draft January 7, 2023 21:14
@uriesk uriesk self-assigned this Jan 7, 2023
@codecov

codecov bot commented Jan 7, 2023

Codecov Report

Base: 68.96% // Head: 69.52% // Increases project coverage by +0.56% 🎉

Coverage data is based on head (58d5a26) compared to base (e10d600).
Patch coverage: 78.94% of modified lines in pull request are covered.

❗ Current head 58d5a26 differs from pull request most recent head eaac7d1. Consider uploading reports for the commit eaac7d1 to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1719      +/-   ##
==========================================
+ Coverage   68.96%   69.52%   +0.56%     
==========================================
  Files          26       26              
  Lines        2436     2468      +32     
  Branches      477      483       +6     
==========================================
+ Hits         1680     1716      +36     
+ Misses        591      585       -6     
- Partials      165      167       +2     
Impacted Files              Coverage Δ
src/util/categories.ts       5.03% <0.00%>   (ø)
src/util/RedisKvs.ts        77.46% <76.47%>  (+3.55%) ⬆️
src/util/rewriteUrls.ts     80.40% <80.00%>  (+5.84%) ⬆️
src/util/saveArticles.ts    82.92% <100.00%> (-0.21%) ⬇️


@kelson42
Collaborator

kelson42 commented Jan 7, 2023

@uriesk 5% is quite good actually. Don't forget to remove the draft status when you are ready.

@uriesk uriesk force-pushed the pipelining branch 2 times, most recently from 58d5a26 to 65509ad Compare January 7, 2023 21:46
@uriesk
Collaborator Author

uriesk commented Jan 7, 2023

@kelson42 I was hoping for more. It's hard to benchmark because the bandwidth fluctuates by more than those 5%.

I have a question:
When a redirect is found, it links to the original; it was the same in the old code.
new:
https://github.com/uriesk/mwoffliner/blob/pipelining/src/util/rewriteUrls.ts#L247
old:
https://github.com/openzim/mwoffliner/blob/main/src/util/rewriteUrls.ts#L24
This is deliberate, right?

Edit:
We can open a separate issue in the future if that behavior isn't right; I am keeping it as it was before.

@uriesk uriesk force-pushed the pipelining branch 2 times, most recently from af568c6 to 922f20c Compare January 7, 2023 22:53
@uriesk
Collaborator Author

uriesk commented Jan 7, 2023

This looks good now.
I hope we can soon try a larger wiki on the zimfarm with these changes and see whether things improved :)

@uriesk uriesk marked this pull request as ready for review January 7, 2023 23:09
Collaborator

@kelson42 kelson42 left a comment


I have made a superficial review. It seems to work, but I see no speedup at all. That said, the principle is better. I think, and this would be something additional, that each function would benefit from a short comment explaining what it is for, in particular rewriteUrlNoArticleCheck().

Quite a bit of code has been moved or changed; I just hope that nothing fundamental in the link handling has changed, and that only the architecture has.

If it is OK for @pavel-karatsiuba, I would merge it and give it a try, but I don't really expect much impact anyway.

@uriesk
Collaborator Author

uriesk commented Jan 8, 2023

Thought so.
I am running it in an environment with limited CPU speed but good bandwidth, and I have to time the article downloading phase specifically to see even a tiny improvement.

That being said, sending fewer requests to Redis is always a good idea, and it gets one potential issue out of the way.

rewriteUrlNoArticleCheck() is the same as rewriteURL was before, just with some ifs flattened, mediaDependencies added to the arguments, and the wikilink parsing at the end moved out.
I can write more elaborate comments when the switch to eslint happens. A lot of refactoring is likely to happen then, given how touchy eslint can be.

@kelson42 kelson42 changed the title batch wikilinks redis requests on article parsing Batch wikilinks redis requests on article parsing Jan 8, 2023
@kelson42
Collaborator

kelson42 commented Jan 8, 2023

@pavel-karatsiuba is working on #1699, which should be ready very soon.

@kelson42 kelson42 merged commit 85763ea into openzim:main Jan 9, 2023
@uriesk uriesk deleted the pipelining branch January 10, 2023 09:34
Development

Successfully merging this pull request may close these issues.

Batch isMirrored redis requests on article parsing
3 participants