-
-
Notifications
You must be signed in to change notification settings - Fork 73
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fixes to download articles thumbnails #1794
Conversation
Codecov ReportPatch coverage:
Additional details and impacted files@@ Coverage Diff @@
## main #1794 +/- ##
==========================================
+ Coverage 70.42% 70.83% +0.40%
==========================================
Files 23 23
Lines 2597 2582 -15
Branches 594 591 -3
==========================================
Hits 1829 1829
+ Misses 661 648 -13
+ Partials 107 105 -2
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. ☔ View full report at Codecov. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Always put a PR description with:
- explanation about the problem (no code details)
- explanation about solution (no code details)
- impacts (for users, devs)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The function handles many different details. The thumbnail is only one of them. I wonder why we need to handle the thumbnail in a special manner?!
In 2019 was added code with filtering of the details field in the response. After filtering the details contains only 'subCategories', 'revisions', and 'redirects' fields and in some cases can be also 'thumbnail' and others. I returned the 'thumbnail' option to the details response. I made fixes with minimum impact to not touch other functionality. Maybe right now we also have a problem with other fields which were removed from the response: 'coordinates' and 'categories'. |
5d5ec92
to
03c29dc
Compare
@pavel-karatsiuba Thank you for the explanation. Actually this is not the first time we have a problem around this (not all options been "continued"). This is a serious problem because the impact is really subtil and we don't see it immediatly. Please work on this PR so all details requested are automatically continued and that we won't create a new problem by adding or removing or renaming one of this detail. I'm also concerned by the memory usage of the retrieving of these image URL (representing an article) because:
Therefore I believe we should isolate the retrieval of these image URL in a dedicated method (working similarly than the one you have modified) but which would:
|
@kelson42 |
I didn't say this is a problem with this PR, I said there is a concern about the memory usage regarding how we proceed. I don't have talked anytime about execution time/speed. Think about it like this. on WPEN, currently, we are retrieving a tile image URL for almost 7.000.000 articles (potentially), but in best case we use only 100 of them.... and actually most probably none of them! One URL is 100 bytes in average, so make 80x7.000.000 bytes=560MB. This is my concern.
Yes,
I don't have written about Thumbnails, I have written about "image URL".
That was clear. Please reconsider the problem and the solution "Therefore I believe we should isolate the retrieval of these image URL in a dedicated method" |
URLs for images which you can see on the welcome page are built from thumbnails URL. So we don't have additional data for images in details object. Also, we have strict restrictions about retrieving thumbnail data with details. We are requesting thumbnail data only for 100 articles. So in response, we will get additional fields with thumbnail data only for 100 articles. I think this PR is fixing the bug from the ticket. I propose to create a new ticket if we need to make something else. |
… retrieves data from API with continuation.
d5f7a24
to
f70e9b1
Compare
You can see random article thumbnails on the main page of ZIM with the top 1 million Wikipedia pages. But these articles should be sorted in order from the most popular.
The problem in the process of retrieving article details. In most cases, mwoffliner cant get articles details from API with one request. Wikimedia API provides a mechanism with which we need to send multiple requests to retrieve all articles details.
Mwoffliner takes thumbnails only for articles that are retrieved with one request. But for articles that retrieve with multiple requests, the mwoffliner does not take thumbnails.
The mwoffliner has a piece of code that allows us to retrieve thumbnails of the articles with multiple requests, but this code was not used for a while.
With this PR I am turning on a mechanism to take thumbnails for articles that are retrieved with multiple requests.
These changes can't be affected much by the performance because it does not generate additional API requests.
fixes: #1786