Clearing Redis media storage after each dump #1826
Codecov Report
Patch coverage:

@@            Coverage Diff             @@
##              main    #1826     +/-   ##
==========================================
+ Coverage    70.91%   71.40%   +0.49%
==========================================
  Files           24       24
  Lines         2623     2623
  Branches       595      594        -1
==========================================
+ Hits          1860     1873       +13
+ Misses         657      644       -13
  Partials       106      106

☔ View full report in Codecov by Sentry.
@pavel-karatsiuba Please explain clearly the nature of the problem, the nature of the solution, and its consequences. Why do you need a new Redis store?
All file paths are stored in the Redis store before downloading; the script then reads the paths from Redis one by one and downloads the files. If we use two '--format' parameters, the media files are first filtered with the first '--format' parameter, and the paths of the filtered media are stored in Redis. Before processing the next '--format' parameter we do not clear the Redis storage, so all media paths from the first iteration are also used in the second iteration. For example, if a page contains pdf and jpg files and we run the script with both '--format' parameters, the pdf paths from the first iteration are reused during the jpg iteration. My fix is to clear the Redis store after each dump. I also used an additional Redis store to separate media files from other files such as CSS, so my solution removes only the paths to media files and does not remove the paths to CSS.
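The leak described above can be sketched with a toy in-memory store. The names below (`KVStore`, `dumpFormat`) are illustrative only, not mwoffliner's actual API, and a plain `Map` stands in for the real Redis client:

```typescript
// Toy stand-in for a Redis hash mapping URL -> on-disk path.
// Illustrative only; mwoffliner's real store wraps a Redis client.
class KVStore {
  private entries = new Map<string, string>();
  set(url: string, path: string) { this.entries.set(url, path); }
  keys(): string[] { return [...this.entries.keys()]; }
  clear() { this.entries.clear(); }
}

const mediaStore = new KVStore();

// Hypothetical per-format dump step: filter the page's files by format,
// queue their paths, then "download" everything currently queued.
function dumpFormat(format: string, pageFiles: string[]): string[] {
  for (const f of pageFiles) {
    if (f.endsWith(format)) mediaStore.set(f, `/dump/${f}`);
  }
  // Bug: without clearing the store between formats, leftovers from
  // the previous iteration are downloaded again here.
  return mediaStore.keys();
}

const page = ['a.pdf', 'b.jpg'];
const firstDump = dumpFormat('pdf', page);   // only the pdf, as expected
const secondDump = dumpFormat('jpg', page);  // stale pdf path leaks into the jpg dump
```

Clearing `mediaStore` at the end of `dumpFormat` (as the fix does after each dump) would make the second call return only the jpg path.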
Why I need a new Redis store:
Force-pushed from d02a1fd to e41062e
@pavel-karatsiuba Thank you for your explanations; the problem is now clear to me and the solution seems appropriate. This is a major bug, and thank you very much for detecting and fixing it. One point is still unclear to me: populating the media happens based on the parsing of the HTML, so if we remove everything, shouldn't the CSS be repopulated as well? I also don't understand why you called somethin…
Still a few points to clarify/fix.
For storing CSS I used the storage with the "Files" prefix; for media files, I used the storage with the "Media" prefix.
@pavel-karatsiuba You haven't answered one of my questions: how do article subtitles impact media scraping? Actually, what are subtitles? What about the JS dependencies? I don't really see why we should have a new Redis storage: either it has to be refreshed for each article (for each '--format'), which is the case for CSS/JS IMO, or it does not, and then it should belong to the details Redis store.
The first thing I did was clear the Redis store containing media files and CSS after finishing the current iteration. On the next iteration the media files were repopulated, but the CSS was not. In the source code I see a condition that prevents this: https://github.com/openzim/mwoffliner/blob/main/src/util/dump.ts#L61 Subtitles are files in…
I understand what you mean and have changed the logic. Now the storage with media files and CSS is cleared after each iteration.
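The revised logic can be sketched as follows. Again the names are illustrative (plain `Map`s standing in for the Redis stores with the "Media" and "Files" prefixes mentioned above), not the actual mwoffliner implementation:

```typescript
// Toy stand-ins for the prefixed Redis stores; illustrative only.
const mediaStore = new Map<string, string>();
const filesStore = new Map<string, string>();

// Hypothetical dump step: parsing the page's HTML repopulates both
// stores on every iteration, so clearing them is safe.
function dump(format: string, pageFiles: string[]): string[] {
  for (const f of pageFiles) {
    if (f.endsWith('.css')) filesStore.set(f, `/dump/${f}`);
    else if (f.endsWith(format)) mediaStore.set(f, `/dump/${f}`);
  }
  const downloaded = [...mediaStore.keys(), ...filesStore.keys()];
  // Revised logic: clear both stores once the dump finishes, so the
  // next '--format' iteration starts from a clean slate.
  mediaStore.clear();
  filesStore.clear();
  return downloaded;
}

const page = ['a.pdf', 'b.jpg', 'style.css'];
const pdfDump = dump('.pdf', page);  // pdf + css, no jpg
const jpgDump = dump('.jpg', page);  // jpg + css, no stale pdf
```

Because both stores are repopulated from the HTML parse, clearing them between iterations no longer loses the CSS, which was the concern raised earlier.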
Now I remember: subtitles do indeed need to be removed after each dump.
A few comments.
Clearing redis media storage after each dump
Force-pushed from bfec274 to 082c542
Force-pushed from 082c542 to a1c0eb7
LGTM
Clearing Redis media storage after each dump
fixes: #1825