-
-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Introduce normalize and normalization schema. #130
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mainly one "stupid" question.
I cannot really review the rest, it is too hard for me to follow where we are going. At least I don't see things which looks weird to me.
|
366fc13
to
7f1de56
Compare
We now use the And we need to update https://github.com/openzim/python-scraperlib/ as it is through this project that we get |
As we said last monday, we should probably release all this stuff so that the CI and us can test this correctly. And again, we should not merge this to |
@mgautierfr Please create a |
7f1de56
to
cdba15c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you ; given we're very close to a libzim release, I didn't go through the trouble of building a special pylibzim to run the tests.
Only a few minor comments on my side 👍
cdba15c
to
4654a55
Compare
Two "fixup" commits added in the git history. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All good now 👍
Can we rely on new’y released pythin-libzim 3.4.0 and merge? |
No it depends on scraperlib |
@rgaudin Then openzim/python-scraperlib#116 is the blocker |
|
The CI still fails! It's not a problem with latest (python-)libzim. Probably nothing really bad, but we have to fix in priority to move forward. |
Just to double-check: has it really been decided that Headers should never be stored, given that they provide instructions to the browser about how to deal with Responses (see kiwix/overview#95 (comment))? In current warc2zim-generated archives, every Response is constructed from two lookups: a first-pass lookup to get the Response Header, and a second-pass lookup to get the Response Body. Only when both lookups have been done, does the backend have enough information to send a complete Response (mirroring that which was originally recorded) back to the browser. I fear (I hope the fear is unfounded, but I have some doubts) that if the functionality is removed, we're going to have trouble with replaying highly dynamic sites (as you know, it's pretty fragile as it is). |
We will implement proper headers storage and mgmt only if necessary to provide nice replay feature. So far, it seems this is not necessary. But, it might be implemented in the future. |
Codecov ReportAttention:
Additional details and impacted files@@ Coverage Diff @@
## warc2zim2 #130 +/- ##
=============================================
- Coverage 91.85% 86.23% -5.63%
=============================================
Files 5 5
Lines 442 414 -28
Branches 70 65 -5
=============================================
- Hits 406 357 -49
- Misses 27 46 +19
- Partials 9 11 +2 ☔ View full report in Codecov by Sentry. |
Properly define how we store entries in zim file. We introduce `normalize` helper function class in place of `canonicalize`. We work on normalization on converter level. So we path the path to the items instead of letting them call `normalize`. Converted `test/data/video-vimeo.warc.gz` to zim was containing : ``` A/404.html A/f.vimeocdn.com/js_opt/modules/utils/vuid.min.js A/f.vimeocdn.com/p/3.45.3/css/player.css A/f.vimeocdn.com/p/3.45.3/js/player.js A/i.vimeocdn.com/player/354746.png?mw=200&mh=200 A/i.vimeocdn.com/video/797382244-0106ae13e902e09d0f02d8f404fa80581f38d1b8b7846b3f8e87ef391ffb8c99-d.jpg?mw=80&q=85 A/i.vimeocdn.com/video/797382244-0106ae13e902e09d0f02d8f404fa80581f38d1b8b7846b3f8e87ef391ffb8c99-d?mw=1280&mh=720&q=70 A/index.html A/load.js A/oembed.link/favicon.ico A/oembed.link/https://vimeo.com/347119375 A/player.vimeo.com/video/347119375?h=1699409fe2&app_id=122963 A/sw.js A/topFrame.html A/vod-progressive.akamaized.net/exp=1635528595~acl=%2Fvimeo-prod-skyfire-std-us%2F01%2F4423%2F13%2F347119375%2F1398505169.mp4~hmac=27c31f1990aab5e5429f7f7db5b2dcbcf8d2f5c92184d53102da36920d33d53e/vimeo-prod-skyfire-std-us/01/4423/13/347119375/1398505169.mp4 H/f.vimeocdn.com/js_opt/modules/utils/vuid.min.js H/f.vimeocdn.com/p/3.45.3/css/player.css H/f.vimeocdn.com/p/3.45.3/js/player.js H/i.vimeocdn.com/player/354746.png?mw=200&mh=200 H/i.vimeocdn.com/video/797382244-0106ae13e902e09d0f02d8f404fa80581f38d1b8b7846b3f8e87ef391ffb8c99-d.jpg?mw=80&q=85 H/i.vimeocdn.com/video/797382244-0106ae13e902e09d0f02d8f404fa80581f38d1b8b7846b3f8e87ef391ffb8c99-d?mw=1280&mh=720&q=70 H/oembed.link/favicon.ico H/oembed.link/https://vimeo.com/347119375 H/player.vimeo.com/video/347119375?h=1699409fe2&app_id=122963 H/vimeo-cdn.fuzzy.replayweb.page/01/4423/13/347119375/1398505169.mp4 H/vimeo.fuzzy.replayweb.page/video/347119375 H/vod-progressive.akamaized.net/exp=1635528595~acl=%2Fvimeo-prod-skyfire-std-us%2F01%2F4423%2F13%2F347119375%2F1398505169.mp4~hmac=27c31f1990aab5e5429f7f7db5b2dcbcf8d2f5c92184d53102da36920d33d53e/vimeo-prod-skyfire-std-us/01/4423/13/347119375/1398505169.mp4 ``` With this change it contains: ``` A/404.html A/index.html A/load.js A/sw.js A/topFrame.html H/f.vimeocdn.com/js_opt/modules/utils/vuid.min.js H/f.vimeocdn.com/p/3.45.3/css/player.css H/f.vimeocdn.com/p/3.45.3/js/player.js H/i.vimeocdn.com/player/354746.png?mw=200&mh=200 H/i.vimeocdn.com/video/797382244-0106ae13e902e09d0f02d8f404fa80581f38d1b8b7846b3f8e87ef391ffb8c99-d.jpg?mw=80&q=85 H/i.vimeocdn.com/video/797382244-0106ae13e902e09d0f02d8f404fa80581f38d1b8b7846b3f8e87ef391ffb8c99-d?mw=1280&mh=720&q=70 H/oembed.link/favicon.ico H/oembed.link/https://vimeo.com/347119375 H/player.vimeo.com/video/347119375?h=1699409fe2&app_id=122963 H/vod-progressive.akamaized.net/exp=1635528595~acl=%2Fvimeo-prod-skyfire-std-us%2F01%2F4423%2F13%2F347119375%2F1398505169.mp4~hmac=27c31f1990aab5e5429f7f7db5b2dcbcf8d2f5c92184d53102da36920d33d53e/vimeo-prod-skyfire-std-us/01/4423/13/347119375/1398505169.mp4 f.vimeocdn.com/js_opt/modules/utils/vuid.min.js f.vimeocdn.com/p/3.45.3/css/player.css f.vimeocdn.com/p/3.45.3/js/player.js i.vimeocdn.com/player/354746.png?mw=200&mh=200 i.vimeocdn.com/video/797382244-0106ae13e902e09d0f02d8f404fa80581f38d1b8b7846b3f8e87ef391ffb8c99-d.jpg?mw=80&q=85 i.vimeocdn.com/video/797382244-0106ae13e902e09d0f02d8f404fa80581f38d1b8b7846b3f8e87ef391ffb8c99-d?mw=1280&mh=720&q=70 oembed.link/favicon.ico oembed.link/https://vimeo.com/347119375 player.vimeo.com/video/347119375?h=1699409fe2&app_id=122963 vimeo-cdn.fuzzy.replayweb.page/01/4423/13/347119375/1398505169.mp4 vimeo.fuzzy.replayweb.page/video/347119375 vod-progressive.akamaized.net/exp=1635528595~acl=%2Fvimeo-prod-skyfire-std-us%2F01%2F4423%2F13%2F347119375%2F1398505169.mp4~hmac=27c31f1990aab5e5429f7f7db5b2dcbcf8d2f5c92184d53102da36920d33d53e/vimeo-prod-skyfire-std-us/01/4423/13/347119375/1398505169.mp4 ```
Before, we were storing a entry using its full path and potentially create a redirect entry (using reduced path) pointing to the full path entry. Now, path reduction is part of normalization and so we directly store entries using their (potentially) reduced path. Converted `test/data/video-vimeo.warc.gz` goes from : ``` A/404.html A/index.html A/load.js A/sw.js A/topFrame.html H/f.vimeocdn.com/js_opt/modules/utils/vuid.min.js H/f.vimeocdn.com/p/3.45.3/css/player.css H/f.vimeocdn.com/p/3.45.3/js/player.js H/i.vimeocdn.com/player/354746.png?mw=200&mh=200 H/i.vimeocdn.com/video/797382244-0106ae13e902e09d0f02d8f404fa80581f38d1b8b7846b3f8e87ef391ffb8c99-d.jpg?mw=80&q=85 H/i.vimeocdn.com/video/797382244-0106ae13e902e09d0f02d8f404fa80581f38d1b8b7846b3f8e87ef391ffb8c99-d?mw=1280&mh=720&q=70 H/oembed.link/favicon.ico H/oembed.link/https://vimeo.com/347119375 H/player.vimeo.com/video/347119375?h=1699409fe2&app_id=122963 H/vod-progressive.akamaized.net/exp=1635528595~acl=%2Fvimeo-prod-skyfire-std-us%2F01%2F4423%2F13%2F347119375%2F1398505169.mp4~hmac=27c31f1990aab5e5429f7f7db5b2dcbcf8d2f5c92184d53102da36920d33d53e/vimeo-prod-skyfire-std-us/01/4423/13/347119375/1398505169.mp4 f.vimeocdn.com/js_opt/modules/utils/vuid.min.js f.vimeocdn.com/p/3.45.3/css/player.css f.vimeocdn.com/p/3.45.3/js/player.js i.vimeocdn.com/player/354746.png?mw=200&mh=200 i.vimeocdn.com/video/797382244-0106ae13e902e09d0f02d8f404fa80581f38d1b8b7846b3f8e87ef391ffb8c99-d.jpg?mw=80&q=85 i.vimeocdn.com/video/797382244-0106ae13e902e09d0f02d8f404fa80581f38d1b8b7846b3f8e87ef391ffb8c99-d?mw=1280&mh=720&q=70 oembed.link/favicon.ico oembed.link/https://vimeo.com/347119375 player.vimeo.com/video/347119375?h=1699409fe2&app_id=122963 vimeo-cdn.fuzzy.replayweb.page/01/4423/13/347119375/1398505169.mp4 vimeo.fuzzy.replayweb.page/video/347119375 vod-progressive.akamaized.net/exp=1635528595~acl=%2Fvimeo-prod-skyfire-std-us%2F01%2F4423%2F13%2F347119375%2F1398505169.mp4~hmac=27c31f1990aab5e5429f7f7db5b2dcbcf8d2f5c92184d53102da36920d33d53e/vimeo-prod-skyfire-std-us/01/4423/13/347119375/1398505169.mp4 ``` to : ``` A/404.html A/index.html A/load.js A/sw.js A/topFrame.html H/f.vimeocdn.com/js_opt/modules/utils/vuid.min.js H/f.vimeocdn.com/p/3.45.3/css/player.css H/f.vimeocdn.com/p/3.45.3/js/player.js H/i.vimeocdn.com/player/354746.png?mw=200&mh=200 H/i.vimeocdn.com/video/797382244-0106ae13e902e09d0f02d8f404fa80581f38d1b8b7846b3f8e87ef391ffb8c99-d.jpg?mw=80&q=85 H/i.vimeocdn.com/video/797382244-0106ae13e902e09d0f02d8f404fa80581f38d1b8b7846b3f8e87ef391ffb8c99-d?mw=1280&mh=720&q=70 H/oembed.link/favicon.ico H/oembed.link/https://vimeo.com/347119375 H/player.vimeo.com/video/347119375?h=1699409fe2&app_id=122963 H/vimeo-cdn.fuzzy.replayweb.page/01/4423/13/347119375/1398505169.mp4 f.vimeocdn.com/js_opt/modules/utils/vuid.min.js f.vimeocdn.com/p/3.45.3/css/player.css f.vimeocdn.com/p/3.45.3/js/player.js i.vimeocdn.com/player/354746.png?mw=200&mh=200 i.vimeocdn.com/video/797382244-0106ae13e902e09d0f02d8f404fa80581f38d1b8b7846b3f8e87ef391ffb8c99-d.jpg?mw=80&q=85 i.vimeocdn.com/video/797382244-0106ae13e902e09d0f02d8f404fa80581f38d1b8b7846b3f8e87ef391ffb8c99-d?mw=1280&mh=720&q=70 oembed.link/favicon.ico oembed.link/https://vimeo.com/347119375 player.vimeo.com/video/347119375?h=1699409fe2&app_id=122963 vimeo-cdn.fuzzy.replayweb.page/01/4423/13/347119375/1398505169.mp4 vimeo.fuzzy.replayweb.page/video/347119375 ``` Notice that `vod-progressive.akamaized.net` is not present. It is "replaced" by `vimeo-cdn.fuzzy.replayweb.page/01/4423/13/347119375/1398505169.mp4` which is now a plain entry instead of a redirect to `vod-progressive.akamaized.net[...]`.
We don't use it and we agree to not store them (at least for now). If we need them, we will see how to readd them. Converted `test/data/video-vimeo.warc.gz` goes from : ``` A/404.html A/index.html A/load.js A/sw.js A/topFrame.html H/f.vimeocdn.com/js_opt/modules/utils/vuid.min.js H/f.vimeocdn.com/p/3.45.3/css/player.css H/f.vimeocdn.com/p/3.45.3/js/player.js H/i.vimeocdn.com/player/354746.png?mw=200&mh=200 H/i.vimeocdn.com/video/797382244-0106ae13e902e09d0f02d8f404fa80581f38d1b8b7846b3f8e87ef391ffb8c99-d.jpg?mw=80&q=85 H/i.vimeocdn.com/video/797382244-0106ae13e902e09d0f02d8f404fa80581f38d1b8b7846b3f8e87ef391ffb8c99-d?mw=1280&mh=720&q=70 H/oembed.link/favicon.ico H/oembed.link/https://vimeo.com/347119375 H/player.vimeo.com/video/347119375?h=1699409fe2&app_id=122963 H/vimeo-cdn.fuzzy.replayweb.page/01/4423/13/347119375/1398505169.mp4 f.vimeocdn.com/js_opt/modules/utils/vuid.min.js f.vimeocdn.com/p/3.45.3/css/player.css f.vimeocdn.com/p/3.45.3/js/player.js i.vimeocdn.com/player/354746.png?mw=200&mh=200 i.vimeocdn.com/video/797382244-0106ae13e902e09d0f02d8f404fa80581f38d1b8b7846b3f8e87ef391ffb8c99-d.jpg?mw=80&q=85 i.vimeocdn.com/video/797382244-0106ae13e902e09d0f02d8f404fa80581f38d1b8b7846b3f8e87ef391ffb8c99-d?mw=1280&mh=720&q=70 oembed.link/favicon.ico oembed.link/https://vimeo.com/347119375 player.vimeo.com/video/347119375?h=1699409fe2&app_id=122963 vimeo-cdn.fuzzy.replayweb.page/01/4423/13/347119375/1398505169.mp4 vimeo.fuzzy.replayweb.page/video/347119375 ``` to: ``` A/404.html A/index.html A/load.js A/sw.js A/topFrame.html f.vimeocdn.com/js_opt/modules/utils/vuid.min.js f.vimeocdn.com/p/3.45.3/css/player.css f.vimeocdn.com/p/3.45.3/js/player.js i.vimeocdn.com/player/354746.png?mw=200&mh=200 i.vimeocdn.com/video/797382244-0106ae13e902e09d0f02d8f404fa80581f38d1b8b7846b3f8e87ef391ffb8c99-d.jpg?mw=80&q=85 i.vimeocdn.com/video/797382244-0106ae13e902e09d0f02d8f404fa80581f38d1b8b7846b3f8e87ef391ffb8c99-d?mw=1280&mh=720&q=70 oembed.link/favicon.ico oembed.link/https://vimeo.com/347119375 player.vimeo.com/video/347119375?h=1699409fe2&app_id=122963 vimeo-cdn.fuzzy.replayweb.page/01/4423/13/347119375/1398505169.mp4 vimeo.fuzzy.replayweb.page/video/347119375 ```
We don't need `WARCHeadersItem` anymore.
Remove other unnecessary files.
We don't have anything now in `A/` or `H/` subdirs. Remove the left over `A/` in test urls (was working thanks to libzim's compatibility layer)
We don't have files to add (and so, no directory).
Assignement expression[*] is new in python 3.8 And python 3.7 is already end of life. Also add testing on 3.12. [*]https://docs.python.org/3/whatsnew/3.8.html#assignment-expressions
d1a7d63
to
0bfddc9
Compare
Rebased fixup. Ready to merge. |
This is a first PR with "real" work on warc2zim2.
sw.js
itself,index.html
, ...)This PR should works with website which don't need any content or url rewriting. Meaning: