Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce normalize and normalization schema. #130

Merged
merged 11 commits into from
Dec 19, 2023
Merged

Conversation

mgautierfr
Copy link
Contributor

This is a first PR with "real" work on warc2zim2.

  • Introduce url normalization (in place of "canonicalize")
  • Don't store headers
  • Remove all service worker stuff (sw.js itself, index.html, ...)

This PR should works with website which don't need any content or url rewriting. Meaning:

  • Local resource only.
  • All links relative
  • No js or "simple" js (following the above rules)
  • No url being reduced (or else, with link pointed already to the reduced url)

Copy link
Collaborator

@benoit74 benoit74 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mainly one "stupid" question.
I cannot really review the rest, it is too hard for me to follow where we are going. At least I don't see things which looks weird to me.

src/warc2zim/converter.py Show resolved Hide resolved
@benoit74
Copy link
Collaborator

benoit74 commented Dec 8, 2023

  • it looks like you did like I mostly always do, you requested the review before checking the CI status ... and it is failing ^^

@mgautierfr mgautierfr force-pushed the url_normalization branch 4 times, most recently from 366fc13 to 7f1de56 Compare December 8, 2023 15:17
@mgautierfr
Copy link
Contributor Author

We now use the addAlias method, which need the (really) recent version of libzim (PR openzim/python-libzim#175 not yet merged)

And we need to update https://github.com/openzim/python-scraperlib/ as it is through this project that we get libzim deps.

@benoit74
Copy link
Collaborator

benoit74 commented Dec 8, 2023

As we said last monday, we should probably release all this stuff so that the CI and us can test this correctly.

And again, we should not merge this to main but to a temporary branch.

@kelson42
Copy link
Contributor

kelson42 commented Dec 10, 2023

@mgautierfr Please create a zimit2 branch to gather all your (post PR) work.

@mgautierfr mgautierfr changed the base branch from main to warc2zim2 December 11, 2023 10:20
Copy link
Member

@rgaudin rgaudin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you ; given we're very close to a libzim release, I didn't go through the trouble of building a special pylibzim to run the tests.

Only a few minor comments on my side 👍

src/warc2zim/converter.py Outdated Show resolved Hide resolved
src/warc2zim/url_rewriting.py Show resolved Hide resolved
src/warc2zim/url_rewriting.py Outdated Show resolved Hide resolved
src/warc2zim/url_rewriting.py Outdated Show resolved Hide resolved
@mgautierfr
Copy link
Contributor Author

Two "fixup" commits added in the git history.

@rgaudin rgaudin self-requested a review December 11, 2023 14:38
Copy link
Member

@rgaudin rgaudin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All good now 👍

@kelson42
Copy link
Contributor

Can we rely on new’y released pythin-libzim 3.4.0 and merge?

@rgaudin
Copy link
Member

rgaudin commented Dec 16, 2023

No it depends on scraperlib

@kelson42
Copy link
Contributor

@rgaudin Then openzim/python-scraperlib#116 is the blocker

@rgaudin
Copy link
Member

rgaudin commented Dec 16, 2023

3.2.0 released

@kelson42
Copy link
Contributor

kelson42 commented Dec 16, 2023

The CI still fails! It's not a problem with latest (python-)libzim. Probably nothing really bad, but we have to fix in priority to move forward.

@Jaifroid
Copy link

Just to double-check: has it really been decided that Headers should never be stored, given that they provide instructions to the browser about how to deal with Responses (see kiwix/overview#95 (comment))? In current warc2zim-generated archives, every Response is constructed from two lookups: a first-pass lookup to get the Response Header, and a second-pass lookup to get the Response Body. Only when both lookups have been done, does the backend have enough information to send a complete Response (mirroring that which was originally recorded) back to the browser. I fear (I hope the fear is unfounded, but I have some doubts) that if the functionality is removed, we're going to have trouble with replaying highly dynamic sites (as you know, it's pretty fragile as it is).

@kelson42
Copy link
Contributor

Just to double-check: has it really been decided that Headers should never be stored

We will implement proper headers storage and mgmt only if necessary to provide nice replay feature. So far, it seems this is not necessary. But, it might be implemented in the future.

Copy link

codecov bot commented Dec 18, 2023

Codecov Report

Attention: 8 lines in your changes are missing coverage. Please review.

Comparison is base (1556717) 91.85% compared to head (d1a7d63) 86.23%.

❗ Current head d1a7d63 differs from pull request most recent head 0bfddc9. Consider uploading reports for the commit 0bfddc9 to get more accurate results

Files Patch % Lines
src/warc2zim/converter.py 81.48% 3 Missing and 2 partials ⚠️
src/warc2zim/url_rewriting.py 90.90% 1 Missing and 1 partial ⚠️
src/warc2zim/items.py 75.00% 1 Missing ⚠️
Additional details and impacted files
@@              Coverage Diff              @@
##           warc2zim2     #130      +/-   ##
=============================================
- Coverage      91.85%   86.23%   -5.63%     
=============================================
  Files              5        5              
  Lines            442      414      -28     
  Branches          70       65       -5     
=============================================
- Hits             406      357      -49     
- Misses            27       46      +19     
- Partials           9       11       +2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Properly define how we store entries in zim file.

We introduce `normalize` helper function class in place of `canonicalize`.

We work on normalization on converter level. So we  path the path to the
items instead of letting them call `normalize`.

Converted `test/data/video-vimeo.warc.gz` to zim was containing :

```
A/404.html
A/f.vimeocdn.com/js_opt/modules/utils/vuid.min.js
A/f.vimeocdn.com/p/3.45.3/css/player.css
A/f.vimeocdn.com/p/3.45.3/js/player.js
A/i.vimeocdn.com/player/354746.png?mw=200&mh=200
A/i.vimeocdn.com/video/797382244-0106ae13e902e09d0f02d8f404fa80581f38d1b8b7846b3f8e87ef391ffb8c99-d.jpg?mw=80&q=85
A/i.vimeocdn.com/video/797382244-0106ae13e902e09d0f02d8f404fa80581f38d1b8b7846b3f8e87ef391ffb8c99-d?mw=1280&mh=720&q=70
A/index.html
A/load.js
A/oembed.link/favicon.ico
A/oembed.link/https://vimeo.com/347119375
A/player.vimeo.com/video/347119375?h=1699409fe2&app_id=122963
A/sw.js
A/topFrame.html
A/vod-progressive.akamaized.net/exp=1635528595~acl=%2Fvimeo-prod-skyfire-std-us%2F01%2F4423%2F13%2F347119375%2F1398505169.mp4~hmac=27c31f1990aab5e5429f7f7db5b2dcbcf8d2f5c92184d53102da36920d33d53e/vimeo-prod-skyfire-std-us/01/4423/13/347119375/1398505169.mp4
H/f.vimeocdn.com/js_opt/modules/utils/vuid.min.js
H/f.vimeocdn.com/p/3.45.3/css/player.css
H/f.vimeocdn.com/p/3.45.3/js/player.js
H/i.vimeocdn.com/player/354746.png?mw=200&mh=200
H/i.vimeocdn.com/video/797382244-0106ae13e902e09d0f02d8f404fa80581f38d1b8b7846b3f8e87ef391ffb8c99-d.jpg?mw=80&q=85
H/i.vimeocdn.com/video/797382244-0106ae13e902e09d0f02d8f404fa80581f38d1b8b7846b3f8e87ef391ffb8c99-d?mw=1280&mh=720&q=70
H/oembed.link/favicon.ico
H/oembed.link/https://vimeo.com/347119375
H/player.vimeo.com/video/347119375?h=1699409fe2&app_id=122963
H/vimeo-cdn.fuzzy.replayweb.page/01/4423/13/347119375/1398505169.mp4
H/vimeo.fuzzy.replayweb.page/video/347119375
H/vod-progressive.akamaized.net/exp=1635528595~acl=%2Fvimeo-prod-skyfire-std-us%2F01%2F4423%2F13%2F347119375%2F1398505169.mp4~hmac=27c31f1990aab5e5429f7f7db5b2dcbcf8d2f5c92184d53102da36920d33d53e/vimeo-prod-skyfire-std-us/01/4423/13/347119375/1398505169.mp4
```

With this change it contains:

```
A/404.html
A/index.html
A/load.js
A/sw.js
A/topFrame.html
H/f.vimeocdn.com/js_opt/modules/utils/vuid.min.js
H/f.vimeocdn.com/p/3.45.3/css/player.css
H/f.vimeocdn.com/p/3.45.3/js/player.js
H/i.vimeocdn.com/player/354746.png?mw=200&mh=200
H/i.vimeocdn.com/video/797382244-0106ae13e902e09d0f02d8f404fa80581f38d1b8b7846b3f8e87ef391ffb8c99-d.jpg?mw=80&q=85
H/i.vimeocdn.com/video/797382244-0106ae13e902e09d0f02d8f404fa80581f38d1b8b7846b3f8e87ef391ffb8c99-d?mw=1280&mh=720&q=70
H/oembed.link/favicon.ico
H/oembed.link/https://vimeo.com/347119375
H/player.vimeo.com/video/347119375?h=1699409fe2&app_id=122963
H/vod-progressive.akamaized.net/exp=1635528595~acl=%2Fvimeo-prod-skyfire-std-us%2F01%2F4423%2F13%2F347119375%2F1398505169.mp4~hmac=27c31f1990aab5e5429f7f7db5b2dcbcf8d2f5c92184d53102da36920d33d53e/vimeo-prod-skyfire-std-us/01/4423/13/347119375/1398505169.mp4
f.vimeocdn.com/js_opt/modules/utils/vuid.min.js
f.vimeocdn.com/p/3.45.3/css/player.css
f.vimeocdn.com/p/3.45.3/js/player.js
i.vimeocdn.com/player/354746.png?mw=200&mh=200
i.vimeocdn.com/video/797382244-0106ae13e902e09d0f02d8f404fa80581f38d1b8b7846b3f8e87ef391ffb8c99-d.jpg?mw=80&q=85
i.vimeocdn.com/video/797382244-0106ae13e902e09d0f02d8f404fa80581f38d1b8b7846b3f8e87ef391ffb8c99-d?mw=1280&mh=720&q=70
oembed.link/favicon.ico
oembed.link/https://vimeo.com/347119375
player.vimeo.com/video/347119375?h=1699409fe2&app_id=122963
vimeo-cdn.fuzzy.replayweb.page/01/4423/13/347119375/1398505169.mp4
vimeo.fuzzy.replayweb.page/video/347119375
vod-progressive.akamaized.net/exp=1635528595~acl=%2Fvimeo-prod-skyfire-std-us%2F01%2F4423%2F13%2F347119375%2F1398505169.mp4~hmac=27c31f1990aab5e5429f7f7db5b2dcbcf8d2f5c92184d53102da36920d33d53e/vimeo-prod-skyfire-std-us/01/4423/13/347119375/1398505169.mp4
```
Before, we were storing a entry using its full path and potentially
create a redirect entry (using reduced path) pointing to the full path
entry.

Now, path reduction is part of normalization and so we directly store
entries using their (potentially) reduced path.

Converted `test/data/video-vimeo.warc.gz` goes from :

```
A/404.html
A/index.html
A/load.js
A/sw.js
A/topFrame.html
H/f.vimeocdn.com/js_opt/modules/utils/vuid.min.js
H/f.vimeocdn.com/p/3.45.3/css/player.css
H/f.vimeocdn.com/p/3.45.3/js/player.js
H/i.vimeocdn.com/player/354746.png?mw=200&mh=200
H/i.vimeocdn.com/video/797382244-0106ae13e902e09d0f02d8f404fa80581f38d1b8b7846b3f8e87ef391ffb8c99-d.jpg?mw=80&q=85
H/i.vimeocdn.com/video/797382244-0106ae13e902e09d0f02d8f404fa80581f38d1b8b7846b3f8e87ef391ffb8c99-d?mw=1280&mh=720&q=70
H/oembed.link/favicon.ico
H/oembed.link/https://vimeo.com/347119375
H/player.vimeo.com/video/347119375?h=1699409fe2&app_id=122963
H/vod-progressive.akamaized.net/exp=1635528595~acl=%2Fvimeo-prod-skyfire-std-us%2F01%2F4423%2F13%2F347119375%2F1398505169.mp4~hmac=27c31f1990aab5e5429f7f7db5b2dcbcf8d2f5c92184d53102da36920d33d53e/vimeo-prod-skyfire-std-us/01/4423/13/347119375/1398505169.mp4
f.vimeocdn.com/js_opt/modules/utils/vuid.min.js
f.vimeocdn.com/p/3.45.3/css/player.css
f.vimeocdn.com/p/3.45.3/js/player.js
i.vimeocdn.com/player/354746.png?mw=200&mh=200
i.vimeocdn.com/video/797382244-0106ae13e902e09d0f02d8f404fa80581f38d1b8b7846b3f8e87ef391ffb8c99-d.jpg?mw=80&q=85
i.vimeocdn.com/video/797382244-0106ae13e902e09d0f02d8f404fa80581f38d1b8b7846b3f8e87ef391ffb8c99-d?mw=1280&mh=720&q=70
oembed.link/favicon.ico
oembed.link/https://vimeo.com/347119375
player.vimeo.com/video/347119375?h=1699409fe2&app_id=122963
vimeo-cdn.fuzzy.replayweb.page/01/4423/13/347119375/1398505169.mp4
vimeo.fuzzy.replayweb.page/video/347119375
vod-progressive.akamaized.net/exp=1635528595~acl=%2Fvimeo-prod-skyfire-std-us%2F01%2F4423%2F13%2F347119375%2F1398505169.mp4~hmac=27c31f1990aab5e5429f7f7db5b2dcbcf8d2f5c92184d53102da36920d33d53e/vimeo-prod-skyfire-std-us/01/4423/13/347119375/1398505169.mp4
```

to :

```
A/404.html
A/index.html
A/load.js
A/sw.js
A/topFrame.html
H/f.vimeocdn.com/js_opt/modules/utils/vuid.min.js
H/f.vimeocdn.com/p/3.45.3/css/player.css
H/f.vimeocdn.com/p/3.45.3/js/player.js
H/i.vimeocdn.com/player/354746.png?mw=200&mh=200
H/i.vimeocdn.com/video/797382244-0106ae13e902e09d0f02d8f404fa80581f38d1b8b7846b3f8e87ef391ffb8c99-d.jpg?mw=80&q=85
H/i.vimeocdn.com/video/797382244-0106ae13e902e09d0f02d8f404fa80581f38d1b8b7846b3f8e87ef391ffb8c99-d?mw=1280&mh=720&q=70
H/oembed.link/favicon.ico
H/oembed.link/https://vimeo.com/347119375
H/player.vimeo.com/video/347119375?h=1699409fe2&app_id=122963
H/vimeo-cdn.fuzzy.replayweb.page/01/4423/13/347119375/1398505169.mp4
f.vimeocdn.com/js_opt/modules/utils/vuid.min.js
f.vimeocdn.com/p/3.45.3/css/player.css
f.vimeocdn.com/p/3.45.3/js/player.js
i.vimeocdn.com/player/354746.png?mw=200&mh=200
i.vimeocdn.com/video/797382244-0106ae13e902e09d0f02d8f404fa80581f38d1b8b7846b3f8e87ef391ffb8c99-d.jpg?mw=80&q=85
i.vimeocdn.com/video/797382244-0106ae13e902e09d0f02d8f404fa80581f38d1b8b7846b3f8e87ef391ffb8c99-d?mw=1280&mh=720&q=70
oembed.link/favicon.ico
oembed.link/https://vimeo.com/347119375
player.vimeo.com/video/347119375?h=1699409fe2&app_id=122963
vimeo-cdn.fuzzy.replayweb.page/01/4423/13/347119375/1398505169.mp4
vimeo.fuzzy.replayweb.page/video/347119375
```

Notice that `vod-progressive.akamaized.net` is not present.
It is "replaced" by
`vimeo-cdn.fuzzy.replayweb.page/01/4423/13/347119375/1398505169.mp4`
which is now a plain entry instead of a redirect to
`vod-progressive.akamaized.net[...]`.
We don't use it and we agree to not store them (at least for now).
If we need them, we will see how to readd them.

Converted `test/data/video-vimeo.warc.gz` goes from :

```
A/404.html
A/index.html
A/load.js
A/sw.js
A/topFrame.html
H/f.vimeocdn.com/js_opt/modules/utils/vuid.min.js
H/f.vimeocdn.com/p/3.45.3/css/player.css
H/f.vimeocdn.com/p/3.45.3/js/player.js
H/i.vimeocdn.com/player/354746.png?mw=200&mh=200
H/i.vimeocdn.com/video/797382244-0106ae13e902e09d0f02d8f404fa80581f38d1b8b7846b3f8e87ef391ffb8c99-d.jpg?mw=80&q=85
H/i.vimeocdn.com/video/797382244-0106ae13e902e09d0f02d8f404fa80581f38d1b8b7846b3f8e87ef391ffb8c99-d?mw=1280&mh=720&q=70
H/oembed.link/favicon.ico
H/oembed.link/https://vimeo.com/347119375
H/player.vimeo.com/video/347119375?h=1699409fe2&app_id=122963
H/vimeo-cdn.fuzzy.replayweb.page/01/4423/13/347119375/1398505169.mp4
f.vimeocdn.com/js_opt/modules/utils/vuid.min.js
f.vimeocdn.com/p/3.45.3/css/player.css
f.vimeocdn.com/p/3.45.3/js/player.js
i.vimeocdn.com/player/354746.png?mw=200&mh=200
i.vimeocdn.com/video/797382244-0106ae13e902e09d0f02d8f404fa80581f38d1b8b7846b3f8e87ef391ffb8c99-d.jpg?mw=80&q=85
i.vimeocdn.com/video/797382244-0106ae13e902e09d0f02d8f404fa80581f38d1b8b7846b3f8e87ef391ffb8c99-d?mw=1280&mh=720&q=70
oembed.link/favicon.ico
oembed.link/https://vimeo.com/347119375
player.vimeo.com/video/347119375?h=1699409fe2&app_id=122963
vimeo-cdn.fuzzy.replayweb.page/01/4423/13/347119375/1398505169.mp4
vimeo.fuzzy.replayweb.page/video/347119375
```

to:

```
A/404.html
A/index.html
A/load.js
A/sw.js
A/topFrame.html
f.vimeocdn.com/js_opt/modules/utils/vuid.min.js
f.vimeocdn.com/p/3.45.3/css/player.css
f.vimeocdn.com/p/3.45.3/js/player.js
i.vimeocdn.com/player/354746.png?mw=200&mh=200
i.vimeocdn.com/video/797382244-0106ae13e902e09d0f02d8f404fa80581f38d1b8b7846b3f8e87ef391ffb8c99-d.jpg?mw=80&q=85
i.vimeocdn.com/video/797382244-0106ae13e902e09d0f02d8f404fa80581f38d1b8b7846b3f8e87ef391ffb8c99-d?mw=1280&mh=720&q=70
oembed.link/favicon.ico
oembed.link/https://vimeo.com/347119375
player.vimeo.com/video/347119375?h=1699409fe2&app_id=122963
vimeo-cdn.fuzzy.replayweb.page/01/4423/13/347119375/1398505169.mp4
vimeo.fuzzy.replayweb.page/video/347119375
```
We don't need `WARCHeadersItem` anymore.
mgautierfr and others added 7 commits December 18, 2023 16:18
Remove other unnecessary files.
We don't have anything now in `A/` or `H/` subdirs.
Remove the left over `A/` in test urls (was working thanks to libzim's
compatibility layer)
We don't have files to add (and so, no directory).
Assignement expression[*] is new in python 3.8
And python 3.7 is already end of life.

Also add testing on 3.12.

[*]https://docs.python.org/3/whatsnew/3.8.html#assignment-expressions
@mgautierfr
Copy link
Contributor Author

Rebased fixup. Ready to merge.

@kelson42 kelson42 merged commit 3f3eb34 into warc2zim2 Dec 19, 2023
10 checks passed
@kelson42 kelson42 deleted the url_normalization branch December 19, 2023 02:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants