warc2zim, without service worker. #113

mgautierfr · 2023-05-30T12:11:16Z

Introduction

This PR is not really intended to be merged.
This is the result of the Kiwix hackathon, where we succeeded in creating "warc2zim" zim files without a service worker thanks to static rewriting. This PR (and especially this PR comment) describes what has been done, how, and why.

A WARC file is a collection of records, each record storing the request (headers, in our case) and the response (payload or revisit).
Each record has a URL. For "simple" requests (GET), it is the full URL <scheme>://<host>/<path>?<query_string> (record.rec_headers["WARC-Target-URI"]). For POST/PUT requests, the "query_string" also contains the post data: ?__wb_method=POST&__wp_post_data=... (record.urlkey).

Fuzzy matching mostly consists of transforming a complex URL (https://www.youtube.com/youtubei/v1/foo/baz/things?key=value&other_key=other_value&videoId=xxxx&yet_another_key=yet_another_value) into a simpler one (youtube.fuzzy.replayweb.page/youtubei/v1/foo/baz/things?videoId=xxxx). The idea is that other_value and yet_another_value may be dynamically generated and so may differ from what was scraped. So we want to generate a URL that contains only the discriminant information. "Fuzzy matching" describes the implementation algorithm well, but the functionality itself is closer to "reducing" or "simplifying" the URL. I will use "reducing" the URL (and "reduced URL") from now on.
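As an illustration of URL reducing, here is a minimal Python sketch. It assumes a rule can be expressed as "for this host, use this reduced host and keep only these query keys". The `REDUCE_RULES` table and the `reduce_url` name are hypothetical; the actual wabac rules are more general than this.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit

# Hypothetical rule table: host -> (reduced host, query keys worth keeping).
# This only mirrors the YouTube example from the description above.
REDUCE_RULES = {
    "www.youtube.com": ("youtube.fuzzy.replayweb.page", ("videoId",)),
}


def reduce_url(url: str) -> str:
    """Return the reduced form of `url` (scheme dropped).

    When no rule matches the host, the URL is returned as
    `<host><path>?<query_string>` with its query left untouched.
    """
    parts = urlsplit(url)
    rule = REDUCE_RULES.get(parts.netloc)
    if rule is None:
        query = f"?{parts.query}" if parts.query else ""
        return f"{parts.netloc}{parts.path}{query}"
    reduced_host, keep = rule
    # Keep only the discriminant query parameters, in their original order.
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key in keep
    ]
    query = f"?{urlencode(kept)}" if kept else ""
    return f"{reduced_host}{parts.path}{query}"
```

With this sketch, the long YouTube URL above reduces to youtube.fuzzy.replayweb.page/youtubei/v1/foo/baz/things?videoId=xxxx, whatever the values of the other parameters.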

Storage of zim entries

We use a simple solution here:

- We don't differentiate GET and POST requests, as the post data is in the query_string.
- We store records with URL <scheme>://<host>/<path>?<query_string> as <host>/<path>?<query_string>. This allows us to handle the <host> as a simple "subdirectory" in the path.
- We store the header part in H/<host>/<path>?<query_string>. Revisits are stored as zim redirects to the <target_host>/<target_path>... entry. [1]
- We also generate a revisit entry for each reduced URL: H/<reduced_url> -> /<host>/<full_path>
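The path scheme above can be sketched in a few lines of Python. The function names (`entry_path`, `header_path`, `revisit_redirect`) are made up for this illustration and are not the ones used in warc2zim. Mapping an empty path to "/" is also an assumption.

```python
from typing import Tuple
from urllib.parse import urlsplit


def entry_path(url: str) -> str:
    """Map <scheme>://<host>/<path>?<query_string> to <host>/<path>?<query_string>."""
    parts = urlsplit(url)
    path = parts.path or "/"  # assumption: a bare host is stored as "<host>/"
    query = f"?{parts.query}" if parts.query else ""
    return f"{parts.netloc}{path}{query}"


def header_path(url: str) -> str:
    """Path of the entry in the H/ namespace for `url`."""
    return "H/" + entry_path(url)


def revisit_redirect(url: str, target_url: str) -> Tuple[str, str]:
    """Return (redirect entry path, redirect target path) for a revisit record."""
    return header_path(url), entry_path(target_url)
```

For example, a revisit of https://example.com/b.css pointing to https://example.com/a.css becomes a redirect from H/example.com/b.css to example.com/a.css.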

Static content rewriting

We use the rewriting module of the pywb project.
This is done by simply importing pywb and instantiating the needed classes with the right (or at least working) options.
URL rewriting seems tightly tied to the dynamic nature of rewriting in pywb/wabac, so I override it with a [simple version](https://github.com/openzim/warc2zim/blob/kiwix_no_sw/src/warc2zim/main.py#L126-L159).
The idea here is pretty simple:

- If the URL is absolute (/css/style.css), transform it to /<host>/css/style.css, <host> being taken from the current context (the article whose content we are currently rewriting).
- If the URL is a full URI (//host/css/style.css or http(s)://host/css/style.css), transform it to /host/css/style.css.
- Transform this absolute URL to a relative one (the base being the URL of the current article) => ../style.css
- Don't change links that are already relative.

So at the end, we store only relative links in our content.
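The steps above can be sketched as follows. This is not the code from the linked `main.py`, only an illustration built on the standard library. `rewrite_static` and its `article_path` parameter (the `<host>/<path>` of the article being rewritten) are assumed names.

```python
import posixpath
from urllib.parse import urlsplit


def rewrite_static(url: str, article_path: str) -> str:
    """Rewrite `url`, found in the article stored at `article_path`, as a relative link."""
    parts = urlsplit(url)
    if parts.netloc:
        # Full URI (//host/... or http(s)://host/...): the host becomes a directory.
        absolute = f"/{parts.netloc}{parts.path or '/'}"
    elif parts.path.startswith("/"):
        # Absolute URL: prepend the host of the article being rewritten.
        host = article_path.split("/", 1)[0]
        absolute = f"/{host}{parts.path}"
    else:
        # Already relative (or not a path at all, e.g. "#anchor"): keep as is.
        return url
    # Make the absolute path relative to the directory of the current article.
    base_dir = posixpath.dirname("/" + article_path)
    relative = posixpath.relpath(absolute, start=base_dir)
    if absolute.endswith("/") and not relative.endswith("/"):
        relative += "/"  # relpath() drops trailing slashes
    query = f"?{parts.query}" if parts.query else ""
    fragment = f"#{parts.fragment}" if parts.fragment else ""
    return f"{relative}{query}{fragment}"
```

For an article stored at example.com/blog/post.html, the link /css/style.css becomes ../css/style.css and //cdn.example.org/lib.js becomes ../../cdn.example.org/lib.js.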

We also insert a small script in the <head> of each HTML document to load wombat and initialize it.
This part is pretty static: wombat is looked up using the URL content/test/A/wombat.js.
This makes the created zim file work only with kiwix-serve (because of the /content/ endpoint) and only if the zim file is named test.zim. [TODO] Rewrite this URL as a relative URL (as for all other URLs).

The wombat configuration is also static, with the URI http://localhost:1234/content/test. This adds a constraint on usage of the zim file, as kiwix-serve must be launched on port 1234 and accessed on localhost only.
This could be made dynamic by letting the initialization script inspect the current URL or send a request to the server (this is a JS script after all).

Dynamic url rewriting

As we (statically) insert wombat in all pages and wrap JS code with the wombat context, all requests coming from JS (which we cannot rewrite statically) are caught by wombat and dynamically rewritten. Most of wombat is kept untouched, but the rewriteURL method is rewritten (as we do for pywb). Here the idea is:

- If the URL is absolute (/css/style.css), transform it to /<host>/css/style.css, <host> being taken from the current context (the current location).
- If the URL is a full URI (//host/css/style.css or /http(s)://host/css/style.css), transform it to /host/css/style.css.
- Transform the absolute link (/host/css/style.css) to a full URI by prepending the prefix => http://localhost:1234/content/test/host/css/style.css (the prefix coming from the wombat configuration).
- Keep relative links unchanged.
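The real implementation lives in wombat and is written in JavaScript. The Python sketch below only illustrates the logic of the steps above. `rewrite_dynamic`, `current_host`, and `PREFIX` are names chosen for the example, and the prefix value is the one hardcoded in the wombat configuration of this PR.

```python
from urllib.parse import urlsplit

# Prefix from the (static) wombat configuration described in this PR.
PREFIX = "http://localhost:1234/content/test"


def rewrite_dynamic(url: str, current_host: str, prefix: str = PREFIX) -> str:
    """Rewrite a URL requested from JS into a full URI served by kiwix-serve."""
    # Accept the "/http(s)://host/..." form by dropping its leading slash.
    if url.startswith(("/http://", "/https://")):
        url = url[1:]
    parts = urlsplit(url)
    if parts.netloc:
        # Full URI: the host becomes the first path component.
        absolute = f"/{parts.netloc}{parts.path or '/'}"
    elif parts.path.startswith("/"):
        # Absolute URL: prepend the host of the current location.
        absolute = f"/{current_host}{parts.path}"
    else:
        # Relative links are kept unchanged.
        return url
    query = f"?{parts.query}" if parts.query else ""
    return f"{prefix}{absolute}{query}"
```

Unlike the static case, the result is a full URI rather than a relative link, so it depends on the prefix used at serving time.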

Content serving

Handling the request is a bit more complex, as we have to reduce the URL on the server (this was done by wabac before) [2]. Locating the content to serve is done with two functions:

- A function locating a content (with the help of getEntryFromPath) and handling potential "revisits": search for /css/style.css; if not found, search for /H/css/style.css and use the targeted item. Here we don't return a 302 for the revisit. You can think of a revisit as an alias or hardlink. If /css/style.css itself (or the target of /H/css/style.css) is a redirect, we return a 302 as before.
- A function searching for the content (reusing the function described above) using URL reducing. The reducing algorithm is taken from wabac's fuzzy matching and is data-oriented. Where the data is stored is open to question. [3]
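The two lookup functions can be sketched like this. The server side is actually C++, so this Python version only illustrates the lookup order. A plain dictionary stands in for the zim archive, and `Entry`, `locate`, and `locate_with_reducing` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class Entry:
    path: str
    content: bytes = b""
    redirect_to: Optional[str] = None  # set for zim redirect entries


Archive = Dict[str, Entry]  # toy stand-in for a zim archive


def locate(archive: Archive, path: str) -> Optional[Entry]:
    """Find `path`, falling back to its H/ alias (revisit) without any 302."""
    entry = archive.get(path)
    if entry is not None:
        return entry
    alias = archive.get("H/" + path)
    if alias is not None and alias.redirect_to is not None:
        # A revisit behaves like a hardlink: serve the target directly.
        return archive.get(alias.redirect_to)
    return None


def locate_with_reducing(
    archive: Archive, path: str, reduce: Callable[[str], str]
) -> Optional[Entry]:
    """Try the exact path first, then the reduced path."""
    entry = locate(archive, path)
    if entry is None:
        entry = locate(archive, reduce(path))
    return entry
```

The caller would then answer with a 302 when the returned entry is itself a redirect (`redirect_to` is set), and with the content otherwise.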

Notes

[1] This saves us from implementing WARC header parsing in the server.
[2] Should we reduce the URL when we rewrite the links (statically and dynamically)? It would save us from doing it in the server when answering a request.
[3] We can store the rules in the zim files or in the clients. With the rules in the zim files, we can update them without touching the clients (assuming the rule declaration format doesn't change), so only warc2zim would have to be updated. Storing the rules in the clients allows updating the rules even for zim files already created (and so fixing zim files that are not "working").
The server doesn't try to answer with the headers described in H/foo/bar.html. It seems to work anyway, but we may have to implement this correctly.
We have a bit of freedom in how we generate the paths of the entries. For example, we could decide to remove the "main" host from the path (<host>/foo/bar.html to foo/bar.html) to have paths that look similar to what we already have in other zim files. But we have to keep static and dynamic rewriting in sync with that.

buffering_record_iter creates a temporary buffered_stream, which is a copy of raw_stream. But doing so consumes raw_stream. We (warc2zim) read buffered_stream, but the pywb rewrite code still reads `content_stream()`, which returns `raw_stream`. Let's monkey patch this.

Better (simpler?) URL rewriting:
- Entries are stored using the path `host/foo/bar.html<query_and_post_string>`.
- URLs are rewritten as relative (which is "easy" as there is no scheme involved; the host is just a "subdirectory").
- URLs are not urlencoded. It is up to the server to add the query string.
- We HTML unescape/escape only if we are in `mp_` mode. So now URLs in CSS work.
- We don't use the pywb `UrlRewriter.rewrite` method now.

rgaudin · 2023-05-30T12:16:51Z

Thank you 👍

Jaifroid · 2023-08-11T08:38:16Z

Hello all, I think it would be good to take stock of where we want to go with the Zimit format, so we don't lose the momentum from May, and so we have a clear roadmap. Also so that @mgautierfr's great work in this PR isn't lost!

I'd also like some clarity on how I should proceed with Kiwix JS. As you know, I was working on integrating the KJS Service Worker with the WARC Service Worker, but this would only provide support for the current format of Zimit ZIMs, and the work would probably have to be re-done if/when we start issuing ZIMs using @mgautierfr's pre-processed ZIM format. I'll soon be in a position to push forward this integration work.

I'm not sure if this is the place to discuss the roadmap, or whether it is better done, say, in the Slack zimit channel (which hasn't had any activity since October 2021).

kelson42 · 2023-08-11T10:46:20Z

> Hello all, I think it would be good to take stock of where we want to go with the Zimit format, so we don't lose the momentum from May, and so we have a clear roadmap. Also so that @mgautierfr's great work in this PR isn't lost!

The ZIMit 2.0 project is scheduled to start at the end of this year. More to come then.

> I'd also like some clarity on how I should proceed with Kiwix JS. As you know, I was working on integrating the KJS Service Worker with the WARC Service Worker, but this would only provide support for the current format of Zimit ZIMs, and the work would probably have to be re-done if/when we start issuing ZIMs using @mgautierfr's pre-processed ZIM format. I'll soon be in a position to push forward this integration work.

On my side, this is unclear. ZIMit ZIM files are at the forefront regarding SW, but they are not the only ones. The question is fully open on my side.

> I'm not sure if this is the place to discuss the roadmap, or whether it is better done, say, in the Slack zimit channel (which hasn't had any activity since October 2021).

The roadmap (what and when) will come, and we should discuss it once we have a first proposal (it is on my table). The Kiwix JS topic is something different, probably to be discussed in the Kiwix JS repository.

Jaifroid · 2023-08-12T10:13:47Z

OK, thanks @kelson42. That gives me some idea of timelines.

kelson42 · 2023-12-16T19:25:23Z

@mgautierfr This PR was super useful, but I think at this stage we don't need it anymore. Right?

mgautierfr added 8 commits May 26, 2023 15:40

Do not add service worker in zim file.

ea0b101

Remove other unnecessary files.

Be able to add static files.

5585d98

Static files need to not being rendered by template engine

Add wombat.js

b819689

File is coming from branch `kiwix_no_sw` of wombat project fork.

Replace sw_check.html by head_insert.html

047057b

`head_insert.html` is mostly taken from pywb.

Statically rewrite content.

1f5b9dd

Create H/foo redirection instead of add revisit header files.

a88de00

I don't want to add warc header parsing in kiwix-serve for now.

mgautierfr assigned rgaudin, ikreymer and kelson42 May 30, 2023

mgautierfr mentioned this pull request Sep 18, 2023

zimit v2. [libzim/libkiwix/warc2zim part] kiwix/overview#95

Closed

12 tasks

mgautierfr mentioned this pull request Nov 3, 2023

[WIP] Introduce fuzzyRules storage and exploitation. openzim/libzim#835

Closed

kelson42 closed this Dec 19, 2023
