warc2zim, without service worker. #113


@mgautierfr mgautierfr commented May 30, 2023

Introduction

This PR is not really intended to be merged.
This is the result of the Kiwix hackathon, where we succeeded in creating "warc2zim" zim files without a service worker, thanks to static rewriting. This PR (and especially this PR comment) describes what has been done, how, and why.

A warc file is a collection of records, each record storing the request (headers in our case) and the response (payload or revisit).
Each record has a url, which is a full url <scheme>://<host>/<path>?<query_string> (record.rec_headers["WARC-Target-URI"]) for "simple" requests (GET). For POST/PUT requests, the "query_string" also contains the post data ?__wb_method=POST&__wp_post_data=... (record.urlkey).

Fuzzy matching mostly corresponds to transforming a complex url (https://www.youtube.com/youtubei/v1/foo/baz/things?key=value&other_key=other_value&videoId=xxxx&yet_another_key=yet_another_value) into a simpler one (youtube.fuzzy.replayweb.page/youtubei/v1/foo/baz/things?videoId=xxxx). The idea is that other_value and yet_another_value could be dynamically generated and so differ from what has been scraped. So we want to generate a url with only the discriminating information. While "fuzzy matching" describes the implementation algorithm well, the functionality itself is more about "reducing" or "simplifying" the url. I will use "reducing" the url (and "reduced url") from now on.
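To make the idea of "reducing" concrete, here is a minimal Python sketch. The rule table `REDUCE_RULES` and the function `reduce_url` are hypothetical names used for illustration only; the real rules come from wabac's fuzzy matching data and are more elaborate than a per-host list of query keys.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit

# Hypothetical rule table: host -> (replacement host, query keys to keep).
# Illustrative only; the actual rules live in wabac's fuzzy matching data.
REDUCE_RULES = {
    "www.youtube.com": ("youtube.fuzzy.replayweb.page", ("videoId",)),
}


def reduce_url(url: str) -> str:
    """Return a scheme-less url keeping only the discriminating query keys."""
    parts = urlsplit(url)
    rule = REDUCE_RULES.get(parts.netloc)
    if rule is None:
        # No rule for this host: only the scheme is dropped.
        reduced = parts.netloc + parts.path
        return reduced + ("?" + parts.query if parts.query else "")
    new_host, kept_keys = rule
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key in kept_keys
    ]
    reduced = new_host + parts.path
    return reduced + ("?" + urlencode(kept) if kept else "")
```

With this sketch, the YouTube url given above reduces to the simpler one quoted in the same paragraph, while urls on hosts without a rule only lose their scheme.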

Storage of zim entries

We use a simple solution here:

  • We don't differentiate between GET and POST requests, as the post_data will be in the query_string.
  • We store records with url <scheme>://<host>/<path>?<query_string> as <host>/<path>?<query_string>. This allows us to handle the <host> as a simple "subdirectory" in the path.
  • We store the header part in H/<host>/<path>?<query_string>. Revisits are stored as zim redirects to the <target_host>/<target_path>... entry. [1]
  • We also generate a revisit entry for each reduced url: H/<reduced_url> -> /<host>/<full_path>
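The mapping from a record url to its zim paths can be sketched as follows. This is an illustration under stated assumptions, not the code of this PR: `content_path` and `header_path` are hypothetical helper names, and the url is assumed to already carry any POST data in its query string.

```python
from urllib.parse import urlsplit


def content_path(url: str) -> str:
    """Drop the scheme so the host becomes the first "subdirectory"."""
    parts = urlsplit(url)
    path = parts.netloc + parts.path
    if parts.query:
        # POST data, if any, is assumed to already be part of the query string.
        path += "?" + parts.query
    return path


def header_path(url: str) -> str:
    """Headers (or revisit redirects) live under the H/ prefix."""
    return "H/" + content_path(url)
```

For example, `https://example.com/css/style.css?v=2` would be stored at `example.com/css/style.css?v=2`, with its headers at `H/example.com/css/style.css?v=2`.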

Static content rewriting

We use the rewriting module of the pywb project.
This is done by simply importing pywb and instantiating the needed classes with the right (or at least working) options.
Url rewriting seems closely tied to the dynamic nature of rewriting in pywb/wabac, so I override it with a [simple version](https://github.com/openzim/warc2zim/blob/kiwix_no_sw/src/warc2zim/main.py#L126-L159).
The idea here is pretty simple:

  • If the url is absolute (/css/style.css), transform it to /<host>/css/style.css, <host> being taken from the current context (the article whose content we are currently rewriting).
  • If the url is a full uri (//host/css/style.css or http(s)://host/css/style.css), transform it to /host/css/style.css.
  • Transform this absolute url into a relative one (the base being the url of the current article) => ../style.css
  • Don't change links that are already relative.

So at the end, we store only relative links in our content.
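The rules above can be sketched in a few lines of Python. This is a simplified illustration, not the linked implementation: `rewrite_static` is a hypothetical name, and the sketch ignores escaping and the special schemes (`data:`, `mailto:`, ...) beyond leaving them untouched.

```python
import posixpath
from urllib.parse import urlsplit


def rewrite_static(url: str, article_url: str) -> str:
    """Rewrite `url` found in the article at `article_url` as a relative link."""
    article = urlsplit(article_url)
    target = urlsplit(url)
    if target.scheme in ("http", "https") or url.startswith("//"):
        # Full uri: the host becomes the first path component.
        host, path = target.netloc, target.path or "/"
    elif url.startswith("/"):
        # Absolute url: the host comes from the article being rewritten.
        host, path = article.netloc, target.path
    else:
        # Already relative (or a scheme we don't handle): keep as-is.
        return url
    absolute = "/" + host + path
    base_dir = posixpath.dirname("/" + article.netloc + (article.path or "/"))
    relative = posixpath.relpath(absolute, start=base_dir)
    if path.endswith("/") and not relative.endswith("/"):
        relative += "/"  # relpath drops trailing slashes; restore them
    if target.query:
        relative += "?" + target.query
    if target.fragment:
        relative += "#" + target.fragment
    return relative
```

For an article at `https://example.com/blog/post.html`, the link `/css/style.css` becomes `../css/style.css`, and a link to another host such as `https://cdn.example.org/lib.js` becomes `../../cdn.example.org/lib.js`, since hosts are just sibling "subdirectories".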

We also insert a small script in the <head> of each html content to load wombat and initialize it.
This part is pretty static: wombat is looked up using the url content/test/A/wombat.js.
This makes the created zim file work only with kiwix-serve (because of the /content/ endpoint) and only if the zim file is named test.zim. [TODO] rewrite this url as a relative url (as for all other urls).

The configuration of wombat is also static, with the uri http://localhost:1234/content/test. This adds a constraint on the usage of the zim file, as kiwix-serve must be launched on port 1234 and accessed on localhost only.
This could be made dynamic by letting the initialization script inspect the current url or send a request to the server (this is a js script after all).

Dynamic url rewriting

As we (statically) insert wombat in all pages and wrap js code with the wombat context, all requests coming from js (which we cannot rewrite statically) are caught by wombat and dynamically rewritten. Most of wombat is kept untouched, but the rewriteURL method is rewritten (as we do for pywb). Here the idea is:

  • If the url is absolute (/css/style.css), transform it to /<host>/css/style.css, <host> being taken from the current context (the current location).
  • If the url is a full uri (//host/css/style.css or /http(s)://host/css/style.css), transform it to /host/css/style.css.
  • Transform the absolute link (/host/css/style.css) into a full uri by prepending the prefix => http://localhost:1234/content/test/host/css/style.css (the prefix coming from the wombat configuration).
  • Keep relative links unchanged.
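The actual change lives in wombat's JavaScript `rewriteURL`; the following is only a Python transliteration of the rules above, written to make the logic easy to follow. `rewrite_dynamic` and its parameters are hypothetical names, and the prefix is assumed to be the one from the wombat configuration.

```python
from urllib.parse import urlsplit


def rewrite_dynamic(url: str, current_host: str, prefix: str) -> str:
    """Rewrite a url requested from js into a full uri under `prefix`."""
    # The description lists "/http(s)://host/..." as a full-uri form.
    if url.startswith(("/http://", "/https://")):
        url = url[1:]
    target = urlsplit(url)
    if target.scheme in ("http", "https") or url.startswith("//"):
        # Full uri: the host becomes the first path component.
        absolute = "/" + target.netloc + (target.path or "/")
    elif url.startswith("/"):
        # Absolute url: the host comes from the current location.
        absolute = "/" + current_host + target.path
    else:
        # Relative links are left to the browser to resolve.
        return url
    if target.query:
        absolute += "?" + target.query
    return prefix.rstrip("/") + absolute
```

With the prefix `http://localhost:1234/content/test` and a current host of `host`, the url `/css/style.css` is rewritten to `http://localhost:1234/content/test/host/css/style.css`.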

Content serving

Handling the request is a bit more complex, as we have to reduce the url on the server (it was done by wabac before) [2]. The content to serve is located this way:

  • A function locating a content entry (with the help of getEntryFromPath) and handling potential "revisits": search for /css/style.css; if not found, search for /H/css/style.css and use the targeted item. Here we don't return a 302 for the revisit. You can think of a revisit as an alias or hardlink. If /css/style.css itself (or the target of /H/css/style.css) is a redirect, we return a 302 as before.
  • A function searching for the content (reusing the function described above) using url reducing. The reducing algorithm is taken from wabac's fuzzy matching and is data-oriented. Where the data is stored is open to question [3].
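The two lookup functions can be sketched as follows. This is a toy model, not the server code: the archive is a plain dict, and `Entry`, `locate`, and `serve` are hypothetical names. The real server goes through getEntryFromPath instead.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Entry:
    payload: bytes = b""
    redirect_to: Optional[str] = None  # set for zim redirect entries


def locate(entries: dict[str, Entry], path: str) -> Optional[tuple[int, bytes]]:
    """Find `path`, silently following a revisit alias stored under H/."""
    entry = entries.get(path)
    if entry is None:
        alias = entries.get("H/" + path)
        if alias is None or alias.redirect_to is None:
            return None
        # A revisit acts like a hardlink: no 302 is sent for it.
        entry = entries.get(alias.redirect_to.lstrip("/"))
        if entry is None:
            return None
    if entry.redirect_to is not None:
        # A genuine redirect entry still answers with a 302.
        return 302, entry.redirect_to.encode()
    return 200, entry.payload


def serve(
    entries: dict[str, Entry], path: str, reduce: Callable[[str], str]
) -> tuple[int, bytes]:
    """Try the exact path first, then the reduced one."""
    found = locate(entries, path)
    if found is None:
        found = locate(entries, reduce(path))
    return found if found is not None else (404, b"")
```

Because reduced urls are stored as `H/<reduced_url>` redirects, the second `locate` call resolves them through the same alias mechanism as ordinary revisits.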

Notes

  • [1] This saves us from implementing warc header parsing in the server.
  • [2] Should we reduce the url when we rewrite the links (statically and dynamically)? It would save us from doing it in the server when answering a request.
  • [3] We can store the rules in the zim files or in the clients. In the zim files, we can update the rules without taking care of the client (assuming the rules declaration format doesn't change), so only warc2zim would have to be updated. Storing the rules in the clients allows rules to be updated even for zim files already created (and so for zim files that are not "working").
  • The server doesn't try to answer with the headers described in H/foo/bar.html. It seems to work anyway, but we may have to implement it correctly.
  • We have a bit of freedom in how we generate the paths of the entries. For example, we could decide to remove the "main" host from the path (<host>/foo/bar.html to foo/bar.html) to get paths that look similar to what we already have in other zim files. But we have to keep static and dynamic rewriting in sync with that.

Remove other unnecessary files.
Static files must not be rendered by the template engine.
File is coming from branch `kiwix_no_sw` of wombat project fork.
`head_insert.html` is mostly taken from pywb.
buffering_record_iter creates a temporary buffered_stream, which is a copy
of raw_stream. But in doing so, it reads raw_stream.

We (warc2zim) read buffered_stream, but the pywb rewrite code still
reads `content_stream()`, which returns `raw_stream`.

Let's monkey patch this.
I don't want to add warc header parsing in kiwix-serve for now.
Better (simpler ?) url rewriting:
- Entries are stored using the path:
  `host/foo/bar.html<query_and_post_string>`
- Urls are rewritten as relative (which is "easy" as there is no scheme
  involved; the host is just a "subdirectory").
- Urls are not urlencoded. It is up to the server to add the query string.
- We html unescape/escape only if we are in `mp_` mode. So now url in
  css works.
- We don't use the pywb `UrlRewriter.rewrite` method now.
@rgaudin
Member

rgaudin commented May 30, 2023

Thank you 👍

@Jaifroid

Hello all, I think it would be good to take stock of where we want to go with the Zimit format, so we don't lose the momentum from May, and so we have a clear roadmap. Also so that @mgautierfr's great work in this PR isn't lost!

I'd also like some clarity on how I should proceed with Kiwix JS. As you know, I was working on integrating the KJS Service Worker with the WARC Service Worker, but this would only provide support for the current format of Zimit ZIMs, and the work would probably have to be re-done if/when we start issuing ZIMs using @mgautierfr's pre-processed ZIM format. I'll soon be in a position to push forward this integration work.

I'm not sure if this is the place to discuss the roadmap, or whether it is better done, say, in the Slack zimit channel (which hasn't had any activity since October 2021).

@kelson42
Contributor

kelson42 commented Aug 11, 2023

> Hello all, I think it would be good to take stock of where we want to go with the Zimit format, so we don't lose the momentum from May, and so we have a clear roadmap. Also so that @mgautierfr's great work in this PR isn't lost!

The ZIMit 2.0 project is scheduled to start at the end of this year. More to come then.

> I'd also like some clarity on how I should proceed with Kiwix JS. As you know, I was working on integrating the KJS Service Worker with the WARC Service Worker, but this would only provide support for the current format of Zimit ZIMs, and the work would probably have to be re-done if/when we start issuing ZIMs using @mgautierfr's pre-processed ZIM format. I'll soon be in a position to push forward this integration work.

On my side, this is unclear. ZIMit-type ZIM files are at the forefront regarding SW, but they are not the only ones. The question is fully open on my side.

> I'm not sure if this is the place to discuss the roadmap, or whether it is better done, say, in the Slack zimit channel (which hasn't had any activity since October 2021).

The roadmap (what and when) will come, and we should discuss it once we have a first proposal (it is on my desk). The Kiwix JS topic is something different, probably to be discussed in the Kiwix JS repository.

@Jaifroid

OK, thanks @kelson42. That gives me some idea of timelines.

@kelson42
Contributor

@mgautierfr This PR was super useful, but I think at this stage we don't need it anymore. Right?

@kelson42 kelson42 closed this Dec 19, 2023