warc2zim, without service worker. #113
Remove other unnecessary files.
Static files must not be rendered by the template engine
File is coming from branch `kiwix_no_sw` of wombat project fork.
`head_insert.html` is mostly taken from pywb.
`buffering_record_iter` creates a temporary `buffered_stream`, which is a copy of `raw_stream`. But in doing so, it reads `raw_stream`. We (warc2zim) read `buffered_stream`, but the pywb rewrite code still reads `content_stream()`, which returns `raw_stream`. Let's monkey patch this.
I don't want to add warc header parsing in kiwix-serve for now.
Better (simpler?) URL rewriting:
- Entries are stored using the path `host/foo/bar.html<query_and_post_string>`.
- URLs are rewritten as relative URLs, which is "easy" because no scheme is involved and the host is just a "subdirectory".
- URLs are not URL-encoded. It is up to the server to add the query string.
- We HTML unescape/escape only if we are in `mp_` mode, so URLs in CSS now work.
- We no longer use the pywb `UrlRewriter.rewrite` method.
Thank you 👍
Hello all, I think it would be good to take stock of where we want to go with the Zimit format, so we don't lose the momentum from May, and so we have a clear roadmap. Also so that @mgautierfr's great work in this PR isn't lost!

I'd also like some clarity on how I should proceed with Kiwix JS. As you know, I was working on integrating the KJS Service Worker with the WARC Service Worker, but this would only provide support for the current format of Zimit ZIMs, and the work would probably have to be re-done if/when we start issuing ZIMs using @mgautierfr's pre-processed ZIM format. I'll soon be in a position to push forward this integration work.

I'm not sure if this is the place to discuss the roadmap, or whether it is better done, say, in the Slack zimit channel (which hasn't had any activity since October 2021).
The ZIMit 2.0 project is scheduled to start at the end of this year. More to come then.
On my side, this is unclear. ZIMit-type ZIM files are in the foreground regarding SW, but they are not the only ones. The question is fully open on my side.
The roadmap (what and when) will come, and we should discuss this when we have a first proposal (on my table). The Kiwix JS topic is something different, probably to be discussed in the Kiwix JS repository.
OK, thanks @kelson42. That gives me some idea of timelines.
@mgautierfr This PR was super useful, but I think at this stage we don't need it anymore. Right?
Introduction
This PR is not really intended to be merged.
This is the result of the Kiwix hackathon, where we succeeded in creating "warc2zim" ZIM files without a service worker, thanks to static rewriting. This PR (and especially this PR comment) describes what has been done, how, and why.
A WARC file is a collection of records, each record storing the request (headers in our case) and the response (payload or revisit).

Each record has a URL, which is a full URL `<scheme>://<host>/<path>?<query_string>` (`record.rec_headers["WARC-Target-URI"]`) for "simple" requests (GET). For POST/PUT requests, the "query_string" also contains the post data `?__wb_method=POST&__wp_post_data=...` (`record.urlkey`).

Fuzzy matching mostly corresponds to transforming a complex URL (`https://www.youtube.com/youtubei/v1/foo/baz/things?key=value&other_key=other_value&videoId=xxxx&yet_another_key=yet_another_value`) into a simpler one (`youtube.fuzzy.replayweb.page/youtubei/v1/foo/baz/things?videoId=xxxx`). The idea is that `other_value` and `yet_another_value` could be dynamically generated and so differ from what has been scraped. So we want to generate URLs that contain only the discriminating information. While "fuzzy matching" describes the implementation algorithm well, the functionality itself is more about "reducing" or "simplifying" the URL. I will use "reducing" the URL (and "reduced URL") from now on.

Storage of zim entries
We use a simple solution here:
- A record with URL `<scheme>://<host>/<path>?<query_string>` is stored as `<host>/<path>?<query_string>`. This allows us to handle the `<host>` as a simple "subdirectory" in the path.
- `H/<host>/<path>?<query_string>`. Revisits are stored as ZIM redirects to the `<target_host>/<target_path>...` entry. [1]
- `H/<reduced_url>` -> `/<host>/<full_path>`
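
As a rough illustration of the storage scheme and of URL reduction, here is a minimal Python sketch. The names `entry_path`, `reduced_path` and `REDUCTION_RULES` are hypothetical, and the single rule shown is only modelled on the YouTube example above; the real fuzzy-matching rules are not reproduced here.

```python
from typing import Optional
from urllib.parse import parse_qsl, urlencode, urlsplit

# Hypothetical rule table: original host -> (reduced host, query keys to keep).
# Only modelled on the YouTube example above; not the real fuzzy-matching rules.
REDUCTION_RULES = {
    "www.youtube.com": ("youtube.fuzzy.replayweb.page", ("videoId",)),
}


def entry_path(url: str) -> str:
    """Map <scheme>://<host>/<path>?<query_string> to <host>/<path>?<query_string>."""
    parts = urlsplit(url)
    path = parts.netloc + parts.path
    return f"{path}?{parts.query}" if parts.query else path


def reduced_path(url: str) -> Optional[str]:
    """Return the H/<reduced_url> alias for a URL, or None if no rule applies."""
    parts = urlsplit(url)
    rule = REDUCTION_RULES.get(parts.netloc)
    if rule is None:
        return None
    reduced_host, keys_to_keep = rule
    # Keep only the query parameters that discriminate one request from another.
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key in keys_to_keep
    ]
    path = reduced_host + parts.path
    query = urlencode(kept)
    return f"H/{path}?{query}" if query else f"H/{path}"
```

With this sketch, the alias `reduced_path(url)` would be stored as a redirect pointing to `"/" + entry_path(url)`.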
Static content rewriting
We use the rewriting module of the pywb project.
This is done by simply importing pywb and instantiating the needed classes with the right (or at least working) options.
URL rewriting seems tightly coupled to the dynamic nature of rewriting in pywb/wabac, so I override it with a [simple version](https://github.com/openzim/warc2zim/blob/kiwix_no_sw/src/warc2zim/main.py#L126-L159).
The idea here is pretty simple:
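- If the URL is an absolute path (`/css/style.css`), transform it to `/<host>/css/style.css`, `<host>` being taken from the current context (the article whose content we are currently rewriting).
- If the URL is a full URL (`//host/css/style.css` or `http(s)://host/css/style.css`), transform it to `/host/css/style.css`.
- Finally, make the resulting path relative to the current entry, for example `../style.css`.

So in the end, we store only relative links in our content.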
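
The rules above can be sketched in a few lines of Python. This is only an illustration under assumptions, not the code at the linked lines: the function name `rewrite_url` and the `current_entry` parameter (the ZIM path of the document being rewritten, such as `example.com/blog/post.html`) are hypothetical.

```python
import posixpath
from urllib.parse import urlsplit


def rewrite_url(url: str, current_entry: str) -> str:
    """Rewrite `url` as a link relative to `current_entry` (hypothetical helper)."""
    current_host = current_entry.split("/", 1)[0]
    parts = urlsplit(url)

    if parts.netloc:
        # Full URL (//host/... or http(s)://host/...): the host becomes a directory.
        absolute = f"/{parts.netloc}{parts.path or '/'}"
    elif parts.path.startswith("/"):
        # Absolute path: prefix it with the host of the current document.
        absolute = f"/{current_host}{parts.path}"
    else:
        # Already relative (or data:, mailto:, ...): leave untouched.
        return url

    # Make the path relative to the directory of the current entry.
    base_dir = posixpath.dirname("/" + current_entry)
    relative = posixpath.relpath(absolute, base_dir)

    # The query string is kept as is (not URL-encoded again).
    if parts.query:
        relative += "?" + parts.query
    if parts.fragment:
        relative += "#" + parts.fragment
    return relative
```

For a document stored at `example.com/blog/post.html`, `rewrite_url("/css/style.css", ...)` gives `../css/style.css`, and a link to another host climbs one level higher before descending into that host's "subdirectory".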
We also insert a small script in each `<head>` of HTML content to load wombat and initialize it.
This part is pretty static: wombat is looked up using the URL `content/test/A/wombat.js`.
This makes the created ZIM file work only with kiwix-serve (because of the `/content/` endpoint) and only if the ZIM file is named `test.zim`. [TODO] Rewrite this URL as a relative URL (as for all other URLs).
The configuration of wombat is also static, with the URI `http://localhost:1234/content/test`. This adds a constraint on the usage of the ZIM file, as kiwix-serve must be launched on port 1234 and accessed on localhost only.
This could be made dynamic by letting the initialization script inspect the current URL or send a request to the server (this is a JS script after all).
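
One possible way to address the [TODO] is to compute the wombat URL relative to each entry when the script tag is inserted. The sketch below is an assumption-laden illustration: `WOMBAT_ENTRY` (where wombat would live inside the ZIM) and `insert_wombat_loader` are hypothetical names, and the wombat initialization code itself is left out.

```python
import posixpath
import re

# Hypothetical location of wombat inside the ZIM file (an assumption).
WOMBAT_ENTRY = "_static/wombat.js"

# Matches the opening <head> tag, with or without attributes (but not <header>).
HEAD_RE = re.compile(r"<head(\s[^>]*)?>", re.IGNORECASE)


def relative_url(target_entry: str, current_entry: str) -> str:
    """Path from the directory of `current_entry` to `target_entry`."""
    base_dir = posixpath.dirname("/" + current_entry)
    return posixpath.relpath("/" + target_entry, base_dir)


def insert_wombat_loader(html: str, current_entry: str) -> str:
    """Insert a <script> tag loading wombat right after the opening <head>."""
    src = relative_url(WOMBAT_ENTRY, current_entry)
    tag = f'<script src="{src}"></script>'
    return HEAD_RE.sub(lambda match: match.group(0) + tag, html, count=1)
```

Because the `src` is relative, the page would no longer depend on the `/content/` endpoint or on the ZIM file name.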
Dynamic url rewriting
As we (statically) insert wombat in all pages and wrap JS code with the wombat context, all requests coming from JS (which we could not rewrite statically) are caught by wombat and dynamically rewritten. Most of wombat is kept untouched, but the `rewriteURL` method is rewritten (as we do for pywb). Here the idea is:
- If the URL is an absolute path (`/css/style.css`), transform it to `/<host>/css/style.css`, `<host>` being taken from the current context (the current location).
- If the URL is a full URL (`//host/css/style.css` or `/http(s)://host/css/style.css`), transform it to `/host/css/style.css`.
- Transform the resulting path (`/host/css/style.css`) to a full URI by prepending the prefix => `http://localhost:1234/content/test/host/css/style.css` (the prefix coming from the wombat configuration).

Content serving
Handling the request is a bit more complex, as we have to reduce the URL on the server (it was done by wabac before) [2]. The content to serve is located this way:
- Search for `/css/style.css`; if it is not found, search for `/H/css/style.css` and use the targeted item. Here we don't return a 302 for the revisit. You can think of a revisit as an alias or hardlink. If `/css/style.css` itself (or the target of `/H/css/style.css`) is a redirect, we return a 302 as before.

Notes
- […] `H/foo/bar.html`. It seems to work anyway, but we may have to implement it correctly.
- […] (`<host>/foo/bar.html` to `foo/bar.html`) to have paths that look similar to what we already have in other ZIM files. But we have to keep static and dynamic rewriting in sync with that.
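
To make the lookup described under "Content serving" more concrete, here is a minimal Python sketch. It is not kiwix-serve code: the archive is a plain dictionary standing in for a ZIM file, the `serve` function and the `("content", ...)` / `("redirect", ...)` entry shapes are hypothetical, and the URL reduction step is left out.

```python
from typing import Dict, Optional, Tuple

# Toy stand-in for a ZIM archive (an assumption):
#   path -> ("content", payload) or ("redirect", target_path)
Archive = Dict[str, Tuple[str, object]]


def serve(archive: Archive, path: str) -> Tuple[int, Optional[object]]:
    """Return (status, payload or redirect target) for a requested path."""
    entry = archive.get(path)
    if entry is None:
        # Not found directly: look for an alias under H/ and use its target.
        alias = archive.get("H/" + path)
        if alias is None:
            return 404, None
        _, target = alias
        # The alias is followed silently, like a hardlink: no 302 here.
        entry = archive.get(str(target).lstrip("/"))
        if entry is None:
            return 404, None

    kind, value = entry
    if kind == "redirect":
        # The entry itself (or the alias target) is a redirect: answer 302.
        return 302, value
    return 200, value
```

In this sketch an `H/` alias never produces a 302 by itself; only a redirect found at the requested path, or at the alias target, does.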