Support single Python version (3.12) and move rewriting stuff from warc2zim to zimscraperlib #204

benoit74 · 2024-10-17T09:47:15Z

Support single Python version

Maintaining multiple Python versions in scraperlib has proved to not bring much values, since we are the main users of the scraperlib, and other contributors starting a new scraper (me few years ago, joseph more recently) are happy to use a recent Python version which is known to be the "best" available.

It is also a pain, due to some things not being available in old Python versions. For instance, here while importing rewriting logic from warc2zim, the walk_up options of PurePosixPath.relative_to is not available before 3.12 ; and many other semantics were broken in 3.9 and 3.10 (optionals mainly).

Old scraperlib supported old Python versions are still here on PyPi should someone still need to use an old Python, like we do on our "old" scrapers. And from experience, when it is time to use recent features of the scraperlib on a scraper, we always take the update opportunity to also update Python, since this is our goal / duty.

Move rewriting stuff from warc2zim to zimscraperlib

Rewriting CSS, JS and HTML is not something specific to warc2zim, we need to do it for libretexts as well for instance. While being definitely way simpler, it is still a shame to duplicate this logic in every scraper. This PR is hence moving most of the rewriting logic here. Most because the border between warc2zim specifics and general code was not always easy to draw. I'm pretty sure we will rework it slightly in the coming months. It is however far sufficient as a first attempt.

It means that we also moves fuzzy rules definitions to scraperlib, which is great (problematic URLs on one scraper will be probablematic on others as well).

It means that we also moves JS code to setup wombat (including dynamic rewriting with fuzzy rules) in the scraperlib.

The CI hence becomes a bit more complex to handle these JS things.

Changes

While most of the code is identical to warc2zim code, I had to change few things, and also took the opportunity to clean / streamline some obscure corners. Here are the changes:

automatically publish compiled wombatSetup.js to https://dev.kiwix.org/zimscraperlib/wombatSetup.js at every push on main branch (with new PublishDev.yaml workflow, and use it at python installation time
automatically add version as comment (+ debug log in dev) in wombatSetup.js with rollup-plugin-version-injector
automatically remove console.* statements in prod build of wombatSetup.js (fix wombatSetup.js floods console with urlRewritten: messages warc2zim#404) with @rollup/plugin-strip
simplify maintenance of CI with multiple chained jobs (avoid the dirty hack of warc2zim with a custom docker image to run stuff, with custom build_js.sh)
properly test build/packaging of JS code in Tests.yaml
slightly modify rewriting logic to uncouple few things
move tests, add more to ensure 100% coverage and proper testing even without warc2zim specific tests
move relevant documentation

What is left to decide

Nothing, this has been discussed with rgaudin. Kept here for reference purpose on which decisions have been taken and why.

we could publish wombat-setup to NPM ; I finally see little added value in doing so for now, wombatSetup.js is not meant to be used as a dependency in a JS project yet (can we really imagine a Vue.JS with wombat?) ; CI is anyway already ready and tested, I just commented-it out for now => we will not publish it to NPM for now ; we will however include wombat.js (license is AGPL 3 which is OK) and wombatSetup.js (our own code) in both the wheel and the sdist to ease installation and ensure versions are aligned
do we want to version wombatSetup.js with semver? I think there is no added value for now, just the build date/time (see first line of https://dev.kiwix.org/zimscraperlib/wombatSetup.js) is sufficient to debug what is going on, and this file is released with the versioned zimscraperlib anyway. => no, inclusion in the sdist and the wheel ensures that it is "tight" to a zimscraperlib version, probably sufficient given current usage

Testing

This PR has been not yet been tested with warc2zim or libretexts, it looks complex enough and it is not a big deal to have a second PR to fix some integration issues should we get into something.

The Publish jobs have been tested for proper operation with hacks to run them on this branch temporarily and not real publish to their target but save artifacts temporarily. This worked well, so merging to main and then releasing should be relatively smooth.

codecov · 2024-10-17T09:49:45Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 100.00%. Comparing base (8b6cbac) to head (bed3960).

Additional details and impacted files

@@            Coverage Diff             @@
##              main      #204    +/-   ##
==========================================
  Coverage   100.00%   100.00%            
==========================================
  Files           33        38     +5     
  Lines         1616      2219   +603     
  Branches       303       426   +123     
==========================================
+ Hits          1616      2219   +603

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

benoit74 · 2024-10-18T13:42:37Z

@rgaudin good luck with this review, sorry for that, not sure how you can save time on this big thing ... especially since 90% of the code has already been reviewed in warc2zim ...

rgaudin

While I agree with your suggestion not to publish to NPM (at least for now), I am surprised that we dont publish it at all on releases.

We just have a dev version that fluctuates with main but we cannot refer to a specific version/build and that needs to be fixed.

If the goal is to build at runtime, then there should be no automatic/trasnparent fallback. Those who cant build it and want to use the fallback should be satisfied with a note in the README.

Beside this question that is not clear, it all looks good to me.

.github/workflows/Publish.yaml

rgaudin · 2024-10-18T15:16:05Z

.github/workflows/QA.yaml

+          pip install -U pip
+          pip install -e .[scripts]
+
+      - name: Generate rules


It's not clear to me why we need to generate the rules for QA. Are we checking the format/linting of the compiled JS?

We are not checking JS (I've been lazy on this tbh) but we are in Python, and it is even mandatory for pyright not to complain about missing import.

.pre-commit-config.yaml

rgaudin · 2024-10-18T15:44:22Z

openzim.toml

+source="https://cdn.jsdelivr.net/npm/@webrecorder/wombat@3.8.2/dist/wombat.js"
+target_file="wombat.js"
+
+[files.assets.actions."wombatSetup.js"] # fallback if this script has not been properly built


This is unclear. What is the fallback and what are the conditions to trigger it?

Removed as discussed, we include it in both sdist and wheel, so we do not need it anywhere else.

rgaudin · 2024-10-19T16:36:48Z

rules/rules.yaml

+# scraperlib we are not really fuzzy matching (searching the best entry among existing
+# ones) but just rewriting to proper path.
+#
+# This file is in sync with content at commit 879018d5b96962df82340a9a57570bbc0fc67815


This commit is nowhere to be found: neither on war2zim, zimit or scraperlib. Should be fixed or removed.

This is a commit of wabac.js since we are speaking of syncing with this repo. I will update the comment to put the whole link to webrecorder/wabac.js@879018d so that it is even more obvious

benoit74 · 2024-10-21T06:50:53Z

We just have a dev version that fluctuates with main but we cannot refer to a specific version/build and that needs to be fixed.

We do release / version / publish it in the wheel, and this might be sufficient from my (very recent and uncertain) PoV. The fallback is more for those who would prefer to use the sdist but don't mind / have sufficient knowledge / understanding of the project to build the JS part (since it is kind of a trick anyway).

Let's discuss it live, I agree it is far from simple.

benoit74 · 2024-10-21T09:56:20Z

New change:

small fixes linked to conversations (update dev dependencies, associated fixes, ...)
move relevant documentation from warc2zim to here
include wombat.js, wombatSetup.js and generated rules files in both sdist and wheel (and not just the wheel)

rgaudin

LGTM but I think the Software Arch and Technical Arch are both rewriting-specific and should be marked as such

src/zimscraperlib/rewriting/statics/README.md

rgaudin · 2024-10-21T11:25:38Z

docs/functional_architecture.md

+
+zimscraperlib has primitives to manipulate pictures with some operations which are known to be shared across scrapers. See `image` module.
+
+## Store and rewrite mostly unmodified HTML, CSS and JS from online website


Everything that follows is related to warc2zim-like context: storing generic web stuff into a ZIM. The language used is not adapted for a general, scraperlib-level document.
Instead of fixing it I suggest to put it all inside a section related to this goal which makes uses of rewrting.

Yes, this is why I named this section like it is, and added the introduction speaking about rewriting. I don't get what more / different you would like to do.

Do you mean I should rename the files? I've left them more generic because I think it could contain docs about other zimscraperlib functions as well at some point.

All those points are on H2 level. I suggest to as an H2 and put the rest in H3.

As for the other two files, an introduction stating that currently this is only about the part in rewritting would remove confusion IMO

I promise I already intended to move them to H3 ... but obviously it never went into practice and always stayed in my mind ^^

rgaudin

👍

benoit74 self-assigned this Oct 17, 2024

benoit74 force-pushed the content_rewriting branch 12 times, most recently from ca31d4a to cc7c21d Compare October 18, 2024 13:05

benoit74 changed the title ~~Move rewriting stuff from warc2zim to zimscraperlib~~ Support single Python version and move rewriting stuff from warc2zim to zimscraperlib Oct 18, 2024

benoit74 force-pushed the content_rewriting branch from cc7c21d to 126436e Compare October 18, 2024 13:10

benoit74 changed the title ~~Support single Python version and move rewriting stuff from warc2zim to zimscraperlib~~ Support single Python version (3.12) and move rewriting stuff from warc2zim to zimscraperlib Oct 18, 2024

Move rewriting stuff from warc2zim to zimscraperlib

39ab439

benoit74 force-pushed the content_rewriting branch from dc9e04a to 39ab439 Compare October 18, 2024 13:38

benoit74 marked this pull request as ready for review October 18, 2024 13:40

benoit74 requested a review from rgaudin October 18, 2024 13:40

benoit74 mentioned this pull request Oct 18, 2024

Move to content rewriting code from zimscraperlib openzim/libretexts#33

Draft

rgaudin requested changes Oct 19, 2024

View reviewed changes

benoit74 added 4 commits October 21, 2024 08:48

fixup! Move rewriting stuff from warc2zim to zimscraperlib

b613fca

Add documentation, especially warc2zim doc about rewriting

c0f4e19

Fix formatting - prettier

17c8ec8

Fix for updated ruff PLR1714 rule

0d0a10f

benoit74 requested a review from rgaudin October 21, 2024 09:58

rgaudin requested changes Oct 21, 2024

View reviewed changes

fixup! Add documentation, especially warc2zim doc about rewriting

bed3960

benoit74 requested a review from rgaudin October 21, 2024 12:49

rgaudin approved these changes Oct 21, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support single Python version (3.12) and move rewriting stuff from warc2zim to zimscraperlib #204

Support single Python version (3.12) and move rewriting stuff from warc2zim to zimscraperlib #204

benoit74 commented Oct 17, 2024 •

edited

Loading

codecov bot commented Oct 17, 2024 •

edited

Loading

benoit74 commented Oct 18, 2024

rgaudin left a comment

rgaudin Oct 18, 2024

benoit74 Oct 21, 2024 •

edited

Loading

rgaudin Oct 18, 2024

benoit74 Oct 21, 2024

rgaudin Oct 19, 2024

benoit74 Oct 21, 2024

benoit74 commented Oct 21, 2024

benoit74 commented Oct 21, 2024

rgaudin left a comment

rgaudin Oct 21, 2024

benoit74 Oct 21, 2024

benoit74 Oct 21, 2024

rgaudin Oct 21, 2024

benoit74 Oct 21, 2024

rgaudin left a comment


		zimscraperlib has primitives to manipulate pictures with some operations which are known to be shared across scrapers. See `image` module.

		## Store and rewrite mostly unmodified HTML, CSS and JS from online website

Support single Python version (3.12) and move rewriting stuff from warc2zim to zimscraperlib #204

Are you sure you want to change the base?

Support single Python version (3.12) and move rewriting stuff from warc2zim to zimscraperlib #204

Conversation

benoit74 commented Oct 17, 2024 • edited Loading