-
-
Notifications
You must be signed in to change notification settings - Fork 16
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support single Python version (3.12) and move rewriting stuff from warc2zim to zimscraperlib #204
base: main
Are you sure you want to change the base?
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #204 +/- ##
==========================================
Coverage 100.00% 100.00%
==========================================
Files 33 38 +5
Lines 1616 2219 +603
Branches 303 426 +123
==========================================
+ Hits 1616 2219 +603 ☔ View full report in Codecov by Sentry. |
ca31d4a
to
cc7c21d
Compare
cc7c21d
to
126436e
Compare
dc9e04a
to
39ab439
Compare
@rgaudin good luck with this review, sorry for that, not sure how you can save time on this big thing ... especially since 90% of the code has already been reviewed in warc2zim ... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While I agree with your suggestion not to publish to NPM (at least for now), I am surprised that we dont publish it at all on releases.
We just have a dev version that fluctuates with main
but we cannot refer to a specific version/build and that needs to be fixed.
If the goal is to build at runtime, then there should be no automatic/trasnparent fallback. Those who cant build it and want to use the fallback should be satisfied with a note in the README.
Beside this question that is not clear, it all looks good to me.
pip install -U pip | ||
pip install -e .[scripts] | ||
|
||
- name: Generate rules |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not clear to me why we need to generate the rules for QA. Are we checking the format/linting of the compiled JS?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are not checking JS (I've been lazy on this tbh) but we are in Python, and it is even mandatory for pyright not to complain about missing import.
openzim.toml
Outdated
source="https://cdn.jsdelivr.net/npm/@webrecorder/wombat@3.8.2/dist/wombat.js" | ||
target_file="wombat.js" | ||
|
||
[files.assets.actions."wombatSetup.js"] # fallback if this script has not been properly built |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is unclear. What is the fallback and what are the conditions to trigger it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed as discussed, we include it in both sdist and wheel, so we do not need it anywhere else.
rules/rules.yaml
Outdated
# scraperlib we are not really fuzzy matching (searching the best entry among existing | ||
# ones) but just rewriting to proper path. | ||
# | ||
# This file is in sync with content at commit 879018d5b96962df82340a9a57570bbc0fc67815 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This commit is nowhere to be found: neither on war2zim, zimit or scraperlib. Should be fixed or removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a commit of wabac.js since we are speaking of syncing with this repo. I will update the comment to put the whole link to webrecorder/wabac.js@879018d so that it is even more obvious
We do release / version / publish it in the wheel, and this might be sufficient from my (very recent and uncertain) PoV. The fallback is more for those who would prefer to use the sdist but don't mind / have sufficient knowledge / understanding of the project to build the JS part (since it is kind of a trick anyway). Let's discuss it live, I agree it is far from simple. |
New change:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM but I think the Software Arch and Technical Arch are both rewriting-specific and should be marked as such
|
||
zimscraperlib has primitives to manipulate pictures with some operations which are known to be shared across scrapers. See `image` module. | ||
|
||
## Store and rewrite mostly unmodified HTML, CSS and JS from online website |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Everything that follows is related to warc2zim-like context: storing generic web stuff into a ZIM. The language used is not adapted for a general, scraperlib-level document.
Instead of fixing it I suggest to put it all inside a section related to this goal which makes uses of rewrting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this is why I named this section like it is, and added the introduction speaking about rewriting. I don't get what more / different you would like to do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean I should rename the files? I've left them more generic because I think it could contain docs about other zimscraperlib functions as well at some point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All those points are on H2 level. I suggest to as an H2 and put the rest in H3.
As for the other two files, an introduction stating that currently this is only about the part in rewritting
would remove confusion IMO
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I promise I already intended to move them to H3 ... but obviously it never went into practice and always stayed in my mind ^^
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
Support single Python version
Maintaining multiple Python versions in scraperlib has proved to not bring much values, since we are the main users of the scraperlib, and other contributors starting a new scraper (me few years ago, joseph more recently) are happy to use a recent Python version which is known to be the "best" available.
It is also a pain, due to some things not being available in old Python versions. For instance, here while importing rewriting logic from warc2zim, the
walk_up
options ofPurePosixPath.relative_to
is not available before 3.12 ; and many other semantics were broken in 3.9 and 3.10 (optionals mainly).Old scraperlib supported old Python versions are still here on PyPi should someone still need to use an old Python, like we do on our "old" scrapers. And from experience, when it is time to use recent features of the scraperlib on a scraper, we always take the update opportunity to also update Python, since this is our goal / duty.
Move rewriting stuff from warc2zim to zimscraperlib
Rewriting CSS, JS and HTML is not something specific to warc2zim, we need to do it for libretexts as well for instance. While being definitely way simpler, it is still a shame to duplicate this logic in every scraper. This PR is hence moving most of the rewriting logic here. Most because the border between warc2zim specifics and general code was not always easy to draw. I'm pretty sure we will rework it slightly in the coming months. It is however far sufficient as a first attempt.
It means that we also moves fuzzy rules definitions to scraperlib, which is great (problematic URLs on one scraper will be probablematic on others as well).
It means that we also moves JS code to setup wombat (including dynamic rewriting with fuzzy rules) in the scraperlib.
The CI hence becomes a bit more complex to handle these JS things.
Changes
While most of the code is identical to warc2zim code, I had to change few things, and also took the opportunity to clean / streamline some obscure corners. Here are the changes:
main
branch (with newPublishDev.yaml
workflow, and use it at python installation timeconsole.*
statements in prod build of wombatSetup.js (fixwombatSetup.js
floods console withurlRewritten:
messages warc2zim#404) with @rollup/plugin-stripbuild_js.sh
)Tests.yaml
What is left to decide
Nothing, this has been discussed with rgaudin. Kept here for reference purpose on which decisions have been taken and why.
we could publish wombat-setup to NPM ; I finally see little added value in doing so for now, wombatSetup.js is not meant to be used as a dependency in a JS project yet (can we really imagine a Vue.JS with wombat?) ; CI is anyway already ready and tested, I just commented-it out for now=> we will not publish it to NPM for now ; we will however include wombat.js (license is AGPL 3 which is OK) and wombatSetup.js (our own code) in both the wheel and the sdist to ease installation and ensure versions are aligneddo we want to version wombatSetup.js with semver? I think there is no added value for now, just the build date/time (see first line of https://dev.kiwix.org/zimscraperlib/wombatSetup.js) is sufficient to debug what is going on, and this file is released with the versioned zimscraperlib anyway.=> no, inclusion in the sdist and the wheel ensures that it is "tight" to a zimscraperlib version, probably sufficient given current usageTesting
This PR has been not yet been tested with warc2zim or libretexts, it looks complex enough and it is not a big deal to have a second PR to fix some integration issues should we get into something.
The
Publish
jobs have been tested for proper operation with hacks to run them on this branch temporarily and not real publish to their target but save artifacts temporarily. This worked well, so merging to main and then releasing should be relatively smooth.