Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support single Python version (3.12) and move rewriting stuff from warc2zim to zimscraperlib #204

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

benoit74
Copy link
Collaborator

@benoit74 benoit74 commented Oct 17, 2024

Support single Python version

Maintaining multiple Python versions in scraperlib has proved to not bring much values, since we are the main users of the scraperlib, and other contributors starting a new scraper (me few years ago, joseph more recently) are happy to use a recent Python version which is known to be the "best" available.

It is also a pain, due to some things not being available in old Python versions. For instance, here while importing rewriting logic from warc2zim, the walk_up options of PurePosixPath.relative_to is not available before 3.12 ; and many other semantics were broken in 3.9 and 3.10 (optionals mainly).

Old scraperlib supported old Python versions are still here on PyPi should someone still need to use an old Python, like we do on our "old" scrapers. And from experience, when it is time to use recent features of the scraperlib on a scraper, we always take the update opportunity to also update Python, since this is our goal / duty.

Move rewriting stuff from warc2zim to zimscraperlib

Rewriting CSS, JS and HTML is not something specific to warc2zim, we need to do it for libretexts as well for instance. While being definitely way simpler, it is still a shame to duplicate this logic in every scraper. This PR is hence moving most of the rewriting logic here. Most because the border between warc2zim specifics and general code was not always easy to draw. I'm pretty sure we will rework it slightly in the coming months. It is however far sufficient as a first attempt.

It means that we also moves fuzzy rules definitions to scraperlib, which is great (problematic URLs on one scraper will be probablematic on others as well).

It means that we also moves JS code to setup wombat (including dynamic rewriting with fuzzy rules) in the scraperlib.

The CI hence becomes a bit more complex to handle these JS things.

Changes

While most of the code is identical to warc2zim code, I had to change few things, and also took the opportunity to clean / streamline some obscure corners. Here are the changes:

  • automatically publish compiled wombatSetup.js to https://dev.kiwix.org/zimscraperlib/wombatSetup.js at every push on main branch (with new PublishDev.yaml workflow, and use it at python installation time
  • automatically add version as comment (+ debug log in dev) in wombatSetup.js with rollup-plugin-version-injector
  • automatically remove console.* statements in prod build of wombatSetup.js (fix wombatSetup.js floods console with urlRewritten: messages warc2zim#404) with @rollup/plugin-strip
  • simplify maintenance of CI with multiple chained jobs (avoid the dirty hack of warc2zim with a custom docker image to run stuff, with custom build_js.sh)
  • properly test build/packaging of JS code in Tests.yaml
  • slightly modify rewriting logic to uncouple few things
  • move tests, add more to ensure 100% coverage and proper testing even without warc2zim specific tests
  • move relevant documentation

What is left to decide

Nothing, this has been discussed with rgaudin. Kept here for reference purpose on which decisions have been taken and why.

  • we could publish wombat-setup to NPM ; I finally see little added value in doing so for now, wombatSetup.js is not meant to be used as a dependency in a JS project yet (can we really imagine a Vue.JS with wombat?) ; CI is anyway already ready and tested, I just commented-it out for now => we will not publish it to NPM for now ; we will however include wombat.js (license is AGPL 3 which is OK) and wombatSetup.js (our own code) in both the wheel and the sdist to ease installation and ensure versions are aligned
  • do we want to version wombatSetup.js with semver? I think there is no added value for now, just the build date/time (see first line of https://dev.kiwix.org/zimscraperlib/wombatSetup.js) is sufficient to debug what is going on, and this file is released with the versioned zimscraperlib anyway. => no, inclusion in the sdist and the wheel ensures that it is "tight" to a zimscraperlib version, probably sufficient given current usage

Testing

This PR has been not yet been tested with warc2zim or libretexts, it looks complex enough and it is not a big deal to have a second PR to fix some integration issues should we get into something.

The Publish jobs have been tested for proper operation with hacks to run them on this branch temporarily and not real publish to their target but save artifacts temporarily. This worked well, so merging to main and then releasing should be relatively smooth.

@benoit74 benoit74 self-assigned this Oct 17, 2024
Copy link

codecov bot commented Oct 17, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 100.00%. Comparing base (8b6cbac) to head (bed3960).

Additional details and impacted files
@@            Coverage Diff             @@
##              main      #204    +/-   ##
==========================================
  Coverage   100.00%   100.00%            
==========================================
  Files           33        38     +5     
  Lines         1616      2219   +603     
  Branches       303       426   +123     
==========================================
+ Hits          1616      2219   +603     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@benoit74 benoit74 force-pushed the content_rewriting branch 12 times, most recently from ca31d4a to cc7c21d Compare October 18, 2024 13:05
@benoit74 benoit74 changed the title Move rewriting stuff from warc2zim to zimscraperlib Support single Python version and move rewriting stuff from warc2zim to zimscraperlib Oct 18, 2024
@benoit74 benoit74 changed the title Support single Python version and move rewriting stuff from warc2zim to zimscraperlib Support single Python version (3.12) and move rewriting stuff from warc2zim to zimscraperlib Oct 18, 2024
@benoit74 benoit74 marked this pull request as ready for review October 18, 2024 13:40
@benoit74 benoit74 requested a review from rgaudin October 18, 2024 13:40
@benoit74
Copy link
Collaborator Author

@rgaudin good luck with this review, sorry for that, not sure how you can save time on this big thing ... especially since 90% of the code has already been reviewed in warc2zim ...

Copy link
Member

@rgaudin rgaudin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While I agree with your suggestion not to publish to NPM (at least for now), I am surprised that we dont publish it at all on releases.

We just have a dev version that fluctuates with main but we cannot refer to a specific version/build and that needs to be fixed.

If the goal is to build at runtime, then there should be no automatic/trasnparent fallback. Those who cant build it and want to use the fallback should be satisfied with a note in the README.

Beside this question that is not clear, it all looks good to me.

.github/workflows/Publish.yaml Outdated Show resolved Hide resolved
pip install -U pip
pip install -e .[scripts]

- name: Generate rules
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not clear to me why we need to generate the rules for QA. Are we checking the format/linting of the compiled JS?

Copy link
Collaborator Author

@benoit74 benoit74 Oct 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are not checking JS (I've been lazy on this tbh) but we are in Python, and it is even mandatory for pyright not to complain about missing import.

.pre-commit-config.yaml Outdated Show resolved Hide resolved
openzim.toml Outdated
source="https://cdn.jsdelivr.net/npm/@webrecorder/wombat@3.8.2/dist/wombat.js"
target_file="wombat.js"

[files.assets.actions."wombatSetup.js"] # fallback if this script has not been properly built
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is unclear. What is the fallback and what are the conditions to trigger it?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed as discussed, we include it in both sdist and wheel, so we do not need it anywhere else.

rules/rules.yaml Outdated
# scraperlib we are not really fuzzy matching (searching the best entry among existing
# ones) but just rewriting to proper path.
#
# This file is in sync with content at commit 879018d5b96962df82340a9a57570bbc0fc67815
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This commit is nowhere to be found: neither on war2zim, zimit or scraperlib. Should be fixed or removed.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a commit of wabac.js since we are speaking of syncing with this repo. I will update the comment to put the whole link to webrecorder/wabac.js@879018d so that it is even more obvious

@benoit74
Copy link
Collaborator Author

We just have a dev version that fluctuates with main but we cannot refer to a specific version/build and that needs to be fixed.

We do release / version / publish it in the wheel, and this might be sufficient from my (very recent and uncertain) PoV. The fallback is more for those who would prefer to use the sdist but don't mind / have sufficient knowledge / understanding of the project to build the JS part (since it is kind of a trick anyway).

Let's discuss it live, I agree it is far from simple.

@benoit74
Copy link
Collaborator Author

New change:

  • small fixes linked to conversations (update dev dependencies, associated fixes, ...)
  • move relevant documentation from warc2zim to here
  • include wombat.js, wombatSetup.js and generated rules files in both sdist and wheel (and not just the wheel)

@benoit74 benoit74 requested a review from rgaudin October 21, 2024 09:58
Copy link
Member

@rgaudin rgaudin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM but I think the Software Arch and Technical Arch are both rewriting-specific and should be marked as such

src/zimscraperlib/rewriting/statics/README.md Outdated Show resolved Hide resolved

zimscraperlib has primitives to manipulate pictures with some operations which are known to be shared across scrapers. See `image` module.

## Store and rewrite mostly unmodified HTML, CSS and JS from online website
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everything that follows is related to warc2zim-like context: storing generic web stuff into a ZIM. The language used is not adapted for a general, scraperlib-level document.
Instead of fixing it I suggest to put it all inside a section related to this goal which makes uses of rewrting.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is why I named this section like it is, and added the introduction speaking about rewriting. I don't get what more / different you would like to do.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean I should rename the files? I've left them more generic because I think it could contain docs about other zimscraperlib functions as well at some point.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All those points are on H2 level. I suggest to as an H2 and put the rest in H3.

As for the other two files, an introduction stating that currently this is only about the part in rewritting would remove confusion IMO

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I promise I already intended to move them to H3 ... but obviously it never went into practice and always stayed in my mind ^^

@benoit74 benoit74 requested a review from rgaudin October 21, 2024 12:49
Copy link
Member

@rgaudin rgaudin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

wombatSetup.js floods console with urlRewritten: messages
2 participants