Apply proper CSS for proper page display - step 1 #29

benoit74 · 2024-10-08T09:28:29Z

This is part of #8

This first step takes care of CSS stylesheets which are in an external file (two indeed, one for screen and one for print).

It handles fetching all assets (images, fonts) referenced in the CSS.

It handles rewriting of the CSS to fix URLs.

It does not consider inline CSS which is needed and will be handled in a step 2.

codecov · 2024-10-08T09:30:04Z

Codecov Report

Attention: Patch coverage is 68.51852% with 34 lines in your changes missing coverage. Please review.

Project coverage is 51.48%. Comparing base (733c35a) to head (4749161).
Report is 2 commits behind head on main.

Files with missing lines	Patch %	Lines
scraper/src/libretexts2zim/processor.py	8.69%	21 Missing ⚠️
scraper/src/libretexts2zim/client.py	35.71%	9 Missing ⚠️
scraper/src/libretexts2zim/css.py	93.22%	2 Missing and 2 partials ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main      #29      +/-   ##
==========================================
+ Coverage   47.22%   51.48%   +4.25%     
==========================================
  Files           7        9       +2     
  Lines         432      540     +108     
  Branches       45       61      +16     
==========================================
+ Hits          204      278      +74     
- Misses        226      258      +32     
- Partials        2        4       +2

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

rgaudin

👍

How much of the CSS Processor is based on warc2zim?

Once this has matured enough, I think it should move to scraperlib.

scraper/src/libretexts2zim/client.py

rgaudin · 2024-10-09T09:04:58Z

scraper/src/libretexts2zim/css.py

+            *[
+                parent.name
+                for parent in reversed(original_path.parents)
+                if parent.name and parent.name != ".."


What's this doing exactly? What's the consequence?

This allows to compute a "clean" target path to use inside the ZIM, even if original path was a relative one. E.g. for an asset located at ../folder/file.ext, it will place the file in the ZIM at folder/file.ext. This is meant to create a folder structure "similar" to the online one, but simplifying the file path to get rid of query strings for instance. I will isolate this logic in a dedicated function to add a docstring and make it clearer.

My concern is that we are removing those walking paths instead of resolving them so something like folder/sub1/sub2/sub3/sub4/../../../file.txt will end up at folder/sub1/sub2/sub3/sub4/file.ext and not at folder/file.ext.

Given all resources in-file are rewritten with full path, it should not have an impact but, as you suggested, this should all be properly documented as this would lead a future dev into a false trail on some random css issue.

rgaudin · 2024-10-09T09:11:04Z

scraper/src/libretexts2zim/css.py

+                )
+                if not relative_css_path:
+                    return
+                url_node.value = str(relative_css_path)  # pyright: ignore


Is there a reason for those wide ignores?

The reason is that there is 2 or 3 rules to ignore, I don't recall exactly, but when you specify it then it doesn't fit on one line, so black want to move things to match the 88 chars limit, and then it becomes a mess because hint is not a proper line anymore, ... Not ideal obviously.

Yeah I hate those. The problem is that those will never be fixed because the next developer will have no clue what was being ignored and this will lead to more stuff ignored, eventually weakening what we want to rely on.

I propose we keep them and disable black for those specific lines.

Here's what it would look like on that function:

def _process_node(self, css_original_url: str, node: ast.Node): """Process one single CSS node""" if isinstance( node, ast.QualifiedRule | ast.SquareBracketsBlock | ast.ParenthesesBlock | ast.CurlyBracketsBlock, ): self._process_list( css_original_url, node.content, # pyright: ignore[reportUnknownArgumentType, reportUnknownMemberType] ) elif isinstance(node, ast.FunctionBlock): if node.lower_name == "url": # pyright: ignore[reportUnknownMemberType] url_node: ast.Node = node.arguments[0] # pyright: ignore relative_css_path = self._process_url( css_original_url, url_node.value, # pyright: ignore [reportAttributeAccessIssue] ) if not relative_css_path: return url_node.value = str(relative_css_path) # pyright: ignore [reportAttributeAccessIssue] ; fmt: skip url_node.representation = f'"{serialize_url(str(relative_css_path))}"' # pyright: ignore [reportAttributeAccessIssue] ; fmt: skip else: self._process_list( css_original_url, node.arguments, ) elif isinstance(node, ast.AtRule): self._process_list( css_original_url, node.prelude, ) self._process_list( css_original_url, node.content, ) elif isinstance(node, ast.Declaration): self._process_list( css_original_url, node.value, ) elif isinstance(node, ast.URLToken): relative_css_path = self._process_url( css_original_url, node.value, ) if not relative_css_path: return node.value = str(relative_css_path) node.representation = f"url({serialize_url(str(relative_css_path))})"

Interestingly, I simply removed three of those ignores 🙃

scraper/src/libretexts2zim/css.py

benoit74 · 2024-10-10T06:44:30Z

How much of the CSS Processor is based on warc2zim?
Once this has matured enough, I think it should move to scraperlib.

This is my plan indeed. So far it is a bit of copy-and-paste indeed, but with modifications due to the specificities of libretexts. The same will happen with HTML rewriting which will be needed for images, videos, links, ... Both features are indeed not specific to warc2zim at all and primitives should definitely be shared in python-scraperlib, I'm sure we will need them again in other scrapers.

This first step takes care of CSS stylesheets which are in an external file (two indeed, one for screen and one for print). It does not consider inline CSS which is needed and will be handled in a step 2.

benoit74 self-assigned this Oct 8, 2024

benoit74 force-pushed the add_standard_css branch from c817b49 to 5ef373a Compare October 8, 2024 09:36

benoit74 marked this pull request as ready for review October 8, 2024 09:39

benoit74 force-pushed the add_pages_to_zim branch from 832e7cb to c3b6266 Compare October 8, 2024 09:41

Base automatically changed from add_pages_to_zim to main October 8, 2024 09:55

benoit74 force-pushed the add_standard_css branch from 5ef373a to b6d1d52 Compare October 8, 2024 09:56

benoit74 requested a review from rgaudin October 8, 2024 09:59

benoit74 mentioned this pull request Oct 8, 2024

Apply proper CSS for proper page display - step 2 #30

Merged

rgaudin approved these changes Oct 9, 2024

View reviewed changes

Apply proper CSS for proper page display - step 1

4749161

This first step takes care of CSS stylesheets which are in an external file (two indeed, one for screen and one for print). It does not consider inline CSS which is needed and will be handled in a step 2.

benoit74 force-pushed the add_standard_css branch from b6d1d52 to 4749161 Compare October 10, 2024 07:43

benoit74 merged commit 152e8b7 into main Oct 10, 2024
10 checks passed

benoit74 deleted the add_standard_css branch October 10, 2024 07:46

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Apply proper CSS for proper page display - step 1 #29

Apply proper CSS for proper page display - step 1 #29

benoit74 commented Oct 8, 2024 •

edited

Loading

codecov bot commented Oct 8, 2024 •

edited

Loading

rgaudin left a comment

rgaudin Oct 9, 2024

benoit74 Oct 10, 2024

rgaudin Oct 10, 2024

rgaudin Oct 9, 2024

benoit74 Oct 10, 2024

rgaudin Oct 10, 2024

benoit74 commented Oct 10, 2024

Apply proper CSS for proper page display - step 1 #29

Apply proper CSS for proper page display - step 1 #29

Conversation

benoit74 commented Oct 8, 2024 • edited Loading

codecov bot commented Oct 8, 2024 • edited Loading

Codecov Report

rgaudin left a comment

Choose a reason for hiding this comment

rgaudin Oct 9, 2024

Choose a reason for hiding this comment

benoit74 Oct 10, 2024

Choose a reason for hiding this comment

rgaudin Oct 10, 2024

Choose a reason for hiding this comment

rgaudin Oct 9, 2024

Choose a reason for hiding this comment

benoit74 Oct 10, 2024

Choose a reason for hiding this comment

rgaudin Oct 10, 2024

Choose a reason for hiding this comment

benoit74 commented Oct 10, 2024

benoit74 commented Oct 8, 2024 •

edited

Loading

codecov bot commented Oct 8, 2024 •

edited

Loading