Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Apply proper CSS for proper page display - step 1 #29

Merged
merged 1 commit into from
Oct 10, 2024
Merged

Conversation

benoit74
Copy link
Contributor

@benoit74 benoit74 commented Oct 8, 2024

This is part of #8

This first step takes care of CSS stylesheets which are in an external file (two indeed, one for screen and one for print).

It handles fetching all assets (images, fonts) referenced in the CSS.

It handles rewriting of the CSS to fix URLs.

It does not consider inline CSS which is needed and will be handled in a step 2.

@benoit74 benoit74 self-assigned this Oct 8, 2024
Copy link

codecov bot commented Oct 8, 2024

Codecov Report

Attention: Patch coverage is 68.51852% with 34 lines in your changes missing coverage. Please review.

Project coverage is 51.48%. Comparing base (733c35a) to head (4749161).
Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
scraper/src/libretexts2zim/processor.py 8.69% 21 Missing ⚠️
scraper/src/libretexts2zim/client.py 35.71% 9 Missing ⚠️
scraper/src/libretexts2zim/css.py 93.22% 2 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #29      +/-   ##
==========================================
+ Coverage   47.22%   51.48%   +4.25%     
==========================================
  Files           7        9       +2     
  Lines         432      540     +108     
  Branches       45       61      +16     
==========================================
+ Hits          204      278      +74     
- Misses        226      258      +32     
- Partials        2        4       +2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@benoit74 benoit74 marked this pull request as ready for review October 8, 2024 09:39
Base automatically changed from add_pages_to_zim to main October 8, 2024 09:55
@benoit74 benoit74 requested a review from rgaudin October 8, 2024 09:59
Copy link
Member

@rgaudin rgaudin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

How much of the CSS Processor is based on warc2zim?

Once this has matured enough, I think it should move to scraperlib.

scraper/src/libretexts2zim/client.py Outdated Show resolved Hide resolved
*[
parent.name
for parent in reversed(original_path.parents)
if parent.name and parent.name != ".."
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's this doing exactly? What's the consequence?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This allows to compute a "clean" target path to use inside the ZIM, even if original path was a relative one. E.g. for an asset located at ../folder/file.ext, it will place the file in the ZIM at folder/file.ext. This is meant to create a folder structure "similar" to the online one, but simplifying the file path to get rid of query strings for instance. I will isolate this logic in a dedicated function to add a docstring and make it clearer.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My concern is that we are removing those walking paths instead of resolving them so something like folder/sub1/sub2/sub3/sub4/../../../file.txt will end up at folder/sub1/sub2/sub3/sub4/file.ext and not at folder/file.ext.

Given all resources in-file are rewritten with full path, it should not have an impact but, as you suggested, this should all be properly documented as this would lead a future dev into a false trail on some random css issue.

)
if not relative_css_path:
return
url_node.value = str(relative_css_path) # pyright: ignore
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason for those wide ignores?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason is that there is 2 or 3 rules to ignore, I don't recall exactly, but when you specify it then it doesn't fit on one line, so black want to move things to match the 88 chars limit, and then it becomes a mess because hint is not a proper line anymore, ... Not ideal obviously.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I hate those. The problem is that those will never be fixed because the next developer will have no clue what was being ignored and this will lead to more stuff ignored, eventually weakening what we want to rely on.

I propose we keep them and disable black for those specific lines.

Here's what it would look like on that function:

    def _process_node(self, css_original_url: str, node: ast.Node):
        """Process one single CSS node"""
        if isinstance(
            node,
            ast.QualifiedRule
            | ast.SquareBracketsBlock
            | ast.ParenthesesBlock
            | ast.CurlyBracketsBlock,
        ):
            self._process_list(
                css_original_url,
                node.content,  # pyright: ignore[reportUnknownArgumentType, reportUnknownMemberType]
            )
        elif isinstance(node, ast.FunctionBlock):
            if node.lower_name == "url":  # pyright: ignore[reportUnknownMemberType]
                url_node: ast.Node = node.arguments[0]  # pyright: ignore
                relative_css_path = self._process_url(
                    css_original_url,
                    url_node.value,  # pyright: ignore [reportAttributeAccessIssue]
                )
                if not relative_css_path:
                    return
                url_node.value = str(relative_css_path)  # pyright: ignore [reportAttributeAccessIssue] ; fmt: skip
                url_node.representation = f'"{serialize_url(str(relative_css_path))}"'  # pyright: ignore [reportAttributeAccessIssue] ; fmt: skip

            else:
                self._process_list(
                    css_original_url,
                    node.arguments,
                )
        elif isinstance(node, ast.AtRule):
            self._process_list(
                css_original_url,
                node.prelude,
            )
            self._process_list(
                css_original_url,
                node.content,
            )
        elif isinstance(node, ast.Declaration):
            self._process_list(
                css_original_url,
                node.value,
            )
        elif isinstance(node, ast.URLToken):
            relative_css_path = self._process_url(
                css_original_url,
                node.value,
            )
            if not relative_css_path:
                return
            node.value = str(relative_css_path)
            node.representation = f"url({serialize_url(str(relative_css_path))})"

Interestingly, I simply removed three of those ignores 🙃

scraper/src/libretexts2zim/css.py Show resolved Hide resolved
@benoit74
Copy link
Contributor Author

How much of the CSS Processor is based on warc2zim?
Once this has matured enough, I think it should move to scraperlib.

This is my plan indeed. So far it is a bit of copy-and-paste indeed, but with modifications due to the specificities of libretexts. The same will happen with HTML rewriting which will be needed for images, videos, links, ... Both features are indeed not specific to warc2zim at all and primitives should definitely be shared in python-scraperlib, I'm sure we will need them again in other scrapers.

This first step takes care of CSS stylesheets which are in an external
file (two indeed, one for screen and one for print).

It does not consider inline CSS which is needed and will be handled in
a step 2.
@benoit74 benoit74 merged commit 152e8b7 into main Oct 10, 2024
10 checks passed
@benoit74 benoit74 deleted the add_standard_css branch October 10, 2024 07:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants