Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rewrite (static) content from warc. #133

Merged
merged 9 commits into from
Dec 19, 2023
Merged

Conversation

mgautierfr
Copy link
Contributor

@mgautierfr mgautierfr commented Dec 11, 2023

This PR now rewrite url/link in static content (html and css) to relative links.

This PR should works with "static" website. Meaning:

  • No js
  • "local" js only (only relative links, no usage of reduced url)

Fixes #122

Copy link
Member

@rgaudin rgaudin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you ; this looks quite solid and versatile. At some point we'll probably either move or copy it to scraperlib.

Can you please use explicit fixture params so that when defining a fixture tuple, anyone knows what each part of the tuple means. Tests params should use the expanded params as well. At the moment, reviewing tests is difficult.

Haven't ran the tests ; awaiting the pylibzim release 🤓

@@ -41,7 +41,7 @@ def parse_title(content):

def to_string(input: str | bytes) -> str:
try:
input = input.decode("utf8")
input = input.decode("utf-8-sig")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's your need for this? I believe it's deprecated advised against

https://docs.python.org/3/library/codecs.html#encodings-and-unicode

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sadly, we have to handle the BOM anyway as it may be present in the content we get:

import requests

css = requests.get("https://donorbox.org/assets/application_embed-47da8f7456acb6aa58b61f2e5c664fccbf3cae5b0ad587f129dcd2d93caa65e8.css").content

print(content[:20])

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this particular use case it doesn't matter (it's just a zero width space) but for URLs it could be problematic as a typable url/path would become un-typable.
Doesn't harm anyway...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But it break the parsing of the css:

import tinycss2

css = requests.get(...)

# print `<QualifiedRule … { … }>` (skipping the first (at) rule)
print(tinycss2.parse_stylesheet(css.decode('utf-8'))[0])

# print `<AtRule @import … { … }>`
print(tinycss2.parse_stylesheet(css.decode('utf-8-sig'))[0])

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed the first rule doesn't appear in content (but is in prelude and thus serialized – but we wouldn't rewrite it!). Looks like we should open a ticket upstream.
Can you add a brief comment explaining tinycss2 doesn't handle it correctly? Including that sample URL would help I think.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Issue opened upstream : Kozea/tinycss2#52

However, I'm not sure it is a bug in tinycss. It is more us passing a not correctly decoded content to tinycss.

Can you add a brief comment explaining tinycss2 doesn't handle it correctly? Including that sample URL would help I think.

I will add a link to the tinycss2 issue as a comment.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, I'm not sure it is a bug in tinycss. It is more us passing a not correctly decoded content to tinycss.

We'll see what they think about. It will probably come down to what the CSS spec says. I wonder if browsers take care of it before sending it to parser or not.

src/warc2zim/items.py Outdated Show resolved Hide resolved
src/warc2zim/content_rewriting.py Outdated Show resolved Hide resolved
src/warc2zim/content_rewriting.py Outdated Show resolved Hide resolved
src/warc2zim/url_rewriting.py Outdated Show resolved Hide resolved
src/warc2zim/url_rewriting.py Outdated Show resolved Hide resolved
src/warc2zim/content_rewriting.py Outdated Show resolved Hide resolved
src/warc2zim/content_rewriting.py Show resolved Hide resolved
"A simple string without url",
"<html><body><p>This is a sentence with a http://exemple.com/path link</p></body></html>",
'<a data-source="http://exemple.com/path">A link we should not rewrite</a>',
'p style="background: url(some/image.png)">A link in a inline style</p>',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
'p style="background: url(some/image.png)">A link in a inline style</p>',
'p style="background: url(some/image.png)">A URL (not really) in a inline style</p>',

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You changed the label to A url (relative) in a inline style but it's still misleading. It looks like this should be rewrote but the fact that it's not has nothing to do with it being relative of not but the fact that it's not an actual inline style attr because this is outside a node.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haaa, but it should be in a node. It missing a < at the beginning of the string.

src/warc2zim/content_rewriting.py Outdated Show resolved Hide resolved
@mgautierfr
Copy link
Contributor Author

@rgaudin's comments fixed in fixup! commits

@mgautierfr mgautierfr requested a review from rgaudin December 14, 2023 16:21
@rgaudin
Copy link
Member

rgaudin commented Dec 15, 2023

Why did you force-push if you were using fixup commits?

From the quick look I took:

  • don't import posixpath, it's recommended not to and its not directly documented. I believe PurePosixPath should provide same functionality without importing it. Why did you change this part? Was it because of walk_up?
  • You did not change the test fixtures. Did you miss my comment, forget about it or are you against this change?

@mgautierfr
Copy link
Contributor Author

Why did you force-push if you were using fixup commits?

Because the fixup commits are in the middle of the git history.
This is to avoid conflict that would have to been fixed at the final rebase (without letting you reviewing it). Now fixup are "deterministic".

From the quick look I took:

don't import posixpath, it's recommended not to and its not directly documented. I believe PurePosixPath should provide same functionality without importing it. Why did you change this part? Was it because of walk_up?

Yes, (posix)path.relpath is walking up.
(PurePosix)Path is not (or you need to pass the walk_up).

I'm importing posixpath instead of path as we want to always use posix path, even if warc2zim is run on windows.

You did not change the test fixtures. Did you miss my comment, forget about it or are you against this change?

I've forget about it.

@mgautierfr
Copy link
Contributor Author

New fixup commit (d693a44) add the < at beginning of test.

Last commit introduces TestContent as a fixture tuple. But I think it doesn't help a lot, maybe I'm a bit too much in the tests and I don't totally see the improvement.
And now, a lot of test are the same, just using different "fixture set". We could move all of them into one test/fixture set but we would lost the information about what is actually tested.

@rgaudin rgaudin linked an issue Dec 16, 2023 that may be closed by this pull request
from warc2zim.utils import to_string
from typing import Callable, Optional, Iterable

type AttrsList = list[tuple[str, Optional[str]]]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is py3.12 only AFAIK

@@ -0,0 +1,170 @@
from html import escape
from html.parser import HTMLParser
from tinycss2 import parse_stylesheet, parse_declaration_list, serialize
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tinycss has not been added to requirements

@rgaudin
Copy link
Member

rgaudin commented Dec 18, 2023

Thank you @mgautierfr it's clearer now ; I've changed your nametuple thing to something simpler.
I've noticed one type def that's py3.12 only and that tinycss2 is not in reqs.

You should be able to rebase from norm and thus execute tests. Once those are green, we should be OK

@mgautierfr
Copy link
Contributor Author

I've noticed one type def that's py3.12 only and that tinycss2 is not in reqs.

Fixed in two small (last) fixup commits. Rebased on norm.

@mgautierfr mgautierfr force-pushed the content_rewriting branch 7 times, most recently from 63cfb53 to 40d924e Compare December 18, 2023 15:02
Copy link

codecov bot commented Dec 18, 2023

Codecov Report

Attention: 2 lines in your changes are missing coverage. Please review.

Comparison is base (3f3eb34) 86.23% compared to head (07664ea) 89.72%.

Files Patch % Lines
src/warc2zim/content_rewriting.py 98.42% 2 Missing ⚠️
Additional details and impacted files
@@              Coverage Diff              @@
##           warc2zim2     #133      +/-   ##
=============================================
+ Coverage      86.23%   89.72%   +3.49%     
=============================================
  Files              5        6       +1     
  Lines            414      555     +141     
  Branches          65       89      +24     
=============================================
+ Hits             357      498     +141     
- Misses            46       48       +2     
+ Partials          11        9       -2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Base automatically changed from url_normalization to warc2zim2 December 19, 2023 02:11
@rgaudin rgaudin self-requested a review December 19, 2023 10:04
Copy link
Member

@rgaudin rgaudin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All green ; @mgautierfr you can rebase and merge.

@rgaudin
Copy link
Member

rgaudin commented Dec 19, 2023

oh and my commits are gone 😐

@mgautierfr
Copy link
Contributor Author

Rebased/fixed-up.

oh and my commits are gone 😐

Sorry about that, I haven't pull before rebasing. Hope everything is good now.

@rgaudin
Copy link
Member

rgaudin commented Dec 19, 2023

The rest was details and I reset yesterday so it's OK 👍 Let's merge!

@mgautierfr mgautierfr merged commit 319d502 into warc2zim2 Dec 19, 2023
12 checks passed
@mgautierfr mgautierfr deleted the content_rewriting branch December 19, 2023 11:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Statically rewrite url in html and css content.
2 participants