Rewrite (static) content from warc. #133

mgautierfr · 2023-12-11T13:41:02Z

This PR now rewrites URLs/links in static content (HTML and CSS) to relative links.

This PR should work with "static" websites, meaning:

No js
"local" js only (only relative links, no usage of reduced url)

Fixes #122
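As a rough illustration of what "rewriting links to relative links" means for static HTML, here is a minimal sketch. It is not the code from this PR: the names `to_relative`, `LinkRewriter` and `rewrite_html`, the `<host><path>` storage layout, and the choice to rewrite only `href`/`src` are all assumptions made for the example. Comments, doctypes and self-closing tags are deliberately not handled, and query strings and fragments are dropped.

```python
import posixpath
from html import escape
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


def to_relative(url: str, page_url: str) -> str:
    """Return `url` as a path relative to the document at `page_url`."""
    absolute = urlsplit(urljoin(page_url, url))
    if absolute.scheme not in ("http", "https"):
        # data:, blob: and similar URLs are left untouched.
        return url
    # Assumed layout (hypothetical): every entry is stored at "<host><path>".
    target = absolute.netloc + absolute.path
    page = urlsplit(page_url)
    base_dir = posixpath.dirname(page.netloc + page.path)
    return posixpath.relpath(target, base_dir)


class LinkRewriter(HTMLParser):
    """Re-serialize HTML, rewriting href/src attributes on the way."""

    def __init__(self, page_url: str):
        # Keep character references as-is so text is passed through untouched.
        super().__init__(convert_charrefs=False)
        self.page_url = page_url
        self.output: list[str] = []

    def handle_starttag(self, tag, attrs):
        parts = [tag]
        for name, value in attrs:
            if value is None:  # attribute without a value, e.g. "hidden"
                parts.append(name)
                continue
            if name in ("href", "src"):
                value = to_relative(value, self.page_url)
            parts.append(f'{name}="{escape(value, quote=True)}"')
        self.output.append("<" + " ".join(parts) + ">")

    def handle_endtag(self, tag):
        self.output.append(f"</{tag}>")

    def handle_data(self, data):
        self.output.append(data)

    def handle_entityref(self, name):
        self.output.append(f"&{name};")

    def handle_charref(self, name):
        self.output.append(f"&#{name};")


def rewrite_html(html: str, page_url: str) -> str:
    parser = LinkRewriter(page_url)
    parser.feed(html)
    parser.close()
    return "".join(parser.output)


page = "https://example.com/a/b/page.html"
print(rewrite_html('<a href="/a/c/other.html">other</a>', page))
print(rewrite_html('<img src="data:image/png;base64,AAAA">', page))
```

Because every link ends up relative to the page that contains it, the rewritten content no longer depends on the host it was captured from, which is why this approach only covers sites without JS-generated URLs.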

rgaudin

Thank you ; this looks quite solid and versatile. At some point we'll probably either move or copy it to scraperlib.

Can you please use explicit fixture params so that when defining a fixture tuple, anyone knows what each part of the tuple means. Tests params should use the expanded params as well. At the moment, reviewing tests is difficult.

Haven't run the tests ; awaiting the pylibzim release 🤓

rgaudin · 2023-12-12T11:04:01Z

src/warc2zim/utils.py

@@ -41,7 +41,7 @@ def parse_title(content):

 def to_string(input: str | bytes) -> str:
    try:
-        input = input.decode("utf8")
+        input = input.decode("utf-8-sig")


What's your need for this? I believe it's ~~deprecated~~ advised against

https://docs.python.org/3/library/codecs.html#encodings-and-unicode

Sadly, we have to handle the BOM anyway as it may be present in the content we get:

    import requests

    css = requests.get("https://donorbox.org/assets/application_embed-47da8f7456acb6aa58b61f2e5c664fccbf3cae5b0ad587f129dcd2d93caa65e8.css").content
    print(css[:20])

For this particular use case it doesn't matter (it's just a zero-width no-break space), but for URLs it could be problematic, as a typable URL/path would become untypable.
It does no harm anyway...

But it breaks the parsing of the CSS:

    import tinycss2

    css = requests.get(...).content
    # prints `<QualifiedRule … { … }>` (skipping the first (at) rule)
    print(tinycss2.parse_stylesheet(css.decode('utf-8'))[0])
    # prints `<AtRule @import … { … }>`
    print(tinycss2.parse_stylesheet(css.decode('utf-8-sig'))[0])
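The decoding difference behind this can be reproduced with the standard library alone, without fetching anything. The sample bytes below are made up for the illustration; only the two codec names come from the diff.

```python
import codecs

# Hypothetical stylesheet bytes starting with a UTF-8 byte order mark.
raw = codecs.BOM_UTF8 + b'@import url("base.css");'

with_bom = raw.decode("utf-8")         # BOM survives as U+FEFF
without_bom = raw.decode("utf-8-sig")  # BOM is stripped

print(with_bom.startswith("\ufeff"))      # True
print(with_bom.startswith("@import"))     # False
print(without_bom.startswith("@import"))  # True

# utf-8-sig is harmless on content that has no BOM.
print(b"body {}".decode("utf-8-sig") == "body {}")  # True
```

With plain `utf-8` the first character handed to the CSS parser is U+FEFF rather than `@`, which is consistent with the first at-rule not being recognized.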

Indeed the first rule doesn't appear in content (but is in prelude and thus serialized – but we wouldn't rewrite it!). Looks like we should open a ticket upstream.
Can you add a brief comment explaining tinycss2 doesn't handle it correctly? Including that sample URL would help I think.

Issue opened upstream : Kozea/tinycss2#52

However, I'm not sure it is a bug in tinycss2. It is more that we are passing incorrectly decoded content to tinycss2.

Can you add a brief comment explaining tinycss2 doesn't handle it correctly? Including that sample URL would help I think.

I will add a link to the tinycss2 issue as a comment.

However, I'm not sure it is a bug in tinycss2. It is more that we are passing incorrectly decoded content to tinycss2.

We'll see what they think about it. It will probably come down to what the CSS spec says. I wonder whether browsers take care of it before sending the content to the parser.

rgaudin · 2023-12-12T11:50:45Z

tests/test_html_rewriting.py

+        "A simple string without url",
+        "<html><body><p>This is a sentence with a http://exemple.com/path link</p></body></html>",
+        '<a data-source="http://exemple.com/path">A link we should not rewrite</a>',
+        'p style="background: url(some/image.png)">A link in a inline style</p>',


Suggested change

-        'p style="background: url(some/image.png)">A link in a inline style</p>',
+        'p style="background: url(some/image.png)">A URL (not really) in a inline style</p>',

You changed the label to "A url (relative) in a inline style" but it's still misleading. It looks like this should be rewritten, but the reason it isn't has nothing to do with it being relative or not: it's not an actual inline style attribute, because it sits outside a node.

Haaa, but it should be in a node. It is missing a < at the beginning of the string.
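To make the point about the missing < concrete, here is a small sketch using only the standard library `html.parser`. The `TagCollector` class and `collect` helper are names invented for the example; the two input strings are the test string from the diff, with and without the leading <.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Record the start tags and the text that the parser reports."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        self.tags.append((tag, dict(attrs)))

    def handle_data(self, data):
        self.text.append(data)


def collect(html: str) -> TagCollector:
    parser = TagCollector()
    parser.feed(html)
    parser.close()
    return parser


broken = 'p style="background: url(some/image.png)">A link in a inline style</p>'
fixed = "<" + broken

print(collect(broken).tags)  # [] : the would-be tag is plain text
print(collect(fixed).tags)   # [('p', {'style': 'background: url(some/image.png)'})]
```

Without the leading <, the parser never sees a `style` attribute, so there is nothing for an inline-style rewriter to act on; the string is left unchanged for a reason unrelated to the URL being relative.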

mgautierfr · 2023-12-14T16:21:34Z

@rgaudin's comments fixed in fixup! commits

rgaudin · 2023-12-15T10:53:17Z

Why did you force-push if you were using fixup commits?

From the quick look I took:

don't import posixpath; it's recommended not to and it's not directly documented. I believe PurePosixPath should provide the same functionality without importing it. Why did you change this part? Was it because of walk_up?
You did not change the test fixtures. Did you miss my comment, forget about it or are you against this change?

mgautierfr · 2023-12-15T11:13:07Z

Why did you force-push if you were using fixup commits?

Because the fixup commits are in the middle of the git history.
This avoids conflicts that would otherwise have to be fixed at the final rebase (without letting you review them). Now the fixups are "deterministic".

From the quick look I took:

don't import posixpath; it's recommended not to and it's not directly documented. I believe PurePosixPath should provide the same functionality without importing it. Why did you change this part? Was it because of walk_up?

Yes, (posix)path.relpath walks up.
(PurePosix)Path.relative_to does not (unless you pass walk_up).

I'm importing posixpath instead of os.path because we always want to use POSIX paths, even if warc2zim is run on Windows.
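The difference being discussed can be sketched as follows. The two paths are made-up examples; `walk_up` is only available on `PurePath.relative_to` from Python 3.12, so that call is guarded by a version check.

```python
import posixpath
import sys
from pathlib import PurePosixPath

# Hypothetical entry paths, not taken from the PR.
target = "example.com/css/site.css"
start = "example.com/blog/2023"

# posixpath.relpath climbs out of `start` with ".." segments as needed.
print(posixpath.relpath(target, start))  # ../../css/site.css

# PurePosixPath.relative_to refuses when target is not below start.
try:
    PurePosixPath(target).relative_to(start)
except ValueError as exc:
    print("relative_to refused:", exc)

# From Python 3.12 on, walk_up=True gives the same result as relpath.
if sys.version_info >= (3, 12):
    print(PurePosixPath(target).relative_to(start, walk_up=True))
```

Using `posixpath` rather than `os.path` keeps `/` as the separator regardless of the platform the code runs on.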

You did not change the test fixtures. Did you miss my comment, forget about it or are you against this change?

I forgot about it.

mgautierfr · 2023-12-15T16:11:01Z

New fixup commit (d693a44) adds the < at the beginning of the test string.

The last commit introduces TestContent as a fixture tuple. But I think it doesn't help much; maybe I'm too deep in the tests to see the improvement.
And now a lot of tests are the same, just using a different "fixture set". We could move all of them into one test/fixture set, but we would lose the information about what is actually tested.
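For readers unfamiliar with the idea of explicit fixture params, here is a minimal sketch of naming each part of a test case instead of using a bare tuple. The class name `ContentCase`, its fields, and the stand-in `identity_rewrite` function are hypothetical; the PR's actual TestContent may look different.

```python
from dataclasses import dataclass


@dataclass
class ContentCase:
    """One rewriting test case, with every part named (hypothetical fields)."""

    input_str: str
    expected_str: str
    article_url: str = "example.com/index.html"


def identity_rewrite(content: str, article_url: str) -> str:
    """Stand-in for the real rewriter: returns the content unchanged."""
    return content


# Cases that must come out untouched: expected output equals the input.
NO_REWRITE_CASES = [
    ContentCase(input_str=text, expected_str=text)
    for text in (
        "A simple string without url",
        '<a data-source="http://exemple.com/path">A link we should not rewrite</a>',
    )
]


def check(case: ContentCase) -> bool:
    return identity_rewrite(case.input_str, case.article_url) == case.expected_str


print(all(check(case) for case in NO_REWRITE_CASES))  # True
```

Keeping one named list per behaviour (for example "no rewrite" versus "rewrite expected") preserves the information about what each group of cases is meant to test.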

rgaudin · 2023-12-18T10:41:59Z

src/warc2zim/content_rewriting.py

+from warc2zim.utils import to_string
+from typing import Callable, Optional, Iterable
+
+type AttrsList = list[tuple[str, Optional[str]]]


This is py3.12 only AFAIK
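For context, the `type` statement is the Python 3.12 syntax for aliases. A sketch of an equivalent alias that also works on older interpreters is shown below; the helper `first_value` is a made-up function to show the alias in use, and this is not necessarily how the fixup commit resolved it.

```python
from typing import Optional

# Python 3.12+ only:
#     type AttrsList = list[tuple[str, Optional[str]]]

# A plain assignment creates an equivalent alias on Python 3.9+.
AttrsList = list[tuple[str, Optional[str]]]


def first_value(attrs: AttrsList, name: str) -> Optional[str]:
    """Return the value of the first attribute called `name`, if any."""
    for key, value in attrs:
        if key == name:
            return value
    return None


attrs: AttrsList = [("href", "/path"), ("hidden", None)]
print(first_value(attrs, "href"))    # /path
print(first_value(attrs, "hidden"))  # None
```

The shape matches what `html.parser.HTMLParser` passes to `handle_starttag`: a list of `(name, value)` pairs where the value is `None` for attributes written without one.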

rgaudin · 2023-12-18T10:55:12Z

src/warc2zim/content_rewriting.py

@@ -0,0 +1,170 @@
+from html import escape
+from html.parser import HTMLParser
+from tinycss2 import parse_stylesheet, parse_declaration_list, serialize


tinycss2 has not been added to requirements

rgaudin · 2023-12-18T11:25:22Z

Thank you @mgautierfr it's clearer now ; I've changed your namedtuple thing to something simpler.
I've noticed one type def that's py3.12 only and that tinycss2 is not in reqs.

You should be able to rebase from norm and thus execute tests. Once those are green, we should be OK

mgautierfr · 2023-12-18T14:07:23Z

I've noticed one type def that's py3.12 only and that tinycss2 is not in reqs.

Fixed in two small (last) fixup commits. Rebased on norm.

codecov · 2023-12-18T15:48:52Z

Codecov Report

Attention: 2 lines in your changes are missing coverage. Please review.

Comparison is base (3f3eb34) 86.23% compared to head (07664ea) 89.72%.

Files	Patch %	Lines
src/warc2zim/content_rewriting.py	98.42%	2 Missing ⚠️

Additional details and impacted files

@@              Coverage Diff              @@
##           warc2zim2     #133      +/-   ##
=============================================
+ Coverage      86.23%   89.72%   +3.49%     
=============================================
  Files              5        6       +1     
  Lines            414      555     +141     
  Branches          65       89      +24     
=============================================
+ Hits             357      498     +141     
- Misses            46       48       +2     
+ Partials          11        9       -2

rgaudin

All green ; @mgautierfr you can rebase and merge.

rgaudin · 2023-12-19T10:07:34Z

oh and my commits are gone 😐

mgautierfr · 2023-12-19T10:50:20Z

Rebased/fixed-up.

oh and my commits are gone 😐

Sorry about that, I hadn't pulled before rebasing. Hope everything is good now.

rgaudin · 2023-12-19T10:57:00Z

The rest was details and I reset yesterday so it's OK 👍 Let's merge!

mgautierfr requested review from rgaudin and benoit74 December 11, 2023 13:41

mgautierfr force-pushed the content_rewriting branch from 05f9e19 to fe2d583 Compare December 11, 2023 14:00

mgautierfr force-pushed the url_normalization branch from cdba15c to 4654a55 Compare December 11, 2023 14:34

mgautierfr force-pushed the content_rewriting branch from fe2d583 to de09f5c Compare December 11, 2023 15:34

rgaudin requested changes Dec 12, 2023

mgautierfr force-pushed the content_rewriting branch from de09f5c to 99286e7 Compare December 14, 2023 16:21

mgautierfr requested a review from rgaudin December 14, 2023 16:21

mgautierfr force-pushed the content_rewriting branch from 99286e7 to d3d3b80 Compare December 15, 2023 15:57

rgaudin linked an issue Dec 16, 2023 that may be closed by this pull request

Statically rewrite url in html and css content. #122

Closed

rgaudin reviewed Dec 18, 2023

mgautierfr force-pushed the content_rewriting branch from c4f8a11 to cd3f45c Compare December 18, 2023 14:06

mgautierfr force-pushed the content_rewriting branch 7 times, most recently from 63cfb53 to 40d924e Compare December 18, 2023 15:02

mgautierfr force-pushed the url_normalization branch from d1a7d63 to 0bfddc9 Compare December 18, 2023 15:18

mgautierfr force-pushed the content_rewriting branch from 3e24bd7 to b39df02 Compare December 18, 2023 15:47

Base automatically changed from url_normalization to warc2zim2 December 19, 2023 02:11

mgautierfr force-pushed the content_rewriting branch from b39df02 to 4d67913 Compare December 19, 2023 08:57

rgaudin self-requested a review December 19, 2023 10:04

rgaudin approved these changes Dec 19, 2023

mgautierfr and others added 9 commits December 19, 2023 11:43

Rewrite url in html content.

4e8aa7d

Head insert has been (temporarily) removed (to be readded in next commit).

Do not rewrite data: and blob: url

629f878

Readd head and css insert.

d353cf9

Introduce CSS rewriting

3133337

Rewrite CSS embeded in html.

3ccb49f

Decode bytes using encoding utf-8-sig

8dfd19d

Content may contain a BOM. `utf-8-sig` handles it. The default `utf8` encoding keeps it in the decoded string and breaks parsing.

Introduce TestContent to represent what to rewrite and what is expected.

6bc60ea

Pattern matching is python 3.10 only.

c00a576

TestContent using dataclass

07664ea

mgautierfr force-pushed the content_rewriting branch from 5563058 to 07664ea Compare December 19, 2023 10:49

mgautierfr merged commit 319d502 into warc2zim2 Dec 19, 2023
12 checks passed

mgautierfr deleted the content_rewriting branch December 19, 2023 11:00
