-
-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Rewrite (static) content from warc. #133
Conversation
05f9e19
to
fe2d583
Compare
cdba15c
to
4654a55
Compare
fe2d583
to
de09f5c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you ; this looks quite solid and versatile. At some point we'll probably either move or copy it to scraperlib.
Can you please use explicit fixture params so that when defining a fixture tuple, anyone knows what each part of the tuple means. Tests params should use the expanded params as well. At the moment, reviewing tests is difficult.
Haven't ran the tests ; awaiting the pylibzim release 🤓
@@ -41,7 +41,7 @@ def parse_title(content): | |||
|
|||
def to_string(input: str | bytes) -> str: | |||
try: | |||
input = input.decode("utf8") | |||
input = input.decode("utf-8-sig") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's your need for this? I believe it's deprecated advised against
https://docs.python.org/3/library/codecs.html#encodings-and-unicode
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sadly, we have to handle the BOM anyway as it may be present in the content we get:
import requests
css = requests.get("https://donorbox.org/assets/application_embed-47da8f7456acb6aa58b61f2e5c664fccbf3cae5b0ad587f129dcd2d93caa65e8.css").content
print(content[:20])
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For this particular use case it doesn't matter (it's just a zero width space) but for URLs it could be problematic as a typable url/path would become un-typable.
Doesn't harm anyway...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But it break the parsing of the css:
import tinycss2
css = requests.get(...)
# print `<QualifiedRule … { … }>` (skipping the first (at) rule)
print(tinycss2.parse_stylesheet(css.decode('utf-8'))[0])
# print `<AtRule @import … { … }>`
print(tinycss2.parse_stylesheet(css.decode('utf-8-sig'))[0])
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed the first rule doesn't appear in content (but is in prelude and thus serialized – but we wouldn't rewrite it!). Looks like we should open a ticket upstream.
Can you add a brief comment explaining tinycss2 doesn't handle it correctly? Including that sample URL would help I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Issue opened upstream : Kozea/tinycss2#52
However, I'm not sure it is a bug in tinycss. It is more us passing a not correctly decoded content to tinycss.
Can you add a brief comment explaining tinycss2 doesn't handle it correctly? Including that sample URL would help I think.
I will add a link to the tinycss2 issue as a comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
However, I'm not sure it is a bug in tinycss. It is more us passing a not correctly decoded content to tinycss.
We'll see what they think about. It will probably come down to what the CSS spec says. I wonder if browsers take care of it before sending it to parser or not.
tests/test_html_rewriting.py
Outdated
"A simple string without url", | ||
"<html><body><p>This is a sentence with a http://exemple.com/path link</p></body></html>", | ||
'<a data-source="http://exemple.com/path">A link we should not rewrite</a>', | ||
'p style="background: url(some/image.png)">A link in a inline style</p>', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
'p style="background: url(some/image.png)">A link in a inline style</p>', | |
'p style="background: url(some/image.png)">A URL (not really) in a inline style</p>', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You changed the label to A url (relative) in a inline style
but it's still misleading. It looks like this should be rewrote but the fact that it's not has nothing to do with it being relative of not but the fact that it's not an actual inline style attr because this is outside a node.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Haaa, but it should be in a node. It missing a <
at the beginning of the string.
de09f5c
to
99286e7
Compare
@rgaudin's comments fixed in fixup! commits |
Why did you force-push if you were using fixup commits? From the quick look I took:
|
Because the fixup commits are in the middle of the git history. From the quick look I took:
Yes, I'm importing
I've forget about it. |
99286e7
to
d3d3b80
Compare
New fixup commit (d693a44) add the Last commit introduces |
src/warc2zim/content_rewriting.py
Outdated
from warc2zim.utils import to_string | ||
from typing import Callable, Optional, Iterable | ||
|
||
type AttrsList = list[tuple[str, Optional[str]]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is py3.12 only AFAIK
src/warc2zim/content_rewriting.py
Outdated
@@ -0,0 +1,170 @@ | |||
from html import escape | |||
from html.parser import HTMLParser | |||
from tinycss2 import parse_stylesheet, parse_declaration_list, serialize |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tinycss has not been added to requirements
Thank you @mgautierfr it's clearer now ; I've changed your nametuple thing to something simpler. You should be able to rebase from norm and thus execute tests. Once those are green, we should be OK |
c4f8a11
to
cd3f45c
Compare
Fixed in two small (last) fixup commits. Rebased on norm. |
63cfb53
to
40d924e
Compare
d1a7d63
to
0bfddc9
Compare
3e24bd7
to
b39df02
Compare
Codecov ReportAttention:
Additional details and impacted files@@ Coverage Diff @@
## warc2zim2 #133 +/- ##
=============================================
+ Coverage 86.23% 89.72% +3.49%
=============================================
Files 5 6 +1
Lines 414 555 +141
Branches 65 89 +24
=============================================
+ Hits 357 498 +141
- Misses 46 48 +2
+ Partials 11 9 -2 ☔ View full report in Codecov by Sentry. |
b39df02
to
4d67913
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All green ; @mgautierfr you can rebase and merge.
oh and my commits are gone 😐 |
Head insert has been (temporarily) removed (to be readded in next commit).
Content may contain BOM. `utf-8-sig` handle it. Default `utf8` encoding keep it in the decoded string and break parsing.
5563058
to
07664ea
Compare
Rebased/fixed-up.
Sorry about that, I haven't pull before rebasing. Hope everything is good now. |
The rest was details and I reset yesterday so it's OK 👍 Let's merge! |
This PR now rewrite url/link in static content (html and css) to relative links.
This PR should works with "static" website. Meaning:
Fixes #122