-
-
Notifications
You must be signed in to change notification settings - Fork 7.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
hugo --minify breaks HTML element extraction #7567
Comments
It looks like HTML parsing trips over the JavaScript |
This issue has been automatically marked as stale because it has not had recent activity. The resources of the Hugo team are limited, and so we are asking for your help. |
This still happens with latest master, |
Some more triage ;)
As far as I remember you need to somehow printf the javascript code in connection with a safeJS. If you can, move the javascript into a static file outside of the layout file. I think this is a limitation of what is going on, not a bug per se. Maybe opening a thread over at discourse.gohugo.io might bring a speedy-er solution. |
Well, the simple solution is to not minify, distinction is less than a kilobyte for a small site. Workarounds around this bug are of no help, because the next user will run into it again. That's not how a reliable software should deal with issues. That said, here's a diff with Hugo 0.79.0, IIRC it used to be larger with earlier versions: $ diff -u hugo_stats.json hugo_stats.json.minify
--- hugo_stats.json 2020-12-19 21:30:30.559248875 +0100
+++ hugo_stats.json.minify 2020-12-19 21:30:20.487292074 +0100
@@ -1,7 +1,6 @@
{
"htmlElements": {
"tags": [
- "!--",
"!doctype",
"a",
"article",
@@ -23,7 +22,7 @@
"html",
"i",
"img",
- "img\n",
+ "imgdefer.length;i++){if(imgdefer[i].hasattribute('data-src')){imgdefer[i].setattribute('src',imgdefer[i].getattribute('data-src'));imgdefer[i].removeattribute('data-src');}}}\u003c/script",
"li",
"link",
"meta", |
Isn't it weird, that there is an "img" and an "img\n" in that list on un-minified pages? Could it be that some weird line ending is introduced in one of the partials and while browsers display the image nicely the HTML(XML) is not valid? Can you provide a unminified page sample and test it in a validator? https://validator.w3.org/ Hugo is only running the command. The problem is upstream in how the tags get extracted. |
Well, in HTML, a newline is perfectly valid whitespace. I often spread tags across multiple lines to make them readable.
Which command or project would this be? |
Let's ask @bep... internals are not something I can compute 👍 And regarding your answer to the two img tags... i would expect the parser to know, that a newline is a whitespace. so it should know that img and img-newline is the same. that is what I am saying. for the parser to acknowledge this as two separate instances of something might be an error in the parser, or a known inability depending on how the files get parsed. If some windowless browser is involved, it might fix things somehow. I am not saying the newline is an error in your document, I am saying it might be an indicator that the parser somehow mis-qualifies one specific img tag because there is some weird new lining going on that irritates it. |
There is no note in the code about how this get's extracted when I search for write stats |
I believe #8180 fixed this. |
With Hugo v0.81.0 (and v0.82.0), which includes #8180, it gets not perfect, but closer: $ diff -u hugo_stats.json hugo_stats.json.minify
--- hugo_stats.json 2021-04-05 16:23:07.790034356 +0200
+++ hugo_stats.json.minify 2021-04-05 16:23:17.842074077 +0200
@@ -1,10 +1,10 @@
{
"htmlElements": {
"tags": [
- "!--",
"!doctype",
"a",
"article",
+ "b.length;a++)b[a].hasattribute('data-src')\u0026\u0026(b[a].setattribute('src',b[a].getattribute('data-src')),b[a].removeattribute('data-src'))}\u003c/script",
"body",
"button",
"code", |
D'oh. I just see I was logged in with my other account. @merchantsedition, that's me :-) |
The truth comes to light ;D |
We just listen to the stream written to disk and look for HTML elements, which is extremely fast compared to doing a full scan. |
I have added a PR patch that fixes the obvious part of this bug, but without any "failing input" example, this is not possible to verify. It could be possible to move this before minify, but anyone who wants to take on that challenge needs to also write a proper benchmark. |
Thanks for the work. I just gave it a go and find no distinction in the output. To find out what's going on, I added some printf-debugging (I'm a PHP guy, so this is normal): diff --git a/publisher/htmlElementsCollector.go b/publisher/htmlElementsCollector.go
index d9479aaf..f2c35190 100644
--- a/publisher/htmlElementsCollector.go
+++ b/publisher/htmlElementsCollector.go
@@ -19,6 +19,7 @@ import (
"sort"
"strings"
"sync"
+ "fmt"
"github.com/gohugoio/hugo/helpers"
"golang.org/x/net/html"
@@ -118,7 +119,9 @@ func (w *cssClassCollectorWriter) Write(p []byte) (n int, err error) {
continue
}
+ fmt.Print("\n");
s := w.buff.String()
+ fmt.Print(s, "\n");
w.buff.Reset()
@@ -129,6 +132,7 @@ func (w *cssClassCollectorWriter) Write(p []byte) (n int, err error) {
key := s
s, tagName := w.insertStandinHTMLElement(s)
+ fmt.Print("tagName: ", tagName, "\n");
el := parseHTMLElement(s)
el.Tag = tagName
if w.isPreFormatted(tagName) { This gives the following output without
... and with
To me it looks like the mismatch happens earlier already, this entire JS snippet is recognized as tag. Only with minification, because minification removes the space between |
first find: in var defaultTdewolffConfig = tdewolffConfig{
HTML: html.Minifier{
KeepDocumentTags: true,
KeepConditionalComments: true,
KeepEndTags: true,
KeepDefaultAttrVals: true,
KeepQuotes: true, // <- should be added
KeepWhitespace: false,
}, For the other part of the "garbled" string from minifiy, I was mistaken and didn't understood the stream of data. Minify does not send the complete string or complete html tags to the html extraction. It sends token chunks of the start tag, if it has ids, classes or other attributes. It will leave end tags complete. |
Reactivating my former deleted comment: for the skipped test on minified func (w *cssClassCollectorWriter) Write(p []byte) (n int, err error) {
n = len(p)
i := 0
fmt.Printf("%v\n", string(p)) // <-- added test print for _, minify := range []bool{false, true} {
c.Run(fmt.Sprintf("%s--minify-%t", test.name, minify), func(c *qt.C) {
w := newHTMLElementsCollectorWriter(newHTMLElementsCollector())
fmt.Printf("%v\n", test.html) // <- added test print the incoming stream from minify for === RUN TestClassCollector/Script_tags_content_should_be_skipped--minify-false
<script><span>foo</span><span>bar</span></script><div class="foo"></div>
<script><span>foo</span><span>bar</span></script><div class="foo"></div>
=== RUN TestClassCollector/Script_tags_content_should_be_skipped--minify-true
<script><span>foo</span><span>bar</span></script><div class="foo"></div>
<script
>
htmlElementsCollector_test.go:126:
error:
values are not deep equal
diff (-got +want):
publisher.HTMLElements{
Tags: []string{
+ "div",
"script",
},
- Classes: nil,
+ Classes: []string{"foo"},
IDs: nil,
}
got:
publisher.HTMLElements{
Tags: {"script"},
Classes: nil,
IDs: nil,
}
want:
publisher.HTMLElements{
Tags: {"div", "script"},
Classes: {"foo"},
IDs: nil,
}
stack:
/Users/dirkolbrich/Coding/gohugo/hugo/publisher/htmlElementsCollector_test.go:126
c.Assert(got, qt.DeepEquals, test.expect) Using: ➜ publisher git:(master) ✗ hugo version
hugo v0.82.0+extended darwin/amd64 BuildDate=unknown
➜ publisher git:(master) ✗ git rev-parse --short HEAD
fa432b17 |
Closing. If you feel this is still a problem, please open a new issue with a minimal reproducible example (i.e., a repository we can clone and test with). |
This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs. |
... where 'unknown' should be 'today'.
Let code speak:
Which means, minification looses 20 HTML elements for unknown reasons. Which breaks my site :-)
Further diagnosis:
This shorter list contains this line:
This long line is certainly not a HTML tag, but similar to this chunk in one of the partials:
The text was updated successfully, but these errors were encountered: