-
-
Notifications
You must be signed in to change notification settings - Fork 7.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Parsing robustness #8436
Parsing robustness #8436
Conversation
In case somebody wants to enhance the testsuite, here are additional cases:
Yes, newlines and tab vs. space matters :-) Failure of each is recognized by non-alphabetical characters appearing in the |
|
Thanks for this. There are failing tests. It looks mostly to be elements inside pre/textarea, which I don't see any reason to collect (but please prove me wrong). Also, it would be great if you could
name old time/op new time/op delta
ClassCollectorWriter-16 21.2µs ± 2% 47.3µs ± 1% +123.21% (p=0.029 n=4+4)
name old alloc/op new alloc/op delta
ClassCollectorWriter-16 35.3kB ± 0% 86.4kB ± 0% +144.95% (p=0.029 n=4+4)
name old allocs/op new allocs/op delta
ClassCollectorWriter-16 155 ± 0% 329 ± 0% +112.26% (p=0.029 n=4+4) Note that I don't mind it getting slower if really needed, but I fail to understand the motivation behind the removal of |
Can you check whether the latest commits re this in the master branch covers your needs? |
Thanks for the hint on how to run these testcases easily. I just pushed a reviewed set of commits, rebased on current master.
Well, the simple reason for these failures was me removing special handling of pre/textarea, but not knowing how to remove these tests as well. That said, removing these was a mistake, I misunderstood the intention of this code earlier. It's not about not collecting non-styleable tags, but about ignoring stuff inside them. Accordingly, I melted down the first commit to just adding another two such tags with preformatted code. Where <title> isn't exactly preformatted, but of no use for styling anyways.
Well, some, but not all. This time, knowing how to deal with these tests, I managed to find a testcase for each. Committed testcases separately to allow seeing them failing. Luckily, I also found a fix for each. Also one or two testcases which fell into the keyboard somehow, without a matching reported issue I'd be aware of :-)
This time I moved the commit with this removal, along with a commit with a demonstrating (previously failing, now working) testcase just before it, all to the end, to allow (cherry-)picking earlier commits. The problem to solve appears when such a preformatted tag appears twice and contains a Perhaps you have a better idea on how to get the best of both, quick processing as well as recognizing repeated preformatted tags. |
These are two of the symptoms reported with #7567. The second test passes already. The first test fails, because the '<' lets the parsing logic start a new tag, which only ends on the actual closing tag. This means garbage in front of the actual closing tag, making parseEndTag() fail to recognize it as a closing tag.
This stops parsing of 'preformatted' tags not on any tag found, but only on finding the matching end tag. Key element here is looking at the end of a tag (strings.HasSuffix()), rather than the entire tag. Remainder is simplification of the logic flow. On what failed before, see also previous commit. This is related to issue #7567 and fixes the failing test case introduced with the previous commit.
First case actually fails. Second one tests against a situation I had during development of the previous commit (forgotten ToLower() in parseEndTag()).
This makes the previously introduced testcase pass.
A regular whitespace, a tab and a newline are all equivelant and perfectly valid whitespace characters. Solve this by using strings.Fields(), which considers all these. This appears to solve #8417
Well, uhm, err, ... also failing.
Same for the other table-related tags, of course. This makes the testcase introduced with the previous commit pass.
Here, the strategy to avoid processing tags twice fails. On the first round, everything works properly. Next round, the same tag is no longer recognized as a preformatted tag due to the shortcut taken, the mess happens again.
This fixes the previously introduced testcase, details see there. Also helps with issue #7567.
I just rebased the pull request branch to tag v0.83.1 and force-pushed it. |
Thanks, but I've already spent a considerably amount of time simplifying this in another. Would welcome any failed test cases after that PR is merged. |
This pull request has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs. |
Fallout of a hacking Sunday. Tackles #7567 and #8417.
Tested against my own project, only, as I'm not aware how the testsuite works.
Not so sure about commit 0d91578, "don't shortcut on previously seen elements", as I'm not sure how much performance impact it has. It solves handling ignored tags, but probably needs either removal of then dead code or a different solution for dealing with ignored tags appearing multiple times.