-
Notifications
You must be signed in to change notification settings - Fork 114
Apply lua gumbo fixes #84
Apply lua gumbo fixes #84
Conversation
In readiness to be replaced by a binary search in a future commit.
Also fix a signed integer overflow in the numeric entity parsing code (which was causing 4 of the updated tests to fail).
This time with a sane build system.
All externally visible symbols should be properly prefixed/namespaced.
Header files are just a summary of types and function signatures. They're not sufficiently "creative" for copyright to even apply. Copyright and license info should only be in implementation (*.c) files.
This is required so that they can be included in C++ files (e.g. the unit tests in test/parser/*.cc).
The TAG macros just expand to designated initializers for the array index of the enum value of the named tag. Using any of these macros more than once with the same tag name within a single TagSet just produces duplicate initializers. The value used is whichever initializer appears last in the list and any prior initializers with the same index are effectively ignored. In this case the intended effects of using "TAG(TITLE)" were being canceled out by "TAG_SVG(TITLE)", which appeared later in the list.
This massively reduces the amount of work done when normalizing SVG attribute names.
@craigbarnes Awesome! I really appreciate all your help (and hard work on gumbo!) |
1c7b8a7
to
2ecdaba
Compare
There are several steps to go from an input byte stream to an element when parsing the tag name. **Old procedure** 1. UTF iterator produces code points and does replacements. 2. The results are stored in a temporary buffer, ASCII-lowercased. 3. If the tag name matches one of the ones that the HTML spec says must be handled, a GUMBO_TAG_xxx value is assigned, otherwise GUMBO_TAG_UNKNOWN is used and the normalized name is lost. 4. A start tag token is emitted and an element is created based on the tag. 5. Clients wanting to deal with the name have to go through this whole process starting with the raw bytes to work out the name. In the case of some elements in the SVG namespace, this requires additional work (see new procedure). **New procedure** - Steps 1-4 are the same except that in the case of GUMBO_TAG_UNKNOWN, a copy of the normalized tag name is saved in the token and ultimately in the element. - For known tags, the const string corresponding to the name is looked up and stored in the element. - When the HTML spec demands that certain elements in the SVG namespace have their tags adjusted, this now happens. - Clients wanting the name of the element need only access the `name` field to get the correct name in all situations. There's one element which requires special handling, `foreignObject`. In SVG contexts, its tag name is `foreignObject`, otherwise, it is `foreignobject`.
- `<!DOCTYPE html><br />` is not be an error. - `<!DOCTYPE html><p></p/>` is an error.
This tests all of the tree-construction tests except for those marked `#script-on` and those involving fragments since nokogumbo doesn't support that yet. All tests are passing.
@rubys Quick status update on this PR:
There are two issues when the headers aren't available
(Also Once these are addressed, I really think we should merge this. The improved testing alone is worth it. |
ext/nokogumbo/extconf.rb
Outdated
@@ -1,6 +1,8 @@ | |||
require 'mkmf' | |||
$CFLAGS += " -std=c99" | |||
|
|||
CONFIG['warnflags'].gsub!('-Wdeclaration-after-statement', '') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't you get rid of this by compiling with -std=c99
?
Edit: hmm, ignore that comment. The line above is literally += " -std=c99"
.
I wonder why your compiler is warning for something that's explicitly allowed in C99 though...?
I have no problem with dropping older versions of Ruby on a major release (e.g. 2.0). I personally don't find I don't have time at the moment to fully review each of the changes, but don't wait on me. This all looks like good stuff! Keep up the good work! Worst case, somebody notices a regression after the release, and has to temporarily pin their application to an older version of nokogumbo until it is resolved. I'm fine with that. My personal priorities at the moment is that the code works on macOS High Sierra and Ubuntu 18.04. |
The argument order for `Nokogiri::XML::CDATA::new` and `Nokogiri::XML::Text::new` are swapped.
This large PR pulls in all of @craigbarnes's fixes to gumbo by reverting all my recent patches to the error reporting, applying all of his commits, and then re-implementing my few commits.
I tried pulling the pieces in piecemeal and making individual PRs for them, but that approach didn't work. Once error reporting is fixed, we could move to a submodule and more easily track lua-gumbo's fixes to gumbo. (Or we could just stay on top of any changes and pull those in ourselves which shouldn't be too hard as long as our copies don't diverge too much.)