Skip to content
This repository has been archived by the owner on Aug 26, 2023. It is now read-only.

Apply lua gumbo fixes #84

Merged
merged 166 commits into from
Aug 22, 2018

Conversation

stevecheckoway
Copy link
Collaborator

This large PR pulls in all of @craigbarnes's fixes to gumbo by reverting all my recent patches to the error reporting, applying all of his commits, and then re-implementing my few commits.

I tried pulling the pieces in piecemeal and making individual PRs for them, but that approach didn't work. Once error reporting is fixed, we could move to a submodule and more easily track lua-gumbo's fixes to gumbo. (Or we could just stay on top of any changes and pull those in ourselves which shouldn't be too hard as long as our copies don't diverge too much.)

stevecheckoway and others added 30 commits August 19, 2018 17:04
In readiness to be replaced by a binary search in a future commit.
Also fix a signed integer overflow in the numeric entity parsing code
(which was causing 4 of the updated tests to fail).
All externally visible symbols should be properly prefixed/namespaced.
Header files are just a summary of types and function signatures.
They're not sufficiently "creative" for copyright to even apply.
Copyright and license info should only be in implementation (*.c)
files.
This is required so that they can be included in C++ files
(e.g. the unit tests in test/parser/*.cc).
The TAG macros just expand to designated initializers for the array
index of the enum value of the named tag. Using any of these macros more
than once with the same tag name within a single TagSet just produces
duplicate initializers.

The value used is whichever initializer appears last in the list and any
prior initializers with the same index are effectively ignored. In this
case the intended effects of using "TAG(TITLE)" were being canceled out
by "TAG_SVG(TITLE)", which appeared later in the list.
This massively reduces the amount of work done when normalizing SVG
attribute names.
@stevecheckoway
Copy link
Collaborator Author

@craigbarnes Awesome! I really appreciate all your help (and hard work on gumbo!)

There are several steps to go from an input byte stream to an element
when parsing the tag name.

**Old procedure**

1. UTF iterator produces code points and does replacements.
2. The results are stored in a temporary buffer, ASCII-lowercased.
3. If the tag name matches one of the ones that the HTML spec says must
   be handled, a GUMBO_TAG_xxx value is assigned, otherwise
   GUMBO_TAG_UNKNOWN is used and the normalized name is lost.
4. A start tag token is emitted and an element is created based on the
   tag.
5. Clients wanting to deal with the name have to go through this whole
   process starting with the raw bytes to work out the name. In the case
   of some elements in the SVG namespace, this requires additional work
   (see new procedure).

**New procedure**
- Steps 1-4 are the same except that in the case of
  GUMBO_TAG_UNKNOWN, a copy of the normalized tag name is saved in the
  token and ultimately in the element.
- For known tags, the const string corresponding to the name is looked
  up and stored in the element.
- When the HTML spec demands that certain elements in the SVG namespace
  have their tags adjusted, this now happens.
- Clients wanting the name of the element need only access the `name`
  field to get the correct name in all situations.

There's one element which requires special handling, `foreignObject`. In
SVG contexts, its tag name is `foreignObject`, otherwise, it is
`foreignobject`.
- `<!DOCTYPE html><br />` is not be an error.
- `<!DOCTYPE html><p></p/>` is an error.
This tests all of the tree-construction tests except for those marked
`#script-on` and those involving fragments since nokogumbo doesn't
support that yet.

All tests are passing.
@stevecheckoway
Copy link
Collaborator Author

@rubys Quick status update on this PR:
As of this evening,

  • every one of the html5lib-tests that doesn't require scripting and that doesn't involve fragment parsing is parsed successfully by nokogumbo, at least when libxml2's headers are available; and
  • gumbo's unit tests plus the ones added by @craigbarnes and several I added are all passing. I'm aware of a non-error related bug involving invalid HTML. I added a commented out unit test for that.

There are two issues when the headers aren't available

  1. Nokogiri or libxml2 is doing something strange with attributes with the namexml:lang. If you look in at test/test_tree-construction.rb, you can see that I had some trouble detecting that attribute. Hopefully, I can figure out how to work around this.
  2. Something is causing parse to throw TypeError: wrong argument type String (expected Data). I have no idea what that means, but I should be able to debug.

(Also &. doesn't seem to be available on the EOL'd versions of Ruby. I can rewrite it, but I think we should drop those versions anyway since they aren't supported any longer.)

Once these are addressed, I really think we should merge this. The improved testing alone is worth it.

@@ -1,6 +1,8 @@
require 'mkmf'
$CFLAGS += " -std=c99"

CONFIG['warnflags'].gsub!('-Wdeclaration-after-statement', '')
Copy link
Contributor

@craigbarnes craigbarnes Aug 22, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't you get rid of this by compiling with -std=c99?

Edit: hmm, ignore that comment. The line above is literally += " -std=c99".

I wonder why your compiler is warning for something that's explicitly allowed in C99 though...?

@rubys
Copy link
Owner

rubys commented Aug 22, 2018

I have no problem with dropping older versions of Ruby on a major release (e.g. 2.0). I personally don't find &. compelling, but ¯\_(ツ)_/¯. Dropping older versions will allow the Gemfile to be cleaned up.

I don't have time at the moment to fully review each of the changes, but don't wait on me. This all looks like good stuff! Keep up the good work!

Worst case, somebody notices a regression after the release, and has to temporarily pin their application to an older version of nokogumbo until it is resolved. I'm fine with that. My personal priorities at the moment is that the code works on macOS High Sierra and Ubuntu 18.04.

@stevecheckoway stevecheckoway merged commit 64acd00 into rubys:master Aug 22, 2018
@stevecheckoway stevecheckoway deleted the apply-lua-gumbo-fixes branch August 22, 2018 20:01
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants