Skip to content
This repository has been archived by the owner on Aug 26, 2023. It is now read-only.

Change default number of errors for 2.0.0 #78

Closed
stevecheckoway opened this issue Aug 18, 2018 · 13 comments
Closed

Change default number of errors for 2.0.0 #78

stevecheckoway opened this issue Aug 18, 2018 · 13 comments
Assignees
Milestone

Comments

@stevecheckoway
Copy link
Collaborator

stevecheckoway commented Aug 18, 2018

In #65 we decided that 0 was a good default for the maximum number of errors. I'm starting to think that's not a good choice and I think we should change it for 2.0.0. I propose 10 (although any small, nonzero number seems reasonable).

My rationale is that it's useful to be able to check if there were errors at all, even if you don't care about any specific number of them. That is, something like

html = ...
doc = Nokogiri::HTML5.parse(html)
if doc.errors.any?
  puts(doc.errors[0])
end

is nicer than

html = ...
doc = Nokogiri::HTML5.parse(html, max_parse_errors: 1)
if doc.errors.any?
  puts(doc.errors[0])
end

Plus, I think it's unintuitive that if the maximum number of errors is not set, then doc.errors will be empty even if there were parse errors.

10 or 5 errors seem reasonable (and we should document what we choose). It shouldn't lead to a massive amount of memory usage as can happen with an unbounded number of errors.

@rafbm, you raised the original issue. Would 5 or 10 work for you?

@stevecheckoway stevecheckoway added this to the 2.0 milestone Aug 18, 2018
@stevecheckoway stevecheckoway self-assigned this Aug 18, 2018
@rafbm
Copy link
Contributor

rafbm commented Aug 20, 2018

Would 5 or 10 work for you?

Since I personally have no interest in tracking errors, I would start adding max_parse_errors: 0 everywhere I call Nokogumbo. The overhead may be tiny, but still, I would see this as a “free trick for small perf win”. Kind of like adding # frozen_string_literal: true to all files. When you know the trick exists, why not opt-in?

Now, when I imagine users who do want to track errors, they most likely have a specific need like writing an actual HTML validator. Such user will likely want to increase the number to something like 50 or 100. I can imagine someone doing development / testing with small HTML documents, always being under the 10 errors default, but then when running in production, end users get weird behavior and report that “not all errors are detected”. If it was 100, then it’s more likely end users will figure that must be a hard application limit.

That’s sums my thoughts. I tend to think 5-10 can’t be optimal for anyone. It’s either too much (unneeded) or not enough.

Plus, I think it's unintuitive that if the maximum number of errors is not set, then doc.errors will be empty even if there were parse errors.

That’s a good point. A solution could be to raise something like Nokogumbo::ErrorsNotTracked: Specify a :max_parse_errors argument to track errors when calling #errors.

In any case, there should be small “Tracking parse errors” section in the README, which is thankfully very short. So most people will likely see this bit of documentation and know what to do with it.

@stevecheckoway
Copy link
Collaborator Author

@rafbm I'd prefer not to add an exception.

I'm not sure I agree with your rationale that 5–10 is not optimal for anyone. Maybe I'm alone in thinking this, but when a compiler, for example, shows me hundreds of errors, I don't find that particularly useful. I only need to see the first few errors which I'm then going to fix. I think a small number of errors is good for a human, a large number is probably good for a machine to act on.

I guess there are two questions to address:

  • What should happen if there are more errors than max_parse_errors?
  • What should the default be for max_parse_errors?

For the first question, the options I see are

  1. Do nothing.
  2. Indicate there are more errors by some #has_more_errors api.
  3. Insert an error in #errors indicating there are more errors.
  4. If max_parse_errors is 0, then raise an exception for #errors.

For the second question, the options are
A. Default to 0.
B. Default to a small number, say 5–10.
C. Default to a medium-sized number, say 100–200.
D. Default to an unbounded number.

  • The current situation (1A) is confusing to users and I think we should change.
  • I think 2A is likely to be as confusing as 1A since it's a new api.
  • 3 might be surprising, but I don't have a strong opinion on it.
  • I don't like 4 because it adds a new exception that isn't like Nokogiri's behavior at all.
  • D is probably pretty unfriendly and easy to cause a DoS by giving an exceptionally bad document.

That leaves 1B, 1C, 2B, 2C, 3A, 3B, 3C.

@rubys
Copy link
Owner

rubys commented Aug 20, 2018

I agree that doc.errors will be empty even if there were parse errors is confusing. Changing this so that doc.errors is nil when max_parse_errors is 0 should fix this (independent of whether the default for max_parse_errors should change).

I think it would be hard to defend any default for max_parse_errors that is other that 0, 1, or effectively unbounded. My mild preference is for zero, as I don't trust gumbo. Try to guess which one of the following is an error before you try it:

Nokogiri::HTML5('<!DOCTYPE html><p>', max_parse_errors: 1).errors.join
Nokogiri::HTML5('<!DOCTYPE html><br>', max_parse_errors: 1).errors.join
Nokogiri::HTML5('<!DOCTYPE html><br/>', max_parse_errors: 1).errors.join
Nokogiri::HTML5('<!DOCTYPE html><p><br/>', max_parse_errors: 1).errors.join

@stevecheckoway
Copy link
Collaborator Author

A default of 1 seems good. I don't like having #errors sometimes be an array and sometimes be nil. Same principle as here. But maybe idiomatic Ruby does this?

My guess is the third one is an error. I admit, I can't understand what the error is.

There are three tokens, a DOCTYPE token, a start tag token whose tag name is BR with the self-closing flag set, and an end-of-file token. The insertion modes should be

  1. https://html.spec.whatwg.org/multipage/parsing.html#the-initial-insertion-mode "A DOCTYPE token"
  2. https://html.spec.whatwg.org/multipage/parsing.html#the-before-html-insertion-mode "Anything else"
  3. https://html.spec.whatwg.org/multipage/parsing.html#the-before-head-insertion-mode "Anything else"
  4. https://html.spec.whatwg.org/multipage/parsing.html#parsing-main-inhead "Anything else"
  5. https://html.spec.whatwg.org/multipage/parsing.html#the-after-head-insertion-mode "Anything else"
  6. https://html.spec.whatwg.org/multipage/parsing.html#parsing-main-inbody "A start tag whose tag name is one of: 'area', 'br', 'embed', 'img', 'keygen', 'wbr'
  7. https://html.spec.whatwg.org/multipage/parsing.html#parsing-main-inbody "An end-of-file token"

One consumes the DOCTYPE token whereas two through five insert HTML, HEAD, nothing, and BODY, respectively, without consuming anything. Six inserts BR and acknowledges the self-closing flag. Seven stops parsing.

But gumbo reports an error with the stack of open elements being just html. Is this a bug in gumbo or am I misunderstanding the HTML spec?

Anyway, I think my preference among your options is to default to 1 and always have #errors return a (possibly empty) array. If users set max_parse_errors: 0, then they should know they're ignoring all errors.

@stevecheckoway
Copy link
Collaborator Author

I think this is a bug in gumbo. An informational portion of the standard says

The trailing U+002F (/) in a start tag name can be used only in foreign content to specify self-closing tags. (Self-closing tags don't exist in HTML.) It is also allowed for void elements, but doesn't have any effect in this case.

br is a void element.

In fact, gumbo seems to miss self-closing end tags entirely! I think the fix for both is not too hard though.

@rubys
Copy link
Owner

rubys commented Aug 20, 2018

My observation is that it would not be difficult to find bugs in Gumbo in this area. My personal feeling is that this is an experimental feature, and as such users should opt in to get errors. That being said, I don't strongly object to 1 as a new default.

Also +1 to there should be small “Tracking parse errors” section in the README

@stevecheckoway
Copy link
Collaborator Author

How about we make 1 the new default, add a parse errors section where we note that it's experimental (and we welcome bug reports regarding missing or incorrect errors)?

@stevecheckoway
Copy link
Collaborator Author

Here's the fix for those self-closing errors. #84 needs to be merged first though.

@stevecheckoway
Copy link
Collaborator Author

Okay, after having spent a bunch of time integrating the html5lib tests, I agree that the error situation here is not great. I'll leave 0 as the default and add some documentation.

@rafbm
Copy link
Contributor

rafbm commented Aug 22, 2018

I realize it’s kinda late and I know I’m the one who introduced :max_parse_errors. But reading the documentation bit you added, it strikes me that :max_parse_errors could have been reduced to :max_errors. Since the getter is named @errors and not @parse_errors, it would arguably better align with it. It would also make usage of the error tracking feature terser.

@stevecheckoway
Copy link
Collaborator Author

That's not a bad idea. Maybe supporting both for a few releases but only document :max_errors?

@rafbm
Copy link
Contributor

rafbm commented Aug 22, 2018

I let you see if you think backward compatibility is needed, given documentation is only ~30 minutes old.

@stevecheckoway
Copy link
Collaborator Author

Oh, I wasn't worried about the documentation. But we've mentioned :max_parse_errors in response to other issues so at least some people are using it.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants