Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JRuby unable to create HTML document if URL unresponsive #703

Closed
nirvdrum opened this issue Jun 15, 2012 · 10 comments
Closed

JRuby unable to create HTML document if URL unresponsive #703

nirvdrum opened this issue Jun 15, 2012 · 10 comments

Comments

@nirvdrum
Copy link

nirvdrum commented Jun 15, 2012

This seems related to the closed out #674. If I use a script similar to what @flavorjones provided, but instead of providing a valid URL, I use one that doesn't resolve with DNS, I get an error (and an unhelpful message):

#! /usr/bin/env ruby

require 'nokogiri'
require 'open-uri'
require 'yaml'

puts Nokogiri::VERSION_INFO.to_yaml
html = open('http://google.com/').read
doc = Nokogiri::HTML html, 'http://googleaoeu.com/', nil
puts "Doc: #{doc.to_html.inspect}"
puts "Errors: #{doc.errors}"

I get the error:

Errors: [#<Nokogiri::XML::SyntaxError: googleaoeu.com>]

I don't believe the C variant tries to resolve the URL and rather just accepts it as. In any event, it proceeds just fine. As a secondary concern, the error message here isn't much help. It just outputs the hostname and says it's a SyntaxError. Some more context there would be more helpful.

@nirvdrum
Copy link
Author

I should note that this is on 1.5.5 RC2, since previous versions had the problem in #674.

@nirvdrum
Copy link
Author

Okay, this issue is even bigger than I thought. It appears the provided document content is just flat out ignored and the resource specified by the URL is fetched instead. If the URL is changed to be a valid one, the document will use the body from that URL as the document body.

@nirvdrum
Copy link
Author

@jvshahid Since you provided the fix for #674, I was wondering if you might have some insight into the problem. If not, sorry to rope you in.

@jvshahid
Copy link
Member

@nirvdrum I'm happy to look into the issue. I'll get back to you shortly.

@jvshahid
Copy link
Member

@nirvdrum I apologize for my sloppy change, that was my first commit to Nokogiri and I had little context about what was happening. My understanding of the code agrees with your observations, when a url is provided we set the systemId of the InputSource and we ignore the data that is passed to the parser. This is causing the parser to parse the data from the given url instead of the data that was passed to the parser. That also explains why the url has to be resolved.

The fix is easy, but I would like to make sure that the behavior is the consistent with the C Nokogiri before I push the change.

@jvshahid
Copy link
Member

I just pushed some changes to a branch 'fix_entity_resolving' that fixes our entity resolving logic. This should fix your problem and a bunch of other bugs with entity resolving and validation.

@yokolet can you eyeball the changes, I feel it's very major and would like you to look at it. After that we can merge the changes to master.

I'm going to open a new issue regarding handling entity resolving and validation in general. I believe there's more to be done in that area.

@nirvdrum
Copy link
Author

Thanks @jvshahid. I really need to get up to speed with building nokogiri so I can try to be more valuable. My previous attempts were thwarted for one reason or another, but that's not much of an excuse.

@jvshahid
Copy link
Member

@nirvdrum the fix is merged on master and will be packaged in a rc package soon.

@nirvdrum
Copy link
Author

Great! Thanks for looking into it and dealing with the larger issue around entity resolving.

@jvshahid
Copy link
Member

jvshahid commented Jul 4, 2012

Closing the issue, please reopen if the problem still exist. Thanks again for reporting this bug.

@jvshahid jvshahid closed this as completed Jul 4, 2012
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants