[Bug?] unknown encoding ASCII-8BIT #553

Closed
bogdan opened this issue Oct 16, 2011 · 24 comments

Comments

@bogdan

bogdan commented Oct 16, 2011

I am using ruby 1.9.3-preview1 and nokogiri 1.5.0.

require "nokogiri"

s = ""
puts Nokogiri::HTML::DocumentFragment.parse("e#{s}").to_s.inspect

Outputs:

output error : unknown encoding ASCII-8BIT
""

Strangely, if the empty string is interpolated inline, as in "e#{""}", there is no error.

libxml 2.7.8.dfsg-2ubuntu0.1

@nobu
Contributor

nobu commented Oct 17, 2011

It's [Bug #5126] and should have been backported to 1.9.3rc1.

http://redmine.ruby-lang.org/issues/show/5126

@nurse
Contributor

nurse commented Oct 17, 2011

Ruby 1.9.3 RC1 has already been released and includes the fix, so please use RC1.

@bogdan
Author

bogdan commented Oct 18, 2011

The code above works fine with 1.9.3-rc1.

But the following code fails with the same output on both 1.9.2 and 1.9.3-rc1:

require "nokogiri"

s = "ee"
s.force_encoding "ASCII-8BIT"

puts Nokogiri::HTML::DocumentFragment.parse(s).to_s.inspect
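
The string in this reproduction is plain ASCII; only its encoding tag differs. One possible application-side workaround, which is an assumption and not something the maintainers confirm in this thread, is to normalize the tag before the string reaches the parser. The helper name `utf8_for_parsing` is hypothetical and not part of Nokogiri; the sketch uses only the Ruby standard library.

```ruby
# Hypothetical helper (not part of Nokogiri): re-tag a binary string as UTF-8
# when its bytes are valid UTF-8, so the parser never sees ASCII-8BIT.
def utf8_for_parsing(str)
  return str unless str.encoding == Encoding::ASCII_8BIT

  candidate = str.dup.force_encoding(Encoding::UTF_8)
  # force_encoding changes only the tag, never the bytes, so verify them.
  raise ArgumentError, "input is not valid UTF-8" unless candidate.valid_encoding?

  candidate
end

s = "ee".dup
s.force_encoding("ASCII-8BIT")

fixed = utf8_for_parsing(s)
puts fixed.encoding   # UTF-8
puts fixed == "ee"    # true
```

The result could then be passed to `Nokogiri::HTML::DocumentFragment.parse`. Whether that avoids the error on every affected version is not verified here.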

@jsqu99

jsqu99 commented Nov 8, 2011

Using ruby-1.9.3-p0, @bogdan's code reproduces the error as well.

@wasimakram

Yes, I am using ruby-1.9.3-p0 and getting the same error.

@bkabrda

bkabrda commented Dec 1, 2011

I think I'm getting similar errors during tests:

  1. Failure:
    test_name(Nokogiri::XML::TestNodeEncoding) [/home/bkabrda/rpmbuild/BUILD/rubygem-nokogiri-1.5.0/usr/share/gems/gems/nokogiri-1.5.0/test/xml/test_node_encoding.rb:52]:
    Expected: "US-ASCII"
    Actual: "UTF-8"

  2. Failure:
    test_get_attribute(Nokogiri::XML::TestNodeEncoding) [/home/bkabrda/rpmbuild/BUILD/rubygem-nokogiri-1.5.0/usr/share/gems/gems/nokogiri-1.5.0/test/xml/test_node_encoding.rb:14]:
    Expected: "US-ASCII"
    Actual: "UTF-8"

  3. Failure:
    test_content(Nokogiri::XML::TestNodeEncoding) [/home/bkabrda/rpmbuild/BUILD/rubygem-nokogiri-1.5.0/usr/share/gems/gems/nokogiri-1.5.0/test/xml/test_node_encoding.rb:47]:
    Expected: "US-ASCII"
    Actual: "UTF-8"

  4. Failure:
    test_path(Nokogiri::XML::TestNodeEncoding) [/home/bkabrda/rpmbuild/BUILD/rubygem-nokogiri-1.5.0/usr/share/gems/gems/nokogiri-1.5.0/test/xml/test_node_encoding.rb:57]:
    Expected: "US-ASCII"
    Actual: "UTF-8"

  5. Failure:
    test_encode_special_chars(Nokogiri::XML::TestNodeEncoding) [/home/bkabrda/rpmbuild/BUILD/rubygem-nokogiri-1.5.0/usr/share/gems/gems/nokogiri-1.5.0/test/xml/test_node_encoding.rb:42]:
    Expected: "US-ASCII"
    Actual: "UTF-8"

This happens when the LANG environment variable is set to something unusual (for example, when builds run with LANG=C).
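
For context, Ruby derives `Encoding.default_external` from the process locale, so the value can differ between a UTF-8 shell and a build that runs with `LANG=C`. A small stdlib-only sketch for checking what a given environment produces follows; the method name `encoding_report` is made up for illustration, and whether this accounts for every failure listed above is an assumption.

```ruby
# Report the locale-related encodings Ruby picked up from the environment.
# `encoding_report` is a hypothetical name used only for this sketch.
def encoding_report
  {
    lang: ENV["LANG"],                                  # may be nil if unset
    locale_charmap: Encoding.locale_charmap,            # charmap name reported by the OS locale
    default_external: Encoding.default_external.name    # what Ruby uses for IO by default
  }
end

p encoding_report
```

Running the script once normally and once as `LANG=C ruby script.rb` makes any difference in `default_external` visible.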

@ciniglio

My $LANG is set to en_US.UTF-8. Is there a better setting which will make this error go away?

@emad-elsaid

I have the same error when parsing this BBC page:

http://www.bbc.co.uk/arabic/business/2013/05/130513_us_obama_tax.shtml

using the following code:

require 'nokogiri'
require 'open-uri'
url = open('http://www.bbc.co.uk/arabic/business/2013/05/130513_us_obama_tax.shtml')
document = url.to_a.join ''
noko = Nokogiri::HTML.fragment(document)
noko.to_s

It produces many lines of:

output error : unknown encoding ASCII-8BIT

Any ideas how to solve it?

ruby --version
ruby 1.9.3p194 (2012-04-20 revision 35410) [i686-linux]

@jvshahid
Member

Can't reproduce with 1.9.3p448 or jruby 1.7.6. I'll go ahead and close this issue. Please reopen it if you can still reproduce the bug.

Cheers,

@tomnatt

tomnatt commented Dec 5, 2013

I am still seeing the exact same problem as blazeeboy when running his test code.

Ruby just installed via rvm on an Ubuntu machine:

$ ruby -v
ruby 1.9.3p484 (2013-11-22 revision 43786) [x86_64-linux]

and the same with a ruby 2.0.0 environment:

$ ruby -v
ruby 2.0.0p353 (2013-11-22 revision 43784) [x86_64-linux]

This is with nokogiri 1.6.0 (which seems to be the latest?).

@yagudaev

I also got this same output in a Heroku log, but could not reproduce the issue in the console (though I can reliably reproduce it in the application). This is caused by a special character “subscribe”. I'm not sure this is actually Nokogiri's fault.

Using ruby 1.9.3p545 (2014-02-24 revision 45159) [x86_64-linux] (heroku).

@Geesu

Geesu commented May 21, 2014

I'm also getting this on 2.0.0p353 - has anyone been able to find a fix?

@Geesu

Geesu commented May 21, 2014

Just tried 2.0.0-p481. Same issue :/

@lolmaus

lolmaus commented Jun 16, 2014

Getting this on ruby 2.1.2p95 (2014-05-08 revision 45877) [x86_64-linux] (rvm). :(

This happens when calling traverse on a string of HTML containing Cyrillic text. The same HTML is used to build a static site with the Middleman gem.

@flavorjones
Member

Reopening. Will investigate.

@flavorjones
Member

Closing, never could reproduce. Happy to re-open if someone can help me reproduce the problem.

@amatsuda

@flavorjones Reproducing the problem is not hard.

require 'nokogiri'

s = String.new 'hello', encoding: Encoding::ASCII_8BIT

p xml: Nokogiri::XML::DocumentFragment.parse(s).to_s
#=> {:xml=>"hello"}
p html: Nokogiri::HTML::DocumentFragment.parse(s).to_s
#=> output error : unknown encoding ASCII-8BIT
#=> {:html=>""}

Seems like native_write_to does not accept 'ASCII-8BIT' encoding?
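
For context, `ASCII-8BIT` is Ruby's own label for binary data, with `Encoding::BINARY` as an alias. A serializer that rejects that name would be consistent with the error above, though the thread does not confirm where exactly the name is rejected. A stdlib-only sketch of how strings commonly end up with this tag:

```ruby
# ASCII-8BIT and BINARY refer to the same Encoding object in Ruby.
puts Encoding::ASCII_8BIT.equal?(Encoding::BINARY)

# Common ways a string ends up tagged as binary:
a = String.new("hello", encoding: Encoding::ASCII_8BIT)  # explicit tag
b = "hello".b                                            # String#b returns a binary-tagged copy
c = String.new                                           # String.new with no arguments is binary

[a, b, c].each { |str| puts str.encoding == Encoding::BINARY }

# The bytes are untouched: an ASCII-only binary string still equals its UTF-8 twin.
puts a == "hello"
```

Checking `str.encoding` just before the parse call is a cheap way to see whether an application is passing binary-tagged input.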

@flavorjones
Member

@amatsuda Thank you for helping me reproduce.

@stephankaag
Contributor

Any update available?

@pmackay

pmackay commented Jun 2, 2017

I'm getting this only on Heroku on ruby 2.3.3. Is there any way to work around it?

@yagudaev did you find any more specifics about why it would just happen on Heroku?

@flavorjones
Member

@pmackay this is reproducible in ruby 2.4.0. It's definitely not heroku-specific or 2.3.3-specific.

@flavorjones
Member

It's also specific to HTML::DocumentFragment. A potential workaround for now is to use HTML::Document or else XML::DocumentFragment if possible.

@envygeeks

envygeeks commented Nov 16, 2017

Since HTML::Document.parse deems it necessary to add a document type, and that document type is older than most modern browsers, neither of those is really a workaround. We have no way to truly parse a real fragment without arbitrary warnings or a DOCTYPE.

Is this happening upstream or within Nokogiri?

envygeeks added a commit to envygeeks/jekyll-assets that referenced this issue Nov 16, 2017

@flavorjones
Member

This is remarkably similar to #1659 which was fixed by #1674. I imagine there's a similar problem here, if anyone wants to help out and take a look.
