Remove ISO-8859-1 charset fallback #2086
I agree that we should remove the fallback. I do wonder how we should handle
Given that this is not guaranteed to be included in requests, I'm fine with adding it to the toolbelt. That said, I'm also okay with making this a separate package so users can just use that without having to install the rest of the toolbelt. That, however, is a separate discussion.
We still have a method to do this in |
Upon reading more into RFC 7231, specifically Section 3.1.1.5, I think the third option should ideally be opt-in, not opt-out. My specific reasoning for this is:
Taken from the same section I linked above. |
Agreed from a correctness perspective, but I wonder if @kennethreitz is going to want it from a usability perspective. |
I wonder how easy it would be to put together a simple app that demonstrates the security risks involved, to give a concrete example of why not to do it.
If only |
Correct. =) |
That's how it works now, so I don't think we'd change that. |
Look up all the UTF-7 XSS attacks. (None work in any current browser: everyone simply dropped UTF-7 sniffing, and most UTF-7 support entirely, to avoid making sites vulnerable that way.) In a very real sense, using chardet is worse than option three above: it will reach different conclusions from any implementation that follows a specification defining how to sniff the content (and both HTML and XML provide such a specification). The only safe thing to do, if you don't know how to determine the charset, is to not try. You can probably support the vast majority of users by implementing the (standardised) HTML and XML character encoding detection algorithms.
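For readers unfamiliar with those algorithms, here is a very rough sketch of the kind of prescan being referred to. It is not the full HTML or XML detection algorithm, and the function name and fallback behaviour are illustrative assumptions, not anything requests provides:

```python
import re

def sniff_encoding(body, declared=None):
    # BOMs take precedence (see the HTML spec discussion below).
    if body.startswith(b'\xef\xbb\xbf'):
        return 'utf-8'
    if body.startswith((b'\xff\xfe', b'\xfe\xff')):
        return 'utf-16'
    # Crude stand-in for the HTML prescan / XML declaration check:
    # only look for a charset in roughly the first 1024 bytes.
    head = body[:1024]
    match = re.search(br'''<meta[^>]+charset=["']?([\w-]+)''', head, re.IGNORECASE)
    if match is None:
        match = re.search(br'''<\?xml[^>]+encoding=["']([\w-]+)["']''', head)
    if match is not None:
        return match.group(1).decode('ascii', 'replace').lower()
    # If nothing declares an encoding, fall back to whatever the transport said.
    return declared
```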
Hmm, the registration (which is included inline in the HTML spec) contradicts the HTML spec itself — per the HTML spec, UTF-8/UTF-16 BOMs are given precedence over the MIME type. I've filed bug 26100 for that. |
Hmmm.... |
http://html5.org/tools/web-apps-tracker?from=8723&to=8724 fixes the IANA registration in the HTML spec to match the body of it. It now reads:
Hello, I just got hit by this while reading XML files encoded as UTF-8. On OS X the content type was being returned as 'application/xml', but on Linux it was set to 'text/xml', so requests assumed its default encoding of 'ISO-8859-1' because the type contained 'text'. Most XML files are encoded in UTF-8, so setting the encoding to 'ISO-8859-1' for 'text/xml' content is surely wrong, as discussed.
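The text/xml versus application/xml difference can be seen directly in requests.utils.get_encoding_from_headers, at least in the 2.x releases current around the time of this discussion; treat the exact outputs below as indicative of the behaviour being described rather than a guarantee for every version:

```python
>>> from requests.utils import get_encoding_from_headers
>>> get_encoding_from_headers({'content-type': 'text/xml'})
'ISO-8859-1'
>>> get_encoding_from_headers({'content-type': 'application/xml'}) is None
True
>>> get_encoding_from_headers({'content-type': 'text/xml; charset=utf-8'})
'utf-8'
```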
Just because an RFC specifies something, doesn't mean we should do it. Especially if it makes the code crazy. I believe that our current behavior is elegant and actually works quite effectively. Is this not the case? As always, what does Chrome do? |
Chrome introspects the HTML, something we've always decided we don't want to do. We could optionally add support for hooks to do content-type-specific encoding heuristics if we wanted. We already sort of do that for JSON; it might not hurt to do it more generally for other content types.
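To make the hook idea concrete, here is one hypothetical shape it could take: a registry mapping media types to encoding-guessing callables. None of these names exist in requests; they are assumptions made purely for illustration:

```python
import codecs

def _json_encoding(body, declared=None):
    # JSON is UTF-8/16/32 by specification; a UTF-8 BOM or a declared
    # charset wins, otherwise assume UTF-8.
    if body.startswith(codecs.BOM_UTF8):
        return 'utf-8'
    return declared or 'utf-8'

ENCODING_SNIFFERS = {
    'application/json': _json_encoding,
    # 'text/html': an HTML prescan could be registered here, and so on.
}

def guess_encoding(content_type, body, declared=None):
    media_type = content_type.split(';')[0].strip().lower()
    sniffer = ENCODING_SNIFFERS.get(media_type)
    return sniffer(body, declared) if sniffer is not None else declared
```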
Grumble. |
Note that the HTML case is even worse than that, really, because the pre-scan in browsers only looks at the first 1024 bytes or thereabouts, and the parser itself can then change the encoding.
I still feel like our current behavior is a great implementation. |
Ok. =) How about I knock up a proposal for smarter encoding heuristics that takes certain known content types and attempts to gently introspect them for their encodings? At least then we'll have something concrete to discuss.
Interesting discussion.
If I may, a counterexample. I use requests to extract the <title> from web pages:

>>> import requests
>>> r = requests.get("https://www.facebook.com/mathiassundin/posts/10153748227675479")
>>> r.encoding
'ISO-8859-1'
>>> import re
>>> m = re.search('<title[^>]*>\s*(.+?)\s*<\/title>', r.text, re.IGNORECASE|re.MULTILINE)
>>> m.group(1)
u'Mathias Sundin - M\xc3\xa5nga roliga samtal mellan Tony Blair och... | Facebook'
>>> r.headers['Content-Type']
'text/html'

Apparently they don't specify the encoding in their headers, but they do in the HTML. As mentioned in #2161, "requests aren't a HTML library", and I can understand that point of view. Perhaps I should be looking into parsing the HTML with some library that can also take the specified encoding into consideration. Thank you for your work on requests. I'm looking forward to 3.0.0.
@dentarg That's not really a counterexample: it's an example of why this system works. The guidance from the IETF is that for all text/* media types, one of the following things MUST be true:
Requests' default behaviour here works well: in the case of XML and HTML, ISO-8859-1 is guaranteed to safely decode the text well enough to let you see the in-document charset declaration. The behaviour requests has right now is probably less prone to failure with HTML than the one proposed for 3.0.0, and part of me does wonder if we should try to solve this more by documentation than by code change.
facebook.com doesn't specify the charset in its HTTP headers, and requests then falls back to ISO-8859-1 (the fallback might be removed in 3.0.0, see kennethreitz/requests#2086). Use Beautiful Soup, which can be intelligent about the encoding[1].

[1]: http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html#Beautiful%20Soup%20Gives%20You%20Unicode,%20Dammit

Close #17.
Yes, perhaps documentation is the way forward. I can share my experience. I'm a very new user of requests, and I haven't looked at all the documentation for requests, but I have looked at some. The requests website has this in the code snippet on the front page:

>>> r.encoding
'utf-8'

That information, combined with me somehow (maybe from browsing issues here on GitHub) finding out that requests uses chardet, gave me the impression that requests would solve the charset/encoding problem for me "all the way". Once I found the right documentation, it was straightforward to work around the limitations with another library. I can fully understand that requests just wants to be the HTTP parser, not the HTML parser.

All that said, let me share some short examples where I think the ISO-8859-1 fallback might cause some unexpected behavior.

>>> import requests
>>> r = requests.get("https://www.facebook.com/mathiassundin/posts/10153748227675479")
>>> from bs4 import BeautifulSoup

You can't use r.text:

>>> print BeautifulSoup(r.text, "html5lib", from_encoding="").title.text
Mathias Sundin - MÃ¥nga roliga samtal mellan Tony Blair och... | Facebook

>>> print BeautifulSoup(r.text, "html5lib", from_encoding=r.encoding).title.text
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/site-packages/bs4/__init__.py", line 215, in __init__
    self._feed()
  File "/usr/local/lib/python2.7/site-packages/bs4/__init__.py", line 239, in _feed
    self.builder.feed(self.markup)
  File "/usr/local/lib/python2.7/site-packages/bs4/builder/_html5lib.py", line 50, in feed
    doc = parser.parse(markup, encoding=self.user_specified_encoding)
  File "/usr/local/lib/python2.7/site-packages/html5lib/html5parser.py", line 236, in parse
    parseMeta=parseMeta, useChardet=useChardet)
  File "/usr/local/lib/python2.7/site-packages/html5lib/html5parser.py", line 89, in _parse
    parser=self, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/html5lib/tokenizer.py", line 40, in __init__
    self.stream = HTMLInputStream(stream, encoding, parseMeta, useChardet)
  File "/usr/local/lib/python2.7/site-packages/html5lib/inputstream.py", line 144, in HTMLInputStream
    raise TypeError("Cannot explicitly set an encoding with a unicode string")
TypeError: Cannot explicitly set an encoding with a unicode string

You need to use r.content:

>>> print BeautifulSoup(r.content, "html5lib", from_encoding=r.encoding).title.text
Mathias Sundin - MÃ¥nga roliga samtal mellan Tony Blair och... | Facebook

Working as intended:

>>> print BeautifulSoup(r.content, "html5lib", from_encoding="").title.text
Mathias Sundin - Många roliga samtal mellan Tony Blair och... | Facebook

>>> print BeautifulSoup(r.content, "html5lib", from_encoding="utf-8").title.text
Mathias Sundin - Många roliga samtal mellan Tony Blair och... | Facebook
So here we hit a problem, which is that we can't really document the way Requests interacts with every possible tool you may want to use it with: there are just too many possibilities! So instead we try to provide general documentation. The best advice I can give is that the more specific the tool, the more likely it can handle bytes as an input and DTRT with them. BeautifulSoup is an HTML tool and so can almost certainly take a stream of arbitrary bytes and find the relevant encoding declaration for itself.
Ref. RFC2616 (HyperText Transfer Protocol), section 3.7.1 (Canonicalization and Text Defaults).
We want to try UTF-8 before falling back to ISO-8859-1. See also: https://github.com/kennethreitz/requests/issues/2086
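A minimal sketch of what "try UTF-8 before falling back to ISO-8859-1" could mean in practice when no charset is declared; the helper name is made up for illustration. Because strict UTF-8 decoding rejects most non-UTF-8 byte sequences, a successful decode is a reasonably strong signal:

```python
def decode_text_body(body, declared_encoding=None):
    # Honour an explicitly declared charset first.
    if declared_encoding:
        return body.decode(declared_encoding)
    # Try strict UTF-8 next; it fails loudly on most non-UTF-8 input.
    try:
        return body.decode('utf-8')
    except UnicodeDecodeError:
        # Last resort: ISO-8859-1 maps every byte, so this never raises.
        return body.decode('iso-8859-1')
```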
Sadly, that isn't true, because of UTF-16. In both cases, they should be able to handle BOMs. Furthermore, a conforming XML parser should be able to handle a UTF-16 encoded document.
@gsnedders I'm not sure what your point is. ISO-8859-1 will decode incorrectly, but you can still search for the initial encoding declaration.
@Lukasa My point is that you're not necessarily just looking for a
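A quick illustration, constructed here rather than taken from the thread, of why UTF-16 breaks the "just search the ISO-8859-1-decoded text" approach: the interleaved NUL bytes mean the declaration never appears as a contiguous ASCII substring.

```python
>>> xml = '<?xml version="1.0" encoding="UTF-16"?><doc/>'.encode('utf-16')
>>> decoded = xml.decode('iso-8859-1')   # every byte "decodes", but...
>>> '<?xml' in decoded                   # ...the declaration cannot be found as ASCII
False
>>> 'encoding="UTF-16"' in decoded
False
```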
So agreed that there is more complexity here, but I don't believe that requests can meaningfully take that complexity on. The spectrum of options in this situation is:
I'd argue that of these three options, option 1 is the least user hostile. Essentially, we promise that accessing |
I've had more than a handful of problems from UTF-8 content being read as ISO-8859-1. I've handled this for now by setting

Something that I would like to propose for 3.0.0 is a variation of the current behavior (#1) which can be overridden by end-users/consumers/developers with a dedicated hook into the session, like how

Edit: I forgot to mention: I also found it very beneficial (at least for debugging) to "tier" how the encoding is accessed. We basically use this logic:
i think a similar approach would be good for
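A hypothetical sketch of how such a tiered lookup could be wired in as a session response hook, using only public requests machinery (Session.hooks and Response.apparent_encoding). The tiers shown are illustrative, not the exact logic described above:

```python
import re
import requests

CHARSET_RE = re.compile(br'''(?:charset|encoding)=["']?([\w-]+)''', re.IGNORECASE)

def resolve_encoding(response, **kwargs):
    declared = response.headers.get('Content-Type', '')
    if 'charset=' in declared.lower():
        return response                          # tier 1: trust the header
    # Note: touching .content here reads the whole body, so this sketch
    # ignores streaming responses.
    match = CHARSET_RE.search(response.content[:1024])
    if match is not None:                        # tier 2: crude meta/XML prescan
        response.encoding = match.group(1).decode('ascii', 'replace')
    else:                                        # tier 3: chardet guess, with all
        response.encoding = response.apparent_encoding  # the caveats discussed above
    return response

session = requests.Session()
session.hooks['response'].append(resolve_encoding)
```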
According to the PEP it was introduced in Python 3.1, although it's not mentioned in the 3.1 What's New. The 3.2 What's New mentions in passing that tarfile uses it.
Aside: this might be less important if the user had more control. At the moment, you can set
However, I suppose you can already just do |
@candlerb your suggestions happen after the request has been processed, and detecting the correct (better) option is important to do asap because there are scriptable elements of

jumping back to the discussion on Jan 9, 2016 between @dentarg and @Lukasa, I think the root issue is that when we're consuming Western content, this approach is fine - but when you're looking at truly global content, it is a complete pain. our indexers were getting tripped up a lot by websites in Asia and the Middle East that don't identify a charset in the headers, and opt for identifying it with a content tag instead. we've had good luck in production with the ability to set a custom

i'm 100% sure there is a 10x better method, but having pulled well over 1BN+ pages with requests - many of which are non-western - i think it's the best approach on this page so far.
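For reference, the simplest form of that kind of workaround using only public requests attributes; the URL is a placeholder, and whether the chardet-based guess is trustworthy is exactly what is debated earlier in this thread:

```python
import requests

r = requests.get("https://example.com/")   # placeholder URL
if 'charset=' not in r.headers.get('Content-Type', '').lower():
    # No charset in the headers: use the chardet-based guess instead of
    # the ISO-8859-1 default before touching r.text.
    r.encoding = r.apparent_encoding
text = r.text
```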
For a long time we've had a fallback value in response.encoding of ISO-8859-1, because RFC 2616 told us to. RFC 2616 is now obsolete, replaced by RFCs 7230, 7231, 7232, 7233, 7234, and 7235. The authoritative RFC on this issue is RFC 7231, which has this to say:

The media type definitions for text/* are most recently affected by RFC 6657, which has this to say:

I checked the registration for text/html here. Unsurprisingly, it provides no default values. It does allow a charset parameter, which overrides anything in the content itself.

I propose the following changes:

text/* content and use them where appropriate. This isn't vital, it's just a "might be nice".